Closed: emarinier closed this pull request 3 years ago
File: proksee/database/random_forest_n50_contigcount_l50_totlen_gccontent.joblib
is added, but appears to be empty. Is the file necessary?
random_forest*.joblib is the machine learning model. It is not ASCII-readable, but it can be imported with the Python module joblib. It should be 3-4 MB in size and exist as a raw binary file; I am not sure why it shows as empty. The random_forest*.joblib file evaluates a NumPy vector of genomic assembly attributes and returns the probability of the assembly being good.
Okay, I went digging in the files and it looks like it's there. It's strange that the review shows it as empty, but I guess it's not text-readable.
However, the name of the file should probably be simplified. It's a little bit long.
We should try to resolve this testing error before merging:
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
/home/eric/miniconda3/envs/proksee/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_missing_genomic_attributes
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_invalid_genomic_attributes
/home/eric/projects/proksee-cmd/.tox/py37/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_missing_genomic_attributes
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_invalid_genomic_attributes
/home/eric/projects/proksee-cmd/.tox/py37/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Also getting similar warnings when running the ML code after integrating it into cmd_assemble.py:
/home/eric/miniconda3/envs/proksee/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/home/eric/miniconda3/envs/proksee/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
I ran a few assemblies to test the ML evaluations. I'm seeing some probability values that don't make sense to me. Listeria monocytogenes, paired reads:
Fast Assembly
ML
The probability of the assembly being a good assembly is: 0.02.
Heuristic
WARNING: The N50 is somewhat smaller than expected: 144924
The N50 lower bound is: 53101
PASS: The number of contigs is comparable to similar assemblies: 42
The acceptable number of contigs range is: (14, 127)
WARNING: The L50 is somewhat larger than expected: 7
The L50 upper bound is: 18
PASS: The assembly length is comparable to similar assemblies: 2990143
The acceptable assembly length range is: (2891669, 3189001)
Expert Assembly
ML
The probability of the assembly being a good assembly is: 0.0.
Heuristic
PASS: The N50 is comparable to similar assemblies: 239945
The acceptable N50 range is: (53101, 579512)
PASS: The number of contigs is comparable to similar assemblies: 36
The acceptable number of contigs range is: (14, 127)
PASS: The L50 is comparable to similar assemblies: 5
The acceptable L50 range is: (2, 18)
PASS: The assembly length is comparable to similar assemblies: 2998926
The acceptable assembly length range is: (2891669, 3189001)
It seems really strange to me that the probability of this last assembly being "good" is so low, when everything fits within the range of previously observed RefSeq assemblies for Listeria monocytogenes.
Any thoughts? Maybe a bug?
The quality measurements for the final assembly are:
species = Listeria monocytogenes
n50 = 239945
l50 = 5
num_contigs = 36
assembly_length = 2998926
gc_content = 37.82
gc_content should be a fraction. Maybe try 0.3782 and see if it's acceptable?
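One way to guard against this is to normalize GC content when building the attribute vector. This is a sketch with illustrative names, not the actual Proksee code:

```python
# Sketch (illustrative names): normalize GC content to a [0, 1] fraction
# before building the attribute vector passed to the model.
def normalize_gc(gc):
    """Accept GC content as a percentage (e.g. 37.82) or a fraction (0.3782)."""
    return gc / 100.0 if gc > 1.0 else gc

# n50, num_contigs, l50, assembly_length, gc_content
attributes = [239945, 36, 5, 2998926, normalize_gc(37.82)]
print(round(attributes[-1], 4))
```

A guard like this makes the model input unambiguous regardless of which convention the caller uses (with the caveat that a true fraction above 1.0 is impossible, so the heuristic is safe).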
Yup, it looks like GC content needs to be changed to be [0, 1]. New results look more plausible:
The probability of the assembly being a good assembly is: 0.88.
PASS: The N50 is comparable to similar assemblies: 239945
The acceptable N50 range is: (53101, 579512)
PASS: The number of contigs is comparable to similar assemblies: 36
The acceptable number of contigs range is: (14, 127)
PASS: The L50 is comparable to similar assemblies: 5
The acceptable L50 range is: (2, 18)
PASS: The assembly length is comparable to similar assemblies: 2998926
The acceptable assembly length range is: (2891669, 3189001)
Suppressed the benign warnings by using the warnings library and a few changes:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message='numpy.ufunc size changed')
    warnings.filterwarnings("ignore", message='Trying to unpickle estimator')
    # model loading / prediction code goes inside this block
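As a side note, a quick stdlib-only check (a sketch, not Proksee code) confirms that the message= argument to filterwarnings is matched as a regular expression anchored at the start of the warning text, so a prefix like the ones above is sufficient:

```python
# Sketch: filterwarnings(message=...) matches warnings by a regex anchored
# at the start of the message, so a prefix is enough to suppress them.
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
    warnings.warn("numpy.ufunc size changed, may indicate binary incompatibility")
    warnings.warn("some other warning that should still be seen")

print(len(caught))  # only the unfiltered warning remains
```

The catch_warnings context manager also restores the previous filters on exit, so the suppression stays scoped to the model-loading code.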
Was there no way to solve the issue by specifying versions during the install process (environment.yml and tox.ini)?
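For reference, pinning could look something like this (a hypothetical fragment; the surrounding dependency list is an assumption, but 0.23.2 is the version the model was pickled with according to the warnings above):

```yaml
# Hypothetical environment.yml fragment: pin scikit-learn to the version
# the model was trained and pickled with.
dependencies:
  - python=3.7
  - scikit-learn=0.23.2
```

The tox.ini side would similarly pin scikit-learn==0.23.2 under the [testenv] deps list. The trade-off is that pinning freezes the environment to the training-time version, whereas retraining and re-dumping the model under the current scikit-learn would remove the warning without a pin.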
Ok, looks good to me. 👍
Probably ready to have reviewed by @ericenns
This pull request adds the ability to evaluate an assembly using machine learning techniques.