proksee-project / proksee-cmd

Repo for Proksee Cmd Line Tools
Apache License 2.0

Merge dev/evaluation into develop #36

Closed emarinier closed 3 years ago

emarinier commented 3 years ago

This pull request adds the ability to evaluate an assembly using machine learning techniques.

emarinier commented 3 years ago

File: proksee/database/random_forest_n50_contigcount_l50_totlen_gccontent.joblib is added, but appears to be empty. Is the file necessary?

asahaman commented 3 years ago

> File: proksee/database/random_forest_n50_contigcount_l50_totlen_gccontent.joblib is added, but appears to be empty. Is the file necessary?

random_forest*.joblib is the machine learning model. It is not ASCII-readable, but it can be imported with the Python module joblib. It should be 3-4 MB in size and exist as a raw (binary) file; I am not sure why it shows as empty. The random_forest*.joblib file evaluates a numpy vector of genomic assembly attributes and returns the probability of the assembly being good.
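For reference, a minimal sketch of the joblib round trip described above. The bundled model file is not available here, so this trains a stand-in RandomForestClassifier on synthetic data; only the dump/load/predict_proba workflow mirrors what proksee does, and the attribute order shown in the comment is assumed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data: fabricated rows of assembly attributes in the
# order suggested by the model's filename:
# [n50, num_contigs, l50, assembly_length, gc_content]
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, 100)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialize and reload, as proksee does with its bundled .joblib file.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

# Evaluate a (1 x 5) numpy vector of attributes; the second column of
# predict_proba is the probability of the "good assembly" class.
probability = loaded.predict_proba(X[:1])[0][1]
print(probability)
```

The file is binary pickle data, which is why it renders as empty/unreadable in the diff view.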

emarinier commented 3 years ago

> File: proksee/database/random_forest_n50_contigcount_l50_totlen_gccontent.joblib is added, but appears to be empty. Is the file necessary?
>
> random_forest*.joblib is the machine learning model. It is not ASCII-readable, but it can be imported with the Python module joblib. It should be 3-4 MB in size and exist as a raw (binary) file; I am not sure why it shows as empty. The random_forest*.joblib file evaluates a numpy vector of genomic assembly attributes and returns the probability of the assembly being good.

Okay, I went digging in the files and it looks like it's there. It's strange that the review shows it as empty, but I guess it's not text-readable.

However, the name of the file should probably be simplified. It's a little bit long.

emarinier commented 3 years ago

We should try to resolve this testing error before merging:

tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
  /home/eric/miniconda3/envs/proksee/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_missing_genomic_attributes
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_invalid_genomic_attributes
  /home/eric/projects/proksee-cmd/.tox/py37/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)

tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_evaluate_probability
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_missing_genomic_attributes
tests/test_machine_learning_evaluator.py::TestMachineLearningEvaluator::test_invalid_genomic_attributes
  /home/eric/projects/proksee-cmd/.tox/py37/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)
emarinier commented 3 years ago

Also getting similar warnings when running ML code after integration into cmd_assemble.py:

/home/eric/miniconda3/envs/proksee/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/home/eric/miniconda3/envs/proksee/lib/python3.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.23.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
emarinier commented 3 years ago

I ran a few assemblies to test the ML evaluations. I'm seeing some probability values that don't make sense to me. Listeria monocytogenes, paired reads:

Fast Assembly

ML

The probability of the assembly being a good assembly is: 0.02.

Heuristic

WARNING: The N50 is somewhat smaller than expected: 144924
         The N50 lower bound is: 53101
PASS: The number of contigs is comparable to similar assemblies: 42
      The acceptable number of contigs range is: (14, 127)
WARNING: The L50 is somewhat larger than expected: 7
         The L50 upper bound is: 18
PASS: The assembly length is comparable to similar assemblies: 2990143
      The acceptable assembly length range is: (2891669, 3189001)

Expert Assembly

ML

The probability of the assembly being a good assembly is: 0.0.

Heuristic

PASS: The N50 is comparable to similar assemblies: 239945
      The acceptable N50 range is: (53101, 579512)
PASS: The number of contigs is comparable to similar assemblies: 36
      The acceptable number of contigs range is: (14, 127)
PASS: The L50 is comparable to similar assemblies: 5
      The acceptable L50 range is: (2, 18)
PASS: The assembly length is comparable to similar assemblies: 2998926
      The acceptable assembly length range is: (2891669, 3189001)

It seems really strange to me that the probability of this last assembly being "good" is so low, when everything fits within the range of previously observed RefSeq assemblies for Listeria monocytogenes.

Any thoughts? Maybe a bug?

The quality measurements for the final assembly are:

species = Listeria monocytogenes
n50 = 239945
l50 = 5
num_contigs = 36
assembly_length = 2998926
gc_content = 37.82
asahaman commented 3 years ago

> I ran a few assemblies to test the ML evaluations. I'm seeing some probability values that don't make sense to me. [...]
>
> The quality measurements for the final assembly are:
>
> species = Listeria monocytogenes
> n50 = 239945
> l50 = 5
> num_contigs = 36
> assembly_length = 2998926
> gc_content = 37.82

gc_content should be a fraction. Maybe try 0.3782 and see if it's acceptable?
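The fix amounts to rescaling the percentage before building the feature vector. A short sketch using the values reported above (variable names are illustrative, not proksee's actual code):

```python
# Attributes reported for the expert assembly in the comment above.
n50 = 239945
l50 = 5
num_contigs = 36
assembly_length = 2998926
gc_content_percent = 37.82

# The model was trained with GC content as a fraction in [0, 1], so the
# percentage must be divided by 100 before evaluation.
gc_content = gc_content_percent / 100
features = [n50, num_contigs, l50, assembly_length, gc_content]
print(features)
```

Passing 37.82 instead of 0.3782 puts the point far outside the training distribution, which would explain the near-zero probabilities.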

emarinier commented 3 years ago

Yup, it looks like GC content needs to be rescaled to [0, 1]. The new results look more plausible:

The probability of the assembly being a good assembly is: 0.88.

PASS: The N50 is comparable to similar assemblies: 239945
      The acceptable N50 range is: (53101, 579512)
PASS: The number of contigs is comparable to similar assemblies: 36
      The acceptable number of contigs range is: (14, 127)
PASS: The L50 is comparable to similar assemblies: 5
      The acceptable L50 range is: (2, 18)
PASS: The assembly length is comparable to similar assemblies: 2998926
      The acceptable assembly length range is: (2891669, 3189001)
asahaman commented 3 years ago

> We should try to resolve this testing error before merging:
>
> [numpy/sklearn warning output from tests/test_machine_learning_evaluator.py, quoted in full in the earlier comment]

Suppressed the benign warnings using the warnings library and a few changes:

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message='numpy.ufunc size changed')
        warnings.filterwarnings("ignore", message='Trying to unpickle estimator')
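A self-contained sketch of this suppression pattern. The filtered messages match the warnings above; the warnings raised here are simulated stand-ins for what sklearn/numpy emit during model loading:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filter only the two known-benign messages; the message argument is a
    # regex matched against the start of the warning text.
    warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
    warnings.filterwarnings("ignore", message="Trying to unpickle estimator")

    # Model loading (e.g. joblib.load) would go here. Simulate the warning
    # sklearn emits when the pickle was created by an older version:
    warnings.warn("Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2")
    warnings.warn("some other warning")  # not filtered, still surfaces

print(len(caught))  # 1: only the unfiltered warning is recorded
```

Because the filters are installed inside catch_warnings(), they are removed again when the block exits, so unrelated code is unaffected.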

emarinier commented 3 years ago

> Suppressed the benign warnings using the warnings library and a few changes:
>
>     with warnings.catch_warnings():
>         warnings.filterwarnings("ignore", message='numpy.ufunc size changed')
>         warnings.filterwarnings("ignore", message='Trying to unpickle estimator')

Was there no way to solve the issue by specifying versions during the install process (environment.yml and tox.ini)?
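For what it's worth, pinning scikit-learn to the version the model was trained with (0.23.2, per the warning text) would avoid the unpickle warnings at the source. A hypothetical environment.yml fragment; the surrounding dependency list is illustrative, not the repo's actual file:

```yaml
# Pin scikit-learn to the version that produced the bundled .joblib model.
dependencies:
  - python=3.7
  - numpy
  - joblib
  - scikit-learn=0.23.2
```

The trade-off is being locked to an older sklearn; retraining and re-serializing the model under the current version would be the longer-term fix.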

emarinier commented 3 years ago

Ok, looks good to me. 👍

Probably ready to be reviewed by @ericenns.