glasgowcompbio / vimms

A programmable and modular LC/MS simulator in Python
MIT License

Evaluation code throws NaN when computing intensity proportion #271

Closed joewandy closed 1 year ago

joewandy commented 1 year ago

I'd appreciate some help with this, @mcbrider5002, as I can't easily follow the evaluation code. I got the following output from RealEvaluator, which is stopping me from analysing our last proteomics experiment. There is a division by zero somewhere, and we end up with NaN for the intensity proportion. This comes from the RealEvaluator class after passing it the boxes from OpenMS peak picking. At first glance the boxes look fine, but it's possible that they are not set up correctly, or that they have no fragmentation hits?

/Users/joewandy/Work/git/vimms/vimms/scripts/../../vimms/Evaluation.py:210: RuntimeWarning: invalid value encountered in divide
/Users/joewandy/Work/git/vimms/vimms/scripts/../../vimms/Evaluation.py:214: RuntimeWarning: invalid value encountered in divide
/Users/joewandy/Work/git/vimms/vimms/scripts/../../vimms/Evaluation.py:218: RuntimeWarning: invalid value encountered in divide

2023-08-21 22:31:16.181 | DEBUG    | mass_spec_utils.data_import.mzml:_load_file:166 - Loaded 50725 scans
Number of fragmentations: [36795]
Cumulative coverage: [222]
Cumulative coverage proportion: [0.2792452830188679]
Cumulative intensity proportion: [nan]
Cumulative intensity proportion of covered spectra: [0.6270934134534453]
Times covered: {0: 673, 1: 222}
Times fragmented: {0: 653, 1: 40, 2: 190, 3: 3, 4: 9}
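For readers unfamiliar with this warning, here is a minimal sketch of how a single 0/0 element in a NumPy division produces both the RuntimeWarning and a NaN mean. The array names and values are hypothetical and are not taken from Evaluation.py or from this dataset.

```python
import warnings

import numpy as np

# Hypothetical per-box values (assumption, for illustration only).
max_frag_intensity = np.array([5.0e4, 0.0, 2.0e5])
max_intensity = np.array([1.0e5, 0.0, 4.0e5])  # second box: nothing observed

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Element-wise division; the 0.0 / 0.0 entry becomes nan and NumPy
    # emits a RuntimeWarning instead of raising an exception.
    proportions = max_frag_intensity / max_intensity

print(proportions)
print([str(w.message) for w in caught])

# A single nan poisons an ordinary mean, which is how one bad box can turn
# a whole summary statistic into nan.
print(np.mean(proportions))
# nanmean ignores the nan entries, but that hides the bad box rather than
# deciding whether it should be counted at all.
print(np.nanmean(proportions))
```

The exact wording of the warning message varies between NumPy versions.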

Steps to reproduce:

  1. Go into your virtual environment
  2. Switch to proteomics branch in VIMMS repo
  3. Run this
$ python openms_evaluate.py /Users/joewandy/Library/CloudStorage/OneDrive-SharedLibraries-UniversityofGlasgow/Vinny\ Davies\ -\ CLDS\ Metabolomics\ Project/Experimental_Results/20230815_proteomics_initial/results/fullscan_hela_0.mzML /Users/joewandy/Library/CloudStorage/OneDrive-SharedLibraries-UniversityofGlasgow/Vinny\ Davies\ -\ CLDS\ Metabolomics\ Project/Experimental_Results/20230815_proteomics_initial/Instrument_method_files/HELA_soln_20ng_FTMS_HCD_1.mzML

Adjust the paths as necessary. The script openms_evaluate.py takes two arguments, the two .mzML paths shown in the command above.

mcbrider5002 commented 1 year ago

When you call summarise, you should set min_intensity to whatever the minimum fragmentation intensity was. I've made a slight change to the code (https://github.com/glasgowcompbio/vimms/commit/d30af3102419732df4c40480739ede74a8d952e8) so it won't spit out NaNs any more, but I think it'll still underestimate the performance of all methods if you don't set the parameter.

There are boxes with zero observed max intensity (not fragmentation intensity). When you call add_info, the boxes are overlaid on the .mzML file, and for each (box, mzML) pair it records the number of fragmentations, the maximum intensity and the maximum fragmentation intensity, filling these out in the (n, m, 3) matrix chem_info. When you call evaluation_report, which your call to summarise implicitly does, it collapses this matrix into a report of summary stats. To compute intensity proportion it divides the max fragmentation intensity by the max total observed intensity, so a box with zero observed max intensity gives a division by zero.

For resimulated data you don't normally have to set the min_intensity parameter to exclude boxes below the fragmentation intensity, because we typically don't resimulate any RoI that would be below the minimum fragmentation intensity. But it is important for real data, as otherwise I think we count a score of 0 for each box that isn't possible to fragment due to the threshold, and we include that in the total.
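To make the effect of the threshold concrete, here is a simplified sketch of collapsing a chem_info-style array into an intensity proportion. It is not the ViMMS implementation: the function name, the order of the three values on the last axis, and the per-file (non-cumulative) averaging are all assumptions made for illustration.

```python
import numpy as np


def intensity_proportion(chem_info, min_intensity=0.0):
    """Mean of (max frag intensity / max intensity) per file.

    chem_info is an (n_boxes, n_files, 3) float array. The last axis is
    ASSUMED here to hold [times fragmented, max intensity,
    max fragmentation intensity]; the real layout may differ.
    """
    max_intensity = chem_info[:, :, 1]
    max_frag = chem_info[:, :, 2]

    # Only boxes that were observed and reach the threshold are scored.
    eligible = (max_intensity > 0) & (max_intensity >= min_intensity)

    # Divide only where eligible, so 0/0 is never evaluated.
    ratios = np.divide(
        max_frag,
        max_intensity,
        out=np.zeros_like(max_intensity, dtype=float),
        where=eligible,
    )

    counts = eligible.sum(axis=0)
    totals = ratios.sum(axis=0)
    # A file with no eligible boxes reports 0.0 rather than nan.
    return np.divide(
        totals, counts, out=np.zeros(counts.shape, dtype=float), where=counts > 0
    )


# Four hypothetical boxes, one file.
chem_info = np.array([
    [[2, 1.0e5, 5.0e4]],  # fragmented at half of its max intensity
    [[0, 0.0, 0.0]],      # never observed: would be 0/0
    [[0, 1.0e3, 0.0]],    # observed, but too weak to ever be fragmented
    [[1, 4.0e5, 4.0e5]],  # fragmented at its max intensity
])

print(intensity_proportion(chem_info))                       # [0.5]
print(intensity_proportion(chem_info, min_intensity=5.0e3))  # [0.75]
```

Without the threshold, the weak third box contributes a score of 0 and drags the mean down from 0.75 to 0.5, which is the underestimate described above.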

joewandy commented 1 year ago

Fixed by Ross.