metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

Segment centroids in chunks #43

Closed omerb01 closed 5 years ago

omerb01 commented 5 years ago

Regarding the huge2 and huge3 datasets: we need different logic for the part that sorts all relevant centroids together. With this patch, a centroids database of any size will be supported.

Note that this implementation currently works with PyWren version 1.0.17; I will submit another PR to update the project to the most recent PyWren version and its new features.
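The chunked approach described above can be sketched roughly as follows (all names here are illustrative, not taken from the patch): segment mz boundaries are chosen up front from quantiles, each chunk of the centroids table is routed to its segment independently, and only the small per-segment pieces are ever sorted in one place.

```python
import numpy as np
import pandas as pd

def segment_centroids_in_chunks(centroids_df, n_segments, chunk_size=100_000):
    """Hypothetical sketch: split a centroids table into mz-ordered
    segments without sorting the whole table in a single worker."""
    # Pick segment boundaries from mz quantiles
    bounds = np.quantile(centroids_df.mz.values, np.linspace(0, 1, n_segments + 1))
    bounds[-1] = np.inf  # make the last segment right-open

    segments = [[] for _ in range(n_segments)]
    for start in range(0, len(centroids_df), chunk_size):
        chunk = centroids_df.iloc[start:start + chunk_size]
        # Route each row of the chunk to its segment by mz
        seg_idx = np.searchsorted(bounds, chunk.mz.values, side='right') - 1
        seg_idx = np.clip(seg_idx, 0, n_segments - 1)
        for i in range(n_segments):
            segments[i].append(chunk[seg_idx == i])

    # Each segment is now small enough to sort on its own
    return [pd.concat(parts).sort_values('mz') for parts in segments]
```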

omerb01 commented 5 years ago

@LachlanStuart @intsco currently I get FAIL lines when I validate the big dataset. I'm trying to figure out what is wrong with my implementation, but I don't understand what they mean:

2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Missing annotations: 492 (FAIL)
2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Incorrect spatial metric: 539 (FAIL)
2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Incorrect spectral metric: 552 (FAIL)
2019-09-15 18:26:39,458 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2019-09-15 18:26:39,458 [ERROR] annotation-pipeline: Incorrect MSM: 542 (FAIL)
2019-09-15 18:26:39,458 [INFO] annotation-pipeline: FDR changed: 686 (PASS)
2019-09-15 18:26:39,458 [ERROR] annotation-pipeline: FDR changed significantly: 387 (FAIL)
2019-09-15 18:26:39,476 [ERROR] annotation-pipeline: Missing annotations extra info:
         formula adduct  chaos_ref  spatial_ref  ...  spatial  spectral  msm  fdr
2691   C10H10O2S    +Na   0.995377     0.004286  ...      NaN       NaN  NaN  NaN
437     C10H10O4    +Na   0.999098     0.404242  ...      NaN       NaN  NaN  NaN
1975     C10H12O    +Na   0.997347     0.018597  ...      NaN       NaN  NaN  NaN
1157  C10H13N5O4    +Na   0.997834     0.149348  ...      NaN       NaN  NaN  NaN
1973   C10H14N2O     +K   0.992771     0.019105  ...      NaN       NaN  NaN  NaN

[5 rows x 12 columns]

2019-09-15 18:26:39,481 [ERROR] annotation-pipeline: Incorrect spatial metric extra info:
       spatial  spatial_ref     error
650   0.799049     0.243515  0.555534
1745  0.433242     0.034039  0.399203
1345  0.431806     0.067442  0.364364
271   0.951845     0.592175  0.359670
453   0.727409     0.391915  0.335494

2019-09-15 18:26:39,487 [ERROR] annotation-pipeline: Incorrect spectral metric extra info:
      spectral  spectral_ref     error
2926  0.957045      0.449652  0.507393
2336  0.976196      0.484345  0.491851
1504  0.934691      0.444531  0.490160
2069  0.925759      0.441505  0.484254
2430  0.944817      0.463925  0.480892

2019-09-15 18:26:39,492 [ERROR] annotation-pipeline: Incorrect MSM extra info:
           msm   msm_ref     error
650   0.793913  0.239710  0.554203
1745  0.425289  0.026867  0.398422
271   0.946849  0.577045  0.369804
1345  0.422658  0.058727  0.363932
453   0.722017  0.381321  0.340696

2019-09-15 18:26:39,509 [ERROR] annotation-pipeline: FDR changed significantly extra info:
         formula adduct  chaos_ref  spatial_ref  ...  spectral  msm  fdr  fdr_error
2691   C10H10O2S    +Na   0.995377     0.004286  ...       NaN  NaN  NaN          2
437     C10H10O4    +Na   0.999098     0.404242  ...       NaN  NaN  NaN          4
1975     C10H12O    +Na   0.997347     0.018597  ...       NaN  NaN  NaN          3
1157  C10H13N5O4    +Na   0.997834     0.149348  ...       NaN  NaN  NaN          3
695     C10H14O2    +Na   0.998445     0.227863  ...       NaN  NaN  NaN          4

[5 rows x 13 columns]

2019-09-15 18:26:39,509 [ERROR] annotation-pipeline: 5 checks failed

maybe you can review it and spot something wrong

LachlanStuart commented 5 years ago

@omerb01 I can't see anything obviously wrong in the code, and I won't have time today for a thorough investigation, but I can possibly give some insight into what might be the cause.

The "Chaos" metric is only based on the first peak image. The fact that the test on the Chaos metric passed means that the metrics are being correctly connected back to molecular formulas, and that the first peak image for every annotation is correct.

The "Spectral" and "Spatial" metrics run on all peak images (up to 4). Normally, if these scores only decrease, it means that during image processing either it looked in the wrong segment for specific peak images and found no data, or some peak images were incorrectly split across segments. However, in this case some values are actually higher than expected. I haven't seen this problem before - it's possible it's getting the wrong images for the 2nd-4th peak images, or that it has the wrong int values for the 2nd-4th peaks, causing it to give them higher metric scores.
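If peaks were split across the wrong segments, one quick consistency check is to verify that every centroid's mz lies inside its segment's range. A minimal sketch, assuming per-segment DataFrames with an mz column and a matching list of (lo, hi) bounds (both hypothetical names, not from the repo):

```python
import pandas as pd  # segments are assumed to be pandas DataFrames

def check_segment_bounds(segment_dfs, bounds):
    """Return (segment_index, misplaced_count) pairs for segments that
    contain centroids outside their assigned [lo, hi) mz range."""
    bad = []
    for i, (df, (lo, hi)) in enumerate(zip(segment_dfs, bounds)):
        misplaced = df[(df.mz < lo) | (df.mz >= hi)]
        if len(misplaced):
            bad.append((i, len(misplaced)))
    return bad  # an empty list means all segments are consistent
```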

Normally I would troubleshoot this by viewing all generated images for several annotations and comparing them against METASPACE. The check_results function returns an object with a 'merged_results' dataframe. Look for a row where spectral != spectral_ref, then note its index, formula and adduct: the index lets you match it to the images from pipeline.get_images(), and you can find the METASPACE version here: https://metaspace2020.eu/annotations?ds=2016-09-21_16h06m53s&db=&fdr=0.5&mol=C10H10O4 (change the mol parameter in the URL to match the formula, and make sure to click the item in the list that has the same adduct). The 4 peak images are in the "Diagnostics" panel.
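The lookup described above could be sketched like this, assuming merged_results has spectral and spectral_ref columns as shown in the logs (the helper name is illustrative, not part of the pipeline):

```python
import numpy as np
import pandas as pd

def spectral_mismatches(merged_results, tol=1e-6):
    """List annotations whose spectral score disagrees with the
    reference, so their images can be compared against METASPACE."""
    df = merged_results
    # equal_nan=True keeps rows where both sides are NaN out of the
    # mismatch list; rows with NaN on only one side are still flagged
    mismatch = ~np.isclose(df.spectral, df.spectral_ref, atol=tol, equal_nan=True)
    return df.loc[mismatch, ['formula', 'adduct', 'spectral', 'spectral_ref']]
```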

omerb01 commented 5 years ago

@LachlanStuart just a minor question: do these peaks guarantee that their mz values are always sorted? For example, assume we have formula_i: 123 with 4 peaks: 0, 1, 2, 3. Is it correct that mz(0) < mz(1) < mz(2) < mz(3)?

LachlanStuart commented 5 years ago

@omerb01 They are always sorted by mz as long as they are non-zero. The sorting is done in isocalc_wrapper.py:40.

However, for some formulas there are fewer than 4 peaks. For these cases we add additional peaks with their mz and int set to 0, so that we can assume that there are always 4 peaks. These additional 0-valued peaks are always added to the end of the list, so a formula might have the mz values [123, 124, 125, 0].
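That padding rule can be illustrated with a small sketch (the real logic lives in isocalc_wrapper.py; this helper is hypothetical), where each peak is an (mz, int) tuple:

```python
def pad_peaks(peaks, n_peaks=4):
    """Sort a formula's peaks by mz and pad the list to a fixed length
    with (mz=0, int=0) entries appended at the end."""
    peaks = sorted(peaks, key=lambda p: p[0])       # real peaks, sorted by mz
    peaks += [(0.0, 0.0)] * (n_peaks - len(peaks))  # zero-valued padding
    return peaks[:n_peaks]
```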

omerb01 commented 5 years ago

@LachlanStuart all tested with the "big" dataset and ready for review