Closed omerb01 closed 5 years ago
@LachlanStuart @intsco I currently get FAIL lines when I validate the big dataset. I'm trying to figure out what is wrong with my implementation, but I don't understand what they mean:
2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Missing annotations: 492 (FAIL)
2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Incorrect spatial metric: 539 (FAIL)
2019-09-15 18:26:39,457 [ERROR] annotation-pipeline: Incorrect spectral metric: 552 (FAIL)
2019-09-15 18:26:39,458 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2019-09-15 18:26:39,458 [ERROR] annotation-pipeline: Incorrect MSM: 542 (FAIL)
2019-09-15 18:26:39,458 [INFO] annotation-pipeline: FDR changed: 686 (PASS)
2019-09-15 18:26:39,458 [ERROR] annotation-pipeline: FDR changed significantly: 387 (FAIL)
2019-09-15 18:26:39,476 [ERROR] annotation-pipeline: Missing annotations extra info:
formula adduct chaos_ref spatial_ref ... spatial spectral msm fdr
2691 C10H10O2S +Na 0.995377 0.004286 ... NaN NaN NaN NaN
437 C10H10O4 +Na 0.999098 0.404242 ... NaN NaN NaN NaN
1975 C10H12O +Na 0.997347 0.018597 ... NaN NaN NaN NaN
1157 C10H13N5O4 +Na 0.997834 0.149348 ... NaN NaN NaN NaN
1973 C10H14N2O +K 0.992771 0.019105 ... NaN NaN NaN NaN
[5 rows x 12 columns]
2019-09-15 18:26:39,481 [ERROR] annotation-pipeline: Incorrect spatial metric extra info:
spatial spatial_ref error
650 0.799049 0.243515 0.555534
1745 0.433242 0.034039 0.399203
1345 0.431806 0.067442 0.364364
271 0.951845 0.592175 0.359670
453 0.727409 0.391915 0.335494
2019-09-15 18:26:39,487 [ERROR] annotation-pipeline: Incorrect spectral metric extra info:
spectral spectral_ref error
2926 0.957045 0.449652 0.507393
2336 0.976196 0.484345 0.491851
1504 0.934691 0.444531 0.490160
2069 0.925759 0.441505 0.484254
2430 0.944817 0.463925 0.480892
2019-09-15 18:26:39,492 [ERROR] annotation-pipeline: Incorrect MSM extra info:
msm msm_ref error
650 0.793913 0.239710 0.554203
1745 0.425289 0.026867 0.398422
271 0.946849 0.577045 0.369804
1345 0.422658 0.058727 0.363932
453 0.722017 0.381321 0.340696
2019-09-15 18:26:39,509 [ERROR] annotation-pipeline: FDR changed significantly extra info:
formula adduct chaos_ref spatial_ref ... spectral msm fdr fdr_error
2691 C10H10O2S +Na 0.995377 0.004286 ... NaN NaN NaN 2
437 C10H10O4 +Na 0.999098 0.404242 ... NaN NaN NaN 4
1975 C10H12O +Na 0.997347 0.018597 ... NaN NaN NaN 3
1157 C10H13N5O4 +Na 0.997834 0.149348 ... NaN NaN NaN 3
695 C10H14O2 +Na 0.998445 0.227863 ... NaN NaN NaN 4
[5 rows x 13 columns]
2019-09-15 18:26:39,509 [ERROR] annotation-pipeline: 5 checks failed
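As I understand the log above, each check recomputes a metric, compares it column-by-column against a `*_ref` reference column, and counts rows that deviate beyond some tolerance. A minimal sketch of that idea (the helper name, tolerance, and column layout are my assumptions, not the pipeline's actual code):

```python
import pandas as pd

def count_metric_mismatches(merged: pd.DataFrame, metric: str, tol: float = 0.001) -> int:
    """Count rows whose recomputed metric deviates from the reference
    by more than `tol`. Hypothetical helper; column names follow the
    `<metric>` / `<metric>_ref` pattern visible in the log output."""
    error = (merged[metric] - merged[f"{metric}_ref"]).abs()
    return int((error > tol).sum())

# Toy example with made-up values:
df = pd.DataFrame({
    "spatial":     [0.799049, 0.433242, 0.951845],
    "spatial_ref": [0.243515, 0.034039, 0.951845],
})
print(count_metric_mismatches(df, "spatial"))  # 2 rows differ by more than 0.001
```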
Maybe you can review it and spot something wrong.
@omerb01 I can't see anything obviously wrong in the code, and I won't have time today to do a thorough investigation, but perhaps I can give some insight into the likely cause.
The "Chaos" metric is only based on the first peak image. The fact that the test on the Chaos metric passed means that the metrics are being correctly connected back to molecular formulas, and that the first peak image for every annotation is correct.
The "Spectral" and "Spatial" metrics run on all peak images (up to 4). Normally, if these scores only decrease, it means that during image processing either the wrong segment was searched for specific peak images and no data was found, or some peak images were incorrectly split across segments. However, in this case some values are actually *higher* than expected. I haven't seen this problem before - it's possible it's getting the wrong images for the 2nd-4th peaks, or it has the wrong `int` values for the 2nd-4th peaks, causing it to give them higher metric scores.
Normally I would troubleshoot this by viewing all generated images for several annotations and comparing them against METASPACE. The `check_results` function returns an object with a `merged_results` dataframe. If you look for a row that has `spectral != spectral_ref`, then check the index, `formula` and `adduct`, you should be able to match it to the images from `pipeline.get_images()` with the index, and then you can find the METASPACE version here: https://metaspace2020.eu/annotations?ds=2016-09-21_16h06m53s&db=&fdr=0.5&mol=C10H10O4 (change the `mol` parameter in the URL to match the `formula`, and make sure to click the item in the list that has the same `adduct`). In the "Diagnostics" panel you can find the 4 peak images.
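The lookup described above could be scripted roughly like this. The dataframe here is a hypothetical stand-in for `check_results().merged_results` with invented values; in practice you would use the real object and its actual columns:

```python
import pandas as pd

# Hypothetical stand-in for the merged_results dataframe; values are made up.
merged_results = pd.DataFrame({
    "formula":      ["C10H10O4", "C10H12O"],
    "adduct":       ["+Na", "+Na"],
    "spectral":     [0.976196, 0.449652],
    "spectral_ref": [0.484345, 0.449652],
})

# Rows where the recomputed spectral metric disagrees with the reference:
mismatched = merged_results[merged_results["spectral"] != merged_results["spectral_ref"]]

for idx, row in mismatched.iterrows():
    # Build the METASPACE URL for manually comparing the peak images.
    url = ("https://metaspace2020.eu/annotations"
           f"?ds=2016-09-21_16h06m53s&db=&fdr=0.5&mol={row['formula']}")
    print(idx, row["formula"], row["adduct"], url)
```

The index of each mismatched row is what you would then match against the images from `pipeline.get_images()`.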
@LachlanStuart just a minor question: are the mz values of these peaks guaranteed to always be sorted? For example, let's assume we have `formula_i: 123` with 4 peaks: 0, 1, 2, 3. Is it correct that mz(0) < mz(1) < mz(2) < mz(3)?
@omerb01 They are always sorted by mz as long as they are non-zero. The sorting is done in `isocalc_wrapper.py:40`.
However, some formulas have fewer than 4 peaks. For these cases we add additional peaks with their `mz` and `int` set to `0`, so that we can assume there are always 4 peaks. These additional `0`-valued peaks are always appended to the end of the list, so a formula might have the `mz` values `[123, 124, 125, 0]`.
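The padding scheme described above could be sketched like this (the function name and signature are my own illustration, not the actual code in `isocalc_wrapper.py`):

```python
MAX_PEAKS = 4  # assumption based on the discussion above

def pad_peaks(mzs, ints):
    """Pad a formula's peak lists to MAX_PEAKS entries.

    `mzs` is assumed already sorted ascending for real peaks (as done
    in isocalc_wrapper.py). Zero-valued filler peaks are appended at
    the end, so the sorted-by-mz invariant only holds for the non-zero
    prefix of the list.
    """
    n_missing = MAX_PEAKS - len(mzs)
    return mzs + [0.0] * n_missing, ints + [0.0] * n_missing

mzs, ints = pad_peaks([123.0, 124.0, 125.0], [1.0, 0.5, 0.2])
print(mzs)  # [123.0, 124.0, 125.0, 0.0]
```

This is why downstream code can always iterate over exactly 4 peaks per formula, but must treat `mz == 0` as "no peak" rather than a real centroid.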
@LachlanStuart all tested with the "big" dataset and ready for review.
Regarding the huge2 and huge3 datasets: we need different logic for the part that sorts all relevant centroids together. With this patch, centroid databases of any size will be supported.
Note that this implementation currently works with PyWren version 1.0.17; I will submit another PR to update the project to the latest PyWren version and its new features.