metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

Fixes #60

Closed LachlanStuart closed 4 years ago

LachlanStuart commented 4 years ago

One big change: the molecular DBs are no longer available through the METASPACE API, so I've just included them as CSV files that will be converted/uploaded with the existing upload_mol_dbs_from_dir call.

For the rest of the changes, see the commit messages / comments.

LachlanStuart commented 4 years ago

@omerb01 Thanks for the review.

> * Please update "pywren-annotation-pipeline" notebook code as well.

Done.

> I noticed that the numbers changed a bit

That's really strange. The new .csv files actually came from the same data source. They were just dumped to csv instead of pickle. Unfortunately the old API is completely gone, so there's no way for me to go back and check. If the benchmark 1 test still passes, then it should be fine. I think it's unlikely, but if somehow the old code was causing duplicate formulas, it will be clear when I do another comparison of the costs.
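
A quick way to check for duplicate formulas directly in one of the CSV files is sketched below. It assumes a `formula` column; the column name, the sample rows, and the `count_duplicate_formulas` helper are all hypothetical and not taken from the repository. Several molecules can legitimately share one formula, so repeated rows are not an error by themselves; the check only shows whether deduplication could explain a change in counts.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for one molecular DB CSV file;
# the real files' column names and contents may differ.
SAMPLE_CSV = """id,name,formula
1,Glucose,C6H12O6
2,Fructose,C6H12O6
3,Urea,CH4N2O
"""


def count_duplicate_formulas(csv_file, column="formula"):
    """Return {formula: count} for every formula that appears more than once."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[column].strip() for row in reader)
    return {formula: n for formula, n in counts.items() if n > 1}


duplicates = count_duplicate_formulas(io.StringIO(SAMPLE_CSV))
print(duplicates)
```

For a real file, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) instead of the `io.StringIO` sample.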

> * Regarding `pipeline.get_images()`

I only added the pywren code because it was taking too much time and memory to download the unfiltered images for huge4. I need get_images for debugging, but it doesn't make sense to include it in any of the benchmarks because it doesn't actually match the behavior of the equivalent Serverful METASPACE pipeline stage, which converts the images to PNG and saves the results_df to a database. I suggest we exclude get_images from the stats, because it's only temporary code and it can't be used for comparison.

> In addition, I think there is a minor bug in PyWren that deletes all cloud objects after the annotation function finishes. We will fix that internally.

It seems this only happens with "data_cleaner": true in the config.

omerb01 commented 4 years ago

@LachlanStuart I reran the current master branch on the "big" dataset. The number of metrics now seems to be the same as before (around 60k), but it fails on check_results():

2020-03-23 19:59:24,868 [ERROR] annotation-pipeline: Missing annotations: 14 (FAIL)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect spatial metric: 0 (PASS)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect spectral metric: 5 (PASS)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: Incorrect MSM: 0 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: FDR changed: 64 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: FDR changed significantly: 9 (PASS)
2020-03-23 19:59:24,889 [ERROR] annotation-pipeline: Missing annotations extra info:
             formula adduct  chaos  ...  spectral_ref   msm_ref  fdr_ref
ion_i                               ...                                 
NaN          C7H10N2     +H    NaN  ...      0.978575  0.012890     0.50
NaN          C7H12O2     +K    NaN  ...      0.953472  0.012372     0.50
NaN         C7H13NO3     +K    NaN  ...      0.964919  0.004456     0.50
NaN          C7H14O5    +Na    NaN  ...      0.993197  0.832082     0.05
NaN    C7H15Cl2N2O3P    +Na    NaN  ...      0.789780  0.011702     0.10
[5 rows x 12 columns]
2020-03-23 19:59:24,889 [ERROR] annotation-pipeline: 1 checks failed
{'merged_results':               formula adduct     chaos  ...  spectral_ref   msm_ref  fdr_ref
ion_i                                   ...                                 
5349849.0     C10H10O    +Na  0.987151  ...      0.970307  0.005981     0.20
4585486.0    C10H10O2    +Na  0.998970  ...      0.976527  0.381983     0.05
2294716.0   C10H10O2S    +Na  0.995377  ...      0.959262  0.004092     0.20
5159028.0  C10H10O2S2    +Na  0.986087  ...      0.946251  0.011946     0.10
2866927.0    C10H10O3    +Na  0.998779  ...      0.970688  0.083843     0.10
               ...    ...       ...  ...           ...       ...      ...
190394.0      C9H9NO2    +Na  0.996646  ...      0.973598  0.001281     0.20
2288405.0     C9H9NO3     +H  0.996897  ...      0.972755  0.049399     0.20
1144512.0     C9H9NO3    +Na  0.988434  ...      0.973728  0.005501     0.20
4391200.0      CH3O5P    +Na  0.987805  ...      0.994333  0.020734     0.10
1718689.0      CH4N2O     +K  0.994900  ...      0.978025  0.005133     0.50
[3049 rows x 12 columns], 'missing_results':              formula adduct  chaos  ...  spectral_ref   msm_ref  fdr_ref
ion_i                               ...                                 
NaN          C7H10N2     +H    NaN  ...      0.978575  0.012890     0.50
NaN          C7H12O2     +K    NaN  ...      0.953472  0.012372     0.50
NaN         C7H13NO3     +K    NaN  ...      0.964919  0.004456     0.50
NaN          C7H14O5    +Na    NaN  ...      0.993197  0.832082     0.05
NaN    C7H15Cl2N2O3P    +Na    NaN  ...      0.789780  0.011702     0.10
NaN          C8H10O5    +Na    NaN  ...      0.974511  0.015889     0.10
NaN           C8H10S     +K    NaN  ...      0.954763  0.005440     0.50
NaN       C8H16N2O4S    +Na    NaN  ...      0.783592  0.077055     0.10
NaN           C8H7NO    +Na    NaN  ...      0.989191  0.444907     0.05
NaN         C9H17NO5     +K    NaN  ...      0.739612  0.161133     0.20
NaN          C9H18O2    +Na    NaN  ...      0.977522  0.222603     0.05
NaN        C9H21NO7P    +Na    NaN  ...      0.969531  0.002814     0.20
NaN           C9H6O5     +K    NaN  ...      0.953029  0.005638     0.50
NaN           C9H8O4    +Na    NaN  ...      0.972199  0.023245     0.10
[14 rows x 12 columns], 'spatial_wrong': Empty DataFrame
Columns: [spatial, spatial_ref, error]
Index: [], 'spectral_wrong':            spectral  spectral_ref     error
ion_i                                      
2472037.0  0.977071      0.969428  0.007643
1135330.0  0.977224      0.969633  0.007591
2854943.0  0.977883      0.970511  0.007372
3037315.0  0.980123      0.973498  0.006625
923977.0   0.983130      0.977506  0.005624, 'chaos_wrong': Empty DataFrame
Columns: [chaos, chaos_ref, error]
Index: [], 'msm_wrong': Empty DataFrame
Columns: [msm, msm_ref, error]
Index: [], 'fdr_error':               formula adduct     chaos  ...   msm_ref  fdr_ref  fdr_error
ion_i                                   ...                              
2868858.0   C10H18O3S     +K  0.989025  ...  0.025889      0.2          1
4205259.0    C10H18O5     +K  0.999479  ...  0.790682      0.2          1
1532938.0  C10H20N6O4     +H  0.999678  ...  0.733073      0.2          1
3256031.0   C11H17NO3    +Na  0.971815  ...  0.006601      0.1          1
5548632.0    C11H20O7     +K  0.999328  ...  0.696685      0.2          1
               ...    ...       ...  ...       ...      ...        ...
1907656.0     C9H18OS    +Na  0.981548  ...  0.006613      0.1          1
NaN         C9H21NO7P    +Na       NaN  ...  0.002814      0.2          2
NaN            C9H6O5     +K       NaN  ...  0.005638      0.5          1
NaN            C9H8O4    +Na       NaN  ...  0.023245      0.1          3
2669422.0       C9H9N     +H  0.999379  ...  0.737288      0.2          1
[64 rows x 13 columns]}
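
The rows with `ion_i = NaN` in the output above appear to be reference annotations that have no counterpart in the new results. A minimal sketch of that kind of keyed comparison follows, assuming annotations are matched on the (formula, adduct) pair. This is not the pipeline's actual `check_results` implementation; `find_missing` is a hypothetical helper, and the rows are a small illustrative subset of the reference values shown in the log.

```python
# Illustrative reference rows (formula, adduct and msm_ref copied from the log above).
reference = [
    {"formula": "C7H14O5", "adduct": "+Na", "msm_ref": 0.832082},
    {"formula": "C8H7NO", "adduct": "+Na", "msm_ref": 0.444907},
    {"formula": "CH4N2O", "adduct": "+K", "msm_ref": 0.005133},
]

# Hypothetical new run that only produced one of the three annotations.
new_run = [
    {"formula": "CH4N2O", "adduct": "+K"},
]


def find_missing(reference_rows, new_rows):
    """Return reference rows whose (formula, adduct) key is absent from new_rows."""
    seen = {(row["formula"], row["adduct"]) for row in new_rows}
    return [
        row for row in reference_rows
        if (row["formula"], row["adduct"]) not in seen
    ]


missing = find_missing(reference, new_run)
for row in missing:
    print(row["formula"], row["adduct"], row["msm_ref"])
```

Listing the missing keys this way makes it easier to see whether the gaps cluster around particular formulas or adducts, which would point at the database contents rather than the metric calculation.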
LachlanStuart commented 4 years ago

@omerb01 That's strange. Have you rerun build_database and calculate_centroids since updating? Is it possible that PyWren is picking up old files in your bucket?

I deleted all files in my bucket and reran the pipeline, and it passed, but I only had 10k results:

> len(results_df)
10879

> checked_results = pipeline.check_results()
Unauthorized. Only public but not private datasets will be accessible.

2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Missing annotations: 0 (PASS)
2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Incorrect spatial metric: 0 (PASS)
2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Incorrect spectral metric: 5 (PASS)
2020-03-24 15:03:40,855 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2020-03-24 15:03:40,855 [INFO] annotation-pipeline: Incorrect MSM: 0 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: FDR changed: 54 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: FDR changed significantly: 0 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: All checks pass

This is in experiment-1-typical.ipynb with input_config_big.json and no local changes.

omerb01 commented 4 years ago

@LachlanStuart I committed a minor fix for the DBs path to the master branch and reran the entire workload, but it still produces the same failure messages that I attached above. Also, I'm not sure why you observe only ~10k metrics while I observe ~60k.