Closed LachlanStuart closed 4 years ago
@omerb01 Thanks for the review.
* Please update "pywren-annotation-pipeline" notebook code as well.
Done.
I noticed that the numbers changed a bit
That's really strange. The new .csv files actually came from the same data source. They were just dumped to csv instead of pickle. Unfortunately the old API is completely gone, so there's no way for me to go back and check. If the benchmark 1 test still passes, then it should be fine. I think it's unlikely, but if somehow the old code was causing duplicate formulas, it will be clear when I do another comparison of the costs.
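If it helps, the duplicate-formula question could also be checked directly on the CSV files rather than waiting for the cost comparison. This is only a minimal sketch: the helper name `count_duplicate_formulas` and the `formula` column name are assumptions for illustration, not something taken from the repository, and the sample data is a toy stand-in for a real molecular DB file.

```python
import io

import pandas as pd


def count_duplicate_formulas(csv_source, column="formula"):
    """Count rows whose formula already appeared earlier in the CSV.

    `column` is an assumed column name; adjust it to the real CSV header.
    """
    df = pd.read_csv(csv_source)
    # duplicated(keep="first") flags every repeat after the first occurrence
    return int(df[column].duplicated(keep="first").sum())


# Toy stand-in for a molecular DB CSV (not real data)
sample_csv = io.StringIO("formula\nC7H10N2\nC7H12O2\nC7H10N2\n")
n_dupes = count_duplicate_formulas(sample_csv)
print(n_dupes)
```

A non-zero count on any of the new CSV files would suggest the duplicates come from the data dump rather than from the old code path.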
* regarding `pipeline.get_images()`
I only added the PyWren code because downloading the unfiltered images for huge4 was taking too much time and memory. I need `get_images` for debugging, but it doesn't make sense to include it in any of the benchmarks because it doesn't actually match the behavior of the equivalent Serverful METASPACE pipeline stage, which converts the images to PNG and saves the `results_df` to a database. I suggest we exclude `get_images` from the stats, because it's only temporary code and it can't be used for comparison.
In addition, I think there is a minor bug in PyWren that deletes all cloud objects after the annotation function runs. We will fix that internally.
It seems this only happens with `"data_cleaner": true` in the config.
@LachlanStuart I reran the current master branch on the "big" dataset. It now seems to be the same as before (around 60k metrics), but it fails on `check_results()`:
2020-03-23 19:59:24,868 [ERROR] annotation-pipeline: Missing annotations: 14 (FAIL)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect spatial metric: 0 (PASS)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect spectral metric: 5 (PASS)
2020-03-23 19:59:24,868 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: Incorrect MSM: 0 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: FDR changed: 64 (PASS)
2020-03-23 19:59:24,869 [INFO] annotation-pipeline: FDR changed significantly: 9 (PASS)
2020-03-23 19:59:24,889 [ERROR] annotation-pipeline: Missing annotations extra info:
formula adduct chaos ... spectral_ref msm_ref fdr_ref
ion_i ...
NaN C7H10N2 +H NaN ... 0.978575 0.012890 0.50
NaN C7H12O2 +K NaN ... 0.953472 0.012372 0.50
NaN C7H13NO3 +K NaN ... 0.964919 0.004456 0.50
NaN C7H14O5 +Na NaN ... 0.993197 0.832082 0.05
NaN C7H15Cl2N2O3P +Na NaN ... 0.789780 0.011702 0.10
[5 rows x 12 columns]
2020-03-23 19:59:24,889 [ERROR] annotation-pipeline: 1 checks failed
{'merged_results': formula adduct chaos ... spectral_ref msm_ref fdr_ref
ion_i ...
5349849.0 C10H10O +Na 0.987151 ... 0.970307 0.005981 0.20
4585486.0 C10H10O2 +Na 0.998970 ... 0.976527 0.381983 0.05
2294716.0 C10H10O2S +Na 0.995377 ... 0.959262 0.004092 0.20
5159028.0 C10H10O2S2 +Na 0.986087 ... 0.946251 0.011946 0.10
2866927.0 C10H10O3 +Na 0.998779 ... 0.970688 0.083843 0.10
... ... ... ... ... ... ...
190394.0 C9H9NO2 +Na 0.996646 ... 0.973598 0.001281 0.20
2288405.0 C9H9NO3 +H 0.996897 ... 0.972755 0.049399 0.20
1144512.0 C9H9NO3 +Na 0.988434 ... 0.973728 0.005501 0.20
4391200.0 CH3O5P +Na 0.987805 ... 0.994333 0.020734 0.10
1718689.0 CH4N2O +K 0.994900 ... 0.978025 0.005133 0.50
[3049 rows x 12 columns], 'missing_results': formula adduct chaos ... spectral_ref msm_ref fdr_ref
ion_i ...
NaN C7H10N2 +H NaN ... 0.978575 0.012890 0.50
NaN C7H12O2 +K NaN ... 0.953472 0.012372 0.50
NaN C7H13NO3 +K NaN ... 0.964919 0.004456 0.50
NaN C7H14O5 +Na NaN ... 0.993197 0.832082 0.05
NaN C7H15Cl2N2O3P +Na NaN ... 0.789780 0.011702 0.10
NaN C8H10O5 +Na NaN ... 0.974511 0.015889 0.10
NaN C8H10S +K NaN ... 0.954763 0.005440 0.50
NaN C8H16N2O4S +Na NaN ... 0.783592 0.077055 0.10
NaN C8H7NO +Na NaN ... 0.989191 0.444907 0.05
NaN C9H17NO5 +K NaN ... 0.739612 0.161133 0.20
NaN C9H18O2 +Na NaN ... 0.977522 0.222603 0.05
NaN C9H21NO7P +Na NaN ... 0.969531 0.002814 0.20
NaN C9H6O5 +K NaN ... 0.953029 0.005638 0.50
NaN C9H8O4 +Na NaN ... 0.972199 0.023245 0.10
[14 rows x 12 columns], 'spatial_wrong': Empty DataFrame
Columns: [spatial, spatial_ref, error]
Index: [], 'spectral_wrong': spectral spectral_ref error
ion_i
2472037.0 0.977071 0.969428 0.007643
1135330.0 0.977224 0.969633 0.007591
2854943.0 0.977883 0.970511 0.007372
3037315.0 0.980123 0.973498 0.006625
923977.0 0.983130 0.977506 0.005624, 'chaos_wrong': Empty DataFrame
Columns: [chaos, chaos_ref, error]
Index: [], 'msm_wrong': Empty DataFrame
Columns: [msm, msm_ref, error]
Index: [], 'fdr_error': formula adduct chaos ... msm_ref fdr_ref fdr_error
ion_i ...
2868858.0 C10H18O3S +K 0.989025 ... 0.025889 0.2 1
4205259.0 C10H18O5 +K 0.999479 ... 0.790682 0.2 1
1532938.0 C10H20N6O4 +H 0.999678 ... 0.733073 0.2 1
3256031.0 C11H17NO3 +Na 0.971815 ... 0.006601 0.1 1
5548632.0 C11H20O7 +K 0.999328 ... 0.696685 0.2 1
... ... ... ... ... ... ...
1907656.0 C9H18OS +Na 0.981548 ... 0.006613 0.1 1
NaN C9H21NO7P +Na NaN ... 0.002814 0.2 2
NaN C9H6O5 +K NaN ... 0.005638 0.5 1
NaN C9H8O4 +Na NaN ... 0.023245 0.1 3
2669422.0 C9H9N +H 0.999379 ... 0.737288 0.2 1
[64 rows x 13 columns]}
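For anyone trying to reproduce the "Missing annotations" count outside the pipeline, a comparison along these lines may be useful. This is a rough sketch and not the actual `check_results()` implementation: it assumes annotations are matched on `formula` and `adduct`, the helper name `find_missing_annotations` is hypothetical, and the numbers are toy values.

```python
import pandas as pd


def find_missing_annotations(results_df, ref_df, keys=("formula", "adduct")):
    """Return reference annotations that have no match in results_df.

    Matching on (formula, adduct) is an assumption made for this sketch.
    """
    keys = list(keys)
    # Suffix the non-key reference columns so they stay distinguishable
    ref = ref_df.rename(
        columns={c: f"{c}_ref" for c in ref_df.columns if c not in keys}
    )
    # A left merge keeps every reference row; indicator marks unmatched ones
    merged = ref.merge(results_df, on=keys, how="left", indicator=True)
    missing = merged[merged["_merge"] == "left_only"]
    return missing.drop(columns="_merge")


# Toy data, not taken from the real datasets
ref_df = pd.DataFrame(
    {"formula": ["C7H10N2", "C10H10O"], "adduct": ["+H", "+Na"], "msm": [0.1, 0.2]}
)
results_df = pd.DataFrame(
    {"formula": ["C10H10O"], "adduct": ["+Na"], "msm": [0.2]}
)
missing = find_missing_annotations(results_df, ref_df)
print(missing[["formula", "adduct", "msm_ref"]])
```

Rows that come back with `NaN` in the non-reference columns correspond to the `NaN` entries in the "missing_results" table above.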
@omerb01 That's strange. Have you rerun `build_database` and `calculate_centroids` since updating? Is it possible that PyWren is picking up old files in your bucket?
I deleted all files in my bucket and reran the pipeline, and it passed, but I only had 10k results:
> len(results_df)
10879
> checked_results = pipeline.check_results()
Unauthorized. Only public but not private datasets will be accessible.
2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Missing annotations: 0 (PASS)
2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Incorrect spatial metric: 0 (PASS)
2020-03-24 15:03:40,854 [INFO] annotation-pipeline: Incorrect spectral metric: 5 (PASS)
2020-03-24 15:03:40,855 [INFO] annotation-pipeline: Incorrect chaos metric: 0 (PASS)
2020-03-24 15:03:40,855 [INFO] annotation-pipeline: Incorrect MSM: 0 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: FDR changed: 54 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: FDR changed significantly: 0 (PASS)
2020-03-24 15:03:40,856 [INFO] annotation-pipeline: All checks pass
This is in `experiment-1-typical.ipynb` with `input_config_big.json` and no local changes.
@LachlanStuart I committed a minor fix for the dbs path to the master branch. I reran the entire workload and it still produces the same failure messages that I attached above. Also, I'm not sure why you observe only ~10k metrics while I observe ~60k.
One big change: the molecular DBs are no longer available through the METASPACE API, so I've just included them as CSV files that will be converted/uploaded with the existing `upload_mol_dbs_from_dir` call.
For the rest of the changes, see the commit messages / comments.