Make pipeline faster - Githubissues

joewandy commented 4 years ago

Notes on some things that could be done to make the pre-processing pipeline faster. Will keep adding to this as I go.

- Speed up get_chebi_id.
  - Seems that get_chebi_id is called many times in a loop inside preprocessing. Maybe we can speed this up.
  - Get rid of this print https://github.com/kmcluskey/FlyMet/blob/master/met_explore/preprocessing.py#L120 -- DONE
- Speed up construct_all_peak_df.
  - Try to remove the double for-loops in https://github.com/kmcluskey/FlyMet/blob/master/met_explore/peak_selection.py#L137-L145, or parallelise them?
  - Get rid of this print https://github.com/kmcluskey/FlyMet/blob/master/met_explore/peak_selection.py#L710 -- DONE
- Speed up remove_duplicates by vectorising it?
- Speed up populate_peaks_cmpds_annots by making it insert in batches.
- Speed up populate_peaksamples by making it insert in batches.
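
For the get_chebi_id item, one cheap option might be to memoise the lookup so that repeated calls with the same argument inside the loop are served from a cache. A minimal sketch, assuming the lookup is a pure function of a hashable argument. `lookup_chebi_id`, `annotate` and the toy table are hypothetical stand-ins, not the actual FlyMet code:

```python
from functools import lru_cache

# Toy data: the names and IDs here are made up for illustration only.
_TOY_TABLE = {"cmpd_a": "ID-1", "cmpd_b": "ID-2"}


@lru_cache(maxsize=None)
def lookup_chebi_id(name):
    """Hypothetical slow lookup; cached so each distinct name is resolved once."""
    # In a real pipeline this would be the expensive part (DB or web query).
    return _TOY_TABLE.get(name)


def annotate(names):
    # The loop still calls the lookup once per row,
    # but repeated names are answered from the cache.
    return [lookup_chebi_id(n) for n in names]


if __name__ == "__main__":
    print(annotate(["cmpd_a", "cmpd_b", "cmpd_a", "cmpd_a"]))
    # cache_info() reports how many calls hit the cache vs. ran the function.
    print(lookup_chebi_id.cache_info())
```

This only helps if the same identifiers really do recur across iterations. If every call uses a distinct argument, the cache adds nothing and the loop itself would need restructuring.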

joewandy commented 4 years ago

Seems that the slowest part so far is populate_peaksamples, especially when there are many samples.

Will change that later to do a bulk create using many=True, see https://stackoverflow.com/questions/43435247/creating-multiple-objects-with-one-request-in-django-and-django-rest-framework.
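
To illustrate why batching should help here, below is a rough stand-alone sketch using the standard-library `sqlite3` module rather than Django, so it runs without a project. It contrasts one INSERT per row with chunked `executemany` calls. The `peak_sample` table, its columns and the helper names are assumptions for illustration, not the real FlyMet schema:

```python
import sqlite3

# Hypothetical table and columns, used only for this sketch.
INSERT_SQL = "INSERT INTO peak_sample (peak_id, sample_id, intensity) VALUES (?, ?, ?)"


def make_db():
    # In-memory database so the sketch leaves nothing on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peak_sample (peak_id INTEGER, sample_id INTEGER, intensity REAL)"
    )
    return conn


def insert_one_by_one(conn, rows):
    # One statement per row: simple, but slow when there are many rows.
    for row in rows:
        conn.execute(INSERT_SQL, row)
    conn.commit()


def insert_in_batches(conn, rows, batch_size=1000):
    # Send the rows in chunks and commit once at the end.
    for start in range(0, len(rows), batch_size):
        conn.executemany(INSERT_SQL, rows[start:start + batch_size])
    conn.commit()


if __name__ == "__main__":
    rows = [(p, s, float(p * s)) for p in range(100) for s in range(20)]
    conn = make_db()
    insert_in_batches(conn, rows, batch_size=500)
    print(conn.execute("SELECT COUNT(*) FROM peak_sample").fetchone()[0])
```

In Django itself the analogous ORM call is `bulk_create`, which takes a list of unsaved model instances and an optional `batch_size`.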

joewandy commented 4 years ago

Small fixes to make add_chebi_ids slightly faster: https://github.com/kmcluskey/FlyMet/commit/9ccda3e6deef532a9cb157cb70500b5ce706dccc and https://github.com/kmcluskey/FlyMet/commit/84e3a50d622bdb7c886c1b2029ecb63825fee0bc.

joewandy commented 4 years ago

> Seems that the slowest part so far is populate_peaksamples, especially when there are many samples.
>
> Will change that later to do a bulk create using many=True, see https://stackoverflow.com/questions/43435247/creating-multiple-objects-with-one-request-in-django-and-django-rest-framework.

Done the above in https://github.com/kmcluskey/FlyMet/commit/ccdc3016a9c3b2ad126f6e74dd4ab8f02c7c636e. Peak population is much, much faster now.

kmcluskey / FlyMet

Make pipeline faster #45