kmcluskey / FlyMet

A multi-omics web app for Drosophila tissues
MIT License

Make pipeline faster #45

Open joewandy opened 4 years ago

joewandy commented 4 years ago

Notes on some things that could be done to make the pre-processing pipeline faster. Will keep adding to this as I go.

  1. Speed up get_chebi_id.
  2. Speed up construct_all_peak_df.
  3. Speed up remove_duplicates, possibly by vectorising it.
  4. Speed up populate_peaks_cmpds_annots by inserting in batches.
  5. Speed up populate_peaksamples by inserting in batches.

joewandy commented 4 years ago

Seems that the slowest part so far is populate_peaksamples, especially when there are many samples.

Will change that later to do a bulk create using a Django REST Framework serializer with many=True, see https://stackoverflow.com/questions/43435247/creating-multiple-objects-with-one-request-in-django-and-django-rest-framework.
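
For reference, a minimal sketch of the batching idea, not the FlyMet code. To stay self-contained it uses the standard-library sqlite3 module rather than Django, and the table name peak_sample, its columns, and the helper names are all hypothetical. In Django the analogous call would be Model.objects.bulk_create on a list of unsaved instances.

```python
import sqlite3

# Hypothetical schema: one row per (peak, sample) intensity measurement.
INSERT_SQL = (
    "INSERT INTO peak_sample (peak_id, sample_id, intensity) VALUES (?, ?, ?)"
)


def make_connection():
    """Create an in-memory database with the hypothetical peak_sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peak_sample ("
        "peak_id INTEGER, sample_id INTEGER, intensity REAL)"
    )
    return conn


def insert_one_by_one(conn, rows):
    """Slow pattern: one INSERT statement issued per row."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
    conn.commit()


def insert_in_batches(conn, rows, batch_size=1000):
    """Faster pattern: hand the driver a whole batch of rows per call."""
    for start in range(0, len(rows), batch_size):
        conn.executemany(INSERT_SQL, rows[start:start + batch_size])
    conn.commit()


if __name__ == "__main__":
    # 50 peaks x 10 samples, with made-up intensities.
    rows = [(p, s, float(p * s)) for p in range(50) for s in range(10)]
    conn = make_connection()
    insert_in_batches(conn, rows, batch_size=64)
    print(conn.execute("SELECT COUNT(*) FROM peak_sample").fetchone()[0])
```

The number of rows grows as peaks times samples, which would explain why the per-row version gets noticeably slower with many samples.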

joewandy commented 4 years ago

Small fixes to make add_chebi_ids slightly faster: https://github.com/kmcluskey/FlyMet/commit/9ccda3e6deef532a9cb157cb70500b5ce706dccc and https://github.com/kmcluskey/FlyMet/commit/84e3a50d622bdb7c886c1b2029ecb63825fee0bc.

joewandy commented 4 years ago

The bulk create for populate_peaksamples described above is done in https://github.com/kmcluskey/FlyMet/commit/ccdc3016a9c3b2ad126f6e74dd4ab8f02c7c636e. Peak population is much faster now.