latchbio / latch

a python bioinformatics framework
https://docs.latch.bio

Using large reference databases #10

Closed · prihoda closed this issue 2 years ago

prihoda commented 2 years ago

Hi all, this is a very exciting piece of work.

I wonder how to handle cases where the workflow depends on large reference databases. I guess it's not optimal to download those upon each execution.

Can you provide an example?

kennyworkman commented 2 years ago

hi @prihoda

we have many ideas and working solutions for this problem. i will add some examples to the documentation shortly.

in the meantime, would you mind providing more specifics about the use case you have in mind? particularly, what type of pipeline are you running (links would be helpful) and how large is the reference?

prihoda commented 2 years ago

I'm thinking about integrating the BioPhi humanness evaluation report: https://github.com/Merck/BioPhi

It uses a 22GB reference database (SQLite) containing human antibody 9-mers. It's a single file stored on Zenodo.
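For context, a minimal sketch of the download-and-cache pattern in question, assuming a hypothetical Zenodo URL (the placeholder record id stands in for the real one) and a hypothetical local cache path:

```python
import os
import sqlite3
import urllib.request

# Hypothetical Zenodo URL and cache path; substitute the real record id.
DB_URL = "https://zenodo.org/record/<record-id>/files/oasis_9mers.db"
DB_PATH = "/root/reference/oasis_9mers.db"

def ensure_reference() -> str:
    """Download the reference once, reusing any cached copy on disk."""
    if not os.path.exists(DB_PATH):
        os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)
        urllib.request.urlretrieve(DB_URL, DB_PATH)  # 22GB transfer on first run
    return DB_PATH

# Open read-only so concurrent tasks can safely share one cached copy.
conn = sqlite3.connect(f"file:{ensure_reference()}?mode=ro", uri=True)
```

Note that this only avoids repeat downloads if the cache directory persists between executions, which is exactly the open question here.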

kennyworkman commented 2 years ago

hi @prihoda, circling back: for a 22GB file it is actually quite reasonable to download the entire file for each execution. As long as you are storing the file in an S3 bucket, we observe download speeds of 100-200MB/s, so an upper bound on download time is under four minutes, and it is probably closer to two.
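To make that arithmetic concrete, here is a minimal sketch of the per-execution download using boto3. The bucket, key, and local path are hypothetical placeholders, and the Latch SDK may offer its own staging helpers; this is just the plain S3 transfer described above.

```python
import boto3

# Hypothetical bucket/key; point these at wherever the reference is mirrored.
BUCKET = "my-reference-data"
KEY = "biophi/oasis_9mers.db"
LOCAL_PATH = "/root/oasis_9mers.db"

# At the quoted 100-200MB/s, a 22GB file takes roughly
#   22,000MB / 100MB/s ~ 220s (~3.7 min) at the low end,
#   22,000MB / 200MB/s ~ 110s (~1.8 min) at the high end.
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, LOCAL_PATH)
```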

It would be helpful if you could link a GitHub repository with your actual SDK code so I can comment there.