Hi @prihoda,
We have many ideas and working solutions for this problem. I will add some examples to the documentation shortly.
In the meantime, would you mind providing more specifics about the use case you have in mind? In particular, what type of pipeline are you running (links would be helpful), and how large is the reference?
I'm thinking about integrating the BioPhi humanness evaluation report: https://github.com/Merck/BioPhi
It uses a 22GB reference database (SQLite) containing human antibody 9-mers. It's a single file stored on Zenodo.
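For context, a minimal sketch of fetching that single-file SQLite reference once and opening it read-only (the Zenodo URL and file name below are placeholders, not the actual BioPhi record):

```python
import sqlite3
import urllib.request
from pathlib import Path

# Placeholder URL -- substitute the actual Zenodo record for the BioPhi reference DB.
REFERENCE_URL = "https://zenodo.org/record/<record-id>/files/human_9mers.db"
LOCAL_PATH = Path("/data/human_9mers.db")

def ensure_reference() -> sqlite3.Connection:
    """Download the ~22GB reference once, then open it read-only."""
    if not LOCAL_PATH.exists():
        LOCAL_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(REFERENCE_URL, LOCAL_PATH)
    # Read-only URI avoids accidental writes to the shared reference.
    return sqlite3.connect(f"file:{LOCAL_PATH}?mode=ro", uri=True)
```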
Hi @prihoda, circling back: for a 22GB file, it is actually quite reasonable to download the entire file for each execution. As long as you are storing the file in an S3 bucket, we observe download speeds of 100-200MB/s, so an upper bound on the download time is under four minutes and probably closer to two (22GB / 100MB/s ≈ 220s, or ≈ 110s at 200MB/s).
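As a rough sketch (the bucket and key names below are placeholders, not part of any existing pipeline), downloading the reference at the start of each task execution could look like this:

```python
import time
from pathlib import Path

import boto3  # assumes AWS credentials are available in the task environment

# Placeholder bucket/key -- point these at wherever the 22GB reference is mirrored.
BUCKET = "my-reference-data"
KEY = "biophi/human_9mers.db"
LOCAL_PATH = Path("/tmp/human_9mers.db")

def fetch_reference() -> Path:
    """Download the reference from S3 before the task body runs."""
    start = time.time()
    s3 = boto3.client("s3")
    # boto3's transfer manager performs concurrent ranged downloads for
    # large objects, which is how the 100-200MB/s figure is reachable.
    s3.download_file(BUCKET, KEY, str(LOCAL_PATH))
    print(f"downloaded {KEY} in {time.time() - start:.0f}s")
    return LOCAL_PATH
```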
It would be helpful if you could link a GitHub repository with your actual SDK code so I can comment there.
Hi all, this is a very exciting piece of work.
I'm wondering how to handle cases where a workflow depends on large reference databases. I assume it's not optimal to download them on each execution.
Can you provide an example?