mmcguffi / pLannotate

Webserver and command line tool for annotating engineered plasmids
GNU General Public License v3.0

Annotating a large batch of plasmid reads #26

Closed nh13 closed 1 year ago

nh13 commented 1 year ago

Hey @jeffreybarrick,

I wanted to get your thoughts before I make this PR, which will only take me a few hours. The idea would be to be able to annotate a large batch of plasmid sequences, for example tens of thousands. This would be useful when QC'ing individual long reads sequenced from a colony of plasmids, prior to assembling the various populations.

I would modify annotate and its accompanying methods to accept a batch of queries rather than the single query they take now, and explicitly set threading on the various blast/diamond/cmscan tools. I would then modify the main "batch" method to accept a FASTA or FASTQ containing more than one sequence (I would use pysam for parsing, so a new dependency). I would also add a bit of caching (opt-in, of course). Finally, I would create HTML and GBK files (when the corresponding CLI flags are set) on a per-read basis for QC. I'd probably add a bit of progress logging too.
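Rough sketch of what I mean; `annotate_batch`, `read_records`, and the `plannotate.annotate` import path are assumptions for illustration, not existing pLannotate API:

```python
# Hypothetical sketch of a batched entry point. The names and import path
# below are assumptions for illustration, not existing pLannotate API.
from pathlib import Path
from typing import Iterator, Tuple

import pysam  # the new dependency discussed above
from plannotate.annotate import annotate  # import path is an assumption


def read_records(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA or FASTQ file."""
    with pysam.FastxFile(str(path)) as handle:
        for entry in handle:
            yield entry.name, entry.sequence


def annotate_batch(path: Path, out_dir: Path) -> None:
    """Annotate every record in a multi-record FASTA/FASTQ for QC."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, seq in read_records(path):
        hits = annotate(seq)  # existing single-sequence call, signature simplified
        hits.to_csv(out_dir / f"{name}.csv")  # per-read output, assuming a DataFrame of hits
```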

Thoughts?

jeffreybarrick commented 1 year ago

I don't see any major issues with these changes. Speeding things up for large batches would be great!

Here are a couple of thoughts:

If you can keep compatibility so that the internal Python methods work as they do now but ALSO work with batches/lists via functional polymorphism or checking the parameters, that would be nice.

I'd prefer not to add more dependencies, so consider using Biopython to read in the files. This would also be a more natural way of remaining compatible with processing an input file in another format (containing many GenBank records, for example).
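For example, Biopython's SeqIO already covers multi-record parsing with one interface across formats (the helper name below is just illustrative):

```python
# Reading multi-record input with Biopython instead of adding pysam.
# SeqIO handles FASTA, FASTQ, and GenBank files through the same interface.
from Bio import SeqIO


def iter_sequences(path: str, fmt: str = "fasta"):
    """Yield (record id, sequence string) pairs from a multi-record file."""
    for record in SeqIO.parse(path, fmt):
        yield record.id, str(record.seq)


# e.g. iter_sequences("reads.fastq", fmt="fastq")
#      iter_sequences("plasmids.gb", fmt="genbank")
```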

mmcguffi commented 1 year ago

Yes, I almost certainly will not add any more dependencies -- pLannotate already takes a long time to install with vanilla conda and sometimes even fails to solve its dependencies. If anything, I will remove dependencies (click has been particularly problematic).

Maybe I am misunderstanding something, but I'm not entirely sure how this would increase throughput. pLannotate would still have to be run serially on each entry/sample. Couldn't this functionality also be achieved by first expanding the multi-FASTA/GBK into individual files and then running pLannotate on each one? Parallelism could then be achieved with GNU parallel, workflow managers, etc.
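Something like this, for example (a minimal sketch, assuming Biopython; file naming is illustrative), would split the input so each record could then be annotated independently with GNU parallel or a workflow manager:

```python
# Split a multi-FASTA into one file per record so each record can be
# annotated independently (GNU parallel, a workflow manager, etc.).
from pathlib import Path

from Bio import SeqIO


def split_fasta(path: str, out_dir: str = "split") -> list[Path]:
    """Write each record of a multi-FASTA to its own file; return the paths."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for record in SeqIO.parse(path, "fasta"):
        dest = out / f"{record.id}.fasta"
        SeqIO.write(record, dest, "fasta")
        written.append(dest)
    return written
```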

I'm not entirely opposed to adding this, I'm just not sure I completely understand. There are other considerations as well, like naming the output files.

nh13 commented 1 year ago

The idea is to speed up the database search portion of the code: instead of iterating through each database and searching a single query (plasmid) at a time, we submit multiple queries (e.g. tens or hundreds of reads) in a single batch. This is faster than querying every read against every database separately. Hopefully this makes sense.
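To illustrate the batching (placeholder database and parameters, not pLannotate's actual search settings): write many reads into one query, run a single BLAST call, then split the tabular hits back out by query id.

```python
# Illustrative batched search: one blastn call over many reads, with the
# tabular (-outfmt 6) hits grouped back by query id afterwards. The
# database path and parameters are placeholders.
import subprocess
from collections import defaultdict


def blast_batch(reads: dict[str, str], db: str, threads: int = 8) -> dict[str, list[str]]:
    """reads maps read name -> sequence; returns outfmt-6 hit lines per read."""
    query = "".join(f">{name}\n{seq}\n" for name, seq in reads.items())
    result = subprocess.run(
        ["blastn", "-db", db, "-outfmt", "6", "-num_threads", str(threads)],
        input=query, capture_output=True, text=True, check=True,
    )
    hits: dict[str, list[str]] = defaultdict(list)
    for line in result.stdout.splitlines():
        hits[line.split("\t")[0]].append(line)  # qseqid is the first column
    return dict(hits)
```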

Of course we can split the input reads into batches and parallelize, and I may just do that, but I don't want each batch to be a single query.

On another note, I am very pleased with the conciseness of defopt (it uses the types in your function signature and the docstring to generate the CLI arguments). Here's a PR with the change: https://github.com/barricklab/pLannotate/pull/29
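A minimal example of what defopt looks like (not pLannotate's actual CLI; the function and parameters are made up):

```python
# Minimal defopt example: the type hints and the docstring generate the CLI,
# so there is no separate parser definition. Not pLannotate's actual CLI.
import defopt


def annotate_file(fasta: str, *, threads: int = 4, html: bool = False) -> None:
    """Annotate every record in a FASTA file.

    :param fasta: path to the input FASTA
    :param threads: worker threads for the database searches
    :param html: also write a per-record HTML report
    """
    print(f"annotating {fasta} with {threads} threads (html={html})")


if __name__ == "__main__":
    defopt.run(annotate_file)
```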

nh13 commented 1 year ago

I am just going to do a benchmark where I run a batch of reads together, as well as one at a time, and prove to myself that this makes a difference. Stay tuned.
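Roughly the shape of the benchmark I have in mind (`annotate_one` and `annotate_batch` are placeholders for whatever entry points end up existing):

```python
# Rough benchmark shape: time per-read runs against one batched run over
# the same records. annotate_one and annotate_batch are placeholders.
import time


def bench(records, annotate_one, annotate_batch):
    start = time.perf_counter()
    for _, seq in records:
        annotate_one(seq)
    per_read = time.perf_counter() - start

    start = time.perf_counter()
    annotate_batch([seq for _, seq in records])
    batched = time.perf_counter() - start

    print(f"per-read: {per_read:.1f}s  batched: {batched:.1f}s")
```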

nh13 commented 1 year ago

Looks like there really isn't too much of a speedup, except for when we have duplicate reads. The real advantage is supporting multiple records in a single FASTA, but I can use the API to do that myself :)