should we not use the max target seqs parameter in our blast pipelines?

bradfordcondon commented 5 years ago

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166

Basically, this parameter does not return the top N hits; it returns the first N hits that meet whatever cutoff you set.
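To make the distinction concrete, here is a toy sketch in Python. It is not BLAST's actual algorithm, and the hit list and e-values are invented for illustration; it only shows how "first N that pass the cutoff" can differ from "best N that pass the cutoff" when hits arrive in database order.

```python
# Toy hits in database order: (subject_id, evalue). All values are made up.
hits = [
    ("subj_A", 1e-5),
    ("subj_B", 1e-20),
    ("subj_C", 1e-3),
    ("subj_D", 1e-50),
    ("subj_E", 1e-30),
]


def first_n_passing(hits, n, max_evalue):
    """Keep the first n hits, in database order, that pass the cutoff."""
    passing = [h for h in hits if h[1] <= max_evalue]
    return passing[:n]


def best_n_passing(hits, n, max_evalue):
    """Keep the n lowest-e-value hits among all hits that pass the cutoff."""
    passing = [h for h in hits if h[1] <= max_evalue]
    return sorted(passing, key=lambda h: h[1])[:n]


first = [subject for subject, _ in first_n_passing(hits, 2, 1e-4)]
best = [subject for subject, _ in best_n_passing(hits, 2, 1e-4)]
print(first)  # ['subj_A', 'subj_B']
print(best)   # ['subj_D', 'subj_E']
```

With these invented numbers the two strategies share no hits at all, which is the kind of surprise the paper is warning about.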

mestato commented 5 years ago

Interesting and surprising. According to this GitHub issue thread, the behavior is pervasive, including the e-value cutoff. I propose we go ahead and switch to Diamond, which is a lot faster. From the manual: "--max-target-seqs/-k # The maximum number of target sequences per query to report alignments for (default=25). Setting this to 0 will report all alignments that were found."

So set --max-target-seqs to 0 and leave everything else alone; then we'll have to write a Python script to filter the (giant) XML.
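A minimal sketch of what that filter script might look like, assuming Diamond's XML output follows the standard BLAST XML layout (`Iteration`, `Hit`, `Hsp`, `Hsp_bit-score`, `Hsp_evalue`). The function name `top_hits` and the choice to rank hits by their best HSP bit score are illustrative assumptions, not something we have agreed on. It streams the file with `iterparse` rather than loading it all at once.

```python
import xml.etree.ElementTree as ET


def top_hits(xml_source, n=5):
    """Yield (query, hits) per query, where hits holds the n best
    (hit_id, bit_score, evalue) tuples ranked by best-HSP bit score.

    xml_source may be a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query = elem.findtext("Iteration_query-def")
        scored = []
        for hit in elem.iter("Hit"):
            hsps = [
                (float(h.findtext("Hsp_bit-score")), float(h.findtext("Hsp_evalue")))
                for h in hit.iter("Hsp")
            ]
            if not hsps:
                continue
            # Represent each hit by its highest-scoring HSP.
            bit_score, evalue = max(hsps, key=lambda pair: pair[0])
            scored.append((hit.findtext("Hit_id"), bit_score, evalue))
        scored.sort(key=lambda h: h[1], reverse=True)
        result = scored[:n]
        elem.clear()  # drop this query's subtree to limit memory use
        yield query, result
```

Usage would be something like `for query, hits in top_hits("diamond_out.xml", n=5): ...`, where the file name is a placeholder. Ranking by e-value instead would only require changing the sort key.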

-- Margaret Staton Assistant Professor Department of Entomology and Plant Pathology 370 PBB, 2505 EJ Chapman Drive Knoxville, TN 37996-4560

864-506-4515 Mobile mstaton1@utk.edu

MattHuff commented 5 years ago

I've downloaded Diamond on the Staton server, and I'm running it to see how long it will take.

A few initial observations I've had:

I encountered issues downloading the files through git clone and the wget option listed in Diamond's manual, but installing it via conda produced no such issues. Given how the ACF is about conda environments, I'm going to give these earlier options another try when I install them there.
Diamond has its own format for database files, so existing libraries (such as the UniProt libraries) need to be rebuilt as Diamond database files.
It is definitely much faster than NCBI BLAST. I was having issues BLASTing a dataset against the TrEMBL library on the ACF: no matter what time limit I set, the job always seemed to exceed it. With Diamond, it finished in less than 30 minutes.
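For reference, the two steps above (building a Diamond database, then searching against it with all alignments reported as XML) could be wrapped roughly as follows. This is a sketch only: the file names are placeholders, `blastp` is an assumption (use `blastx` for nucleotide queries), and the `--max-target-seqs 0` setting follows the manual text quoted earlier in this thread, so check it against the installed Diamond version.

```python
import subprocess


def makedb_cmd(fasta, db_prefix):
    """Command to build a Diamond database from a protein FASTA file."""
    return ["diamond", "makedb", "--in", fasta, "--db", db_prefix]


def blastp_cmd(query, db_prefix, out_xml, threads=4):
    """Command to search a protein query set and write BLAST XML output."""
    return [
        "diamond", "blastp",
        "--query", query,
        "--db", db_prefix,
        "--out", out_xml,
        "--outfmt", "5",             # 5 = BLAST XML
        "--max-target-seqs", "0",    # report all alignments, per the quoted manual
        "--threads", str(threads),
    ]


def run(cmd):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


# Example (placeholder file names, not run here):
# run(makedb_cmd("uniprot_trembl.fasta", "trembl"))
# run(blastp_cmd("queries.faa", "trembl", "queries_vs_trembl.xml"))
```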

I'll update this as I make more observations, including how different Diamond's XML format is from BLAST's.

statonlab / hardwoods_site

should we not use the max target seqs parameter in our blast pipelines? #409