mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

Need option to add filtering (SEG or DUST) to automatically-generated BLAST databases #21

Open mattb112885 opened 11 years ago

mattb112885 commented 11 years ago

Low-complexity region filtering will be useful particularly for TBLASTN... repeats in the genome that can give a false result. If I understand the NCBI docs correctly SEG is only added to the query sequences automatically, not the target sequences...

mattb112885 commented 11 years ago

James - have we decided whether or not we want to do this?

JamesRH commented 11 years ago

Our rule should be that (except for RPSblast, which is a special case), whenever there is any masking of query or database, we always pass the softmasking option (for query or database).

Since we use the query softmasking option for tblastn (where query masking is on by default) in the wrapper, and since you got rid of the scripts that did not do this, we should be fine. We just need to make sure that it is mentioned in the tblastn wrapper -h text and that we are not masking the databases we build in the main scripts. If you make those changes to documentation, then this can be closed.

I talked to Rachel about this some time ago, but I tried to email her about it and what I've learned about the new blast+, and she didn't remember our earlier conversation (she also didn't know about the new blast+, which is like professors being out of the lab long enough they don't know how to use the new pipettor, I suppose). I will double-check again, but I think we have the right approach.

James H

On 04/23/2013 09:22 PM, mattb112885 wrote:

James - have we decided whether or not we want to do this?


On 03/21/2013 05:10 PM, James Henriksen wrote:

Should I submit a bug? James H

On 03/21/2013 04:37 PM, Matthew Benedict wrote:

James:

Responses below... thanks for doing all the legwork on this. We still do need to decide whether or not to mask the query DB though... this is probably an argument someone has had with Rachel already and that we don't care to repeat, so next time you talk to Rachel could you ask her about it?

Thanks and Best

Matthew Benedict, Chemical Engineering Graduate Student, University of Illinois. Email: matthew.n.benedict@gmail.com

On Thu, Mar 21, 2013 at 3:03 PM, James Henriksen <jamesrh@illinois.edu> wrote:

Matt,

I bit the bullet and read the NCBI documentation for the new blast+. I had to skim the entire online book, as it is horribly organized. Here are my take-home messages for iTEP. The main one is that when we use masking or when it is the default (as with the query masking default in tblastn), we should use soft masking (for masking of queries, this is the -soft_masking option; for database masking it is -db_soft_mask). In searches without database or query masking, we don't need to use that setting.

So by search:

tblastn and tblastx
Default is to mask the query without softmasking! We must use -soft_masking when we run these or alignments will break into pieces. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).

ITEP's wrapper currently uses the -soft_masking flag, so there is nothing to worry about here unless we want to use -db_soft_mask as well.

blastn and blastx
Default IS to use masking of query with dust and to soft-mask the query, which is what we probably want. To be explicit, we COULD use -soft_masking in our code, but I don't think this is necessary. If you do want to, blast should then use the default filter, which right now is -dust 'yes' for the '20 64 1' setting, I think. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).

blastp
Default is NOT to use softmasking of query, but also NOT to mask at all. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below). We could use -soft_masking and -seg 'yes' to get faster searches, but I don't think it is necessary. Low-information queries will just have a ton of hits, and that is OK. It would probably speed up our searches if we were masking queries, but if that is the goal we should probably also softmask the databases. I don't think we should do either.

It's OK as long as the entire query isn't low-information. Small low-information hits may be filtered out because they only cover small parts of the protein. Excluding things in the middle of proteins, though, may cause the HSPs to break up and fubar our scoring metric (I'm not sure if this is really true, but I'd be concerned about it. Could you answer this with your knowledge of how the masking works?).

rpsblast
Default is to mask the query without softmasking, but given what it does, I think this is OK. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).

I could add softmasking the query to this as well if you think it would make the results better.
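The soft-masking rule discussed in this thread (pass the soft-masking option whenever query or database masking is active, with rpsblast treated as a special case) could be sketched roughly as below. This is only an illustration, not ITEP's actual wrapper code: the function names `masking_args` and `blast_command` and the file and database names are hypothetical, and the caller is assumed to know whether query masking is active for the program in question. Only the `-soft_masking` and `-db_soft_mask` options come from the discussion above.

```python
def masking_args(program, query_masked, db_mask_id=None):
    """Return the extra BLAST+ arguments implied by the soft-masking rule.

    program      -- BLAST+ program name, e.g. "tblastn" (hypothetical input)
    query_masked -- True if query masking is active (by default or by request)
    db_mask_id   -- filtering algorithm ID stored in the database, or None
    """
    if program == "rpsblast":
        return []  # treated as a special case in this thread
    args = []
    if query_masked:
        args += ["-soft_masking", "true"]
    if db_mask_id is not None:
        args += ["-db_soft_mask", str(db_mask_id)]
    return args


def blast_command(program, query, db, query_masked, db_mask_id=None):
    """Build a full argument list; nothing is executed here."""
    base = [program, "-query", query, "-db", db]
    return base + masking_args(program, query_masked, db_mask_id)


# Hypothetical file and database names, for illustration only.
tblastn_cmd = blast_command("tblastn", "queries.faa", "genomes", query_masked=True)
blastp_cmd = blast_command("blastp", "queries.faa", "proteins", query_masked=False)
print(" ".join(tblastn_cmd))
print(" ".join(blastp_cmd))
```

Building the command as a list (rather than a shell string) means multi-word option values such as '20 64 1' can be passed as a single argument without extra quoting. To actually run a command, the list can be handed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed.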

For all of the above:
Default is NOT to mask the query DB. If we wanted to we could use -db_soft_mask ##, where ## is the filter ID applied to the blast db, which we would have to generate for each database. If we wanted to speed up our searches (on the order of 10 seconds vs. 30 min for megablast), we COULD also mask the database and/or the query, but we would miss low information regions, even if we did use softmasking.

If there's low information we can't necessarily be confident in its correctness anyway right? Doing the masking on the DB isn't hard now that we have the right programs installed - just need to talk to Rachel to find out what the lab consensus is (and maybe change around the code a bit to make this an option to the user)
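For reference, building a database that carries masking information might look roughly like the sketch below, which only assembles the command lines and does not run them. It assumes the BLAST+ tools `dustmasker` (nucleotide) or `segmasker` (protein), `makeblastdb`, and `blastdbcmd` are installed; the function name `masked_db_commands` and the file names are hypothetical. The filter ID to pass to -db_soft_mask is not hard-coded here because it has to be read from the `blastdbcmd -info` output for each database. Note that `-parse_seqids` is used in both the masking and the database-building step so the sequence IDs match, which may run into the "|" problem with ITEP IDs mentioned later in this thread.

```python
def masked_db_commands(fasta, db_name, dbtype="nucl"):
    """Return the three commands needed to build and inspect a masked database.

    fasta   -- input FASTA file (hypothetical name supplied by the caller)
    db_name -- name of the BLAST database to create
    dbtype  -- "nucl" (masked with DUST) or "prot" (masked with SEG)
    """
    masker = "dustmasker" if dbtype == "nucl" else "segmasker"
    mask_file = db_name + "_mask.asnb"
    # Step 1: compute low-complexity regions and save them in binary ASN.1.
    make_mask = [masker, "-in", fasta, "-infmt", "fasta", "-parse_seqids",
                 "-outfmt", "maskinfo_asn1_bin", "-out", mask_file]
    # Step 2: build the database and attach the masking data to it.
    make_db = ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-parse_seqids",
               "-mask_data", mask_file, "-out", db_name]
    # Step 3: print database info, which lists the available filter IDs.
    show_info = ["blastdbcmd", "-db", db_name, "-info"]
    return [make_mask, make_db, show_info]


commands = masked_db_commands("genome.fna", "genome_db")
for cmd in commands:
    print(" ".join(cmd))
```

Making this a user option, as suggested above, would then amount to running these steps when the database is built and passing the reported ID through -db_soft_mask at search time.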

   James H.

PS Other information of use (you may know all this, but it was news to me since I tended to still use the old blast).

You can now search directly against a fasta file, without making a database (it is slower, however):
like: blastn -subject bigfile.fasta -query sequences.fasta

I've done this by mistake, calling -subject on a database, and wondered why it segfaulted. But yes, this would be a useful feature, particularly for deciding what to do with pulled-in TBLASTN sequences.
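One way to guard against the -subject mix-up described above is to check that the subject really is a FASTA file before building the command. The sketch below is an assumption about how such a check could look, not existing ITEP code; `looks_like_fasta` and `blastn_subject_command` are hypothetical names, and the check is deliberately crude (it only looks at the first non-blank line).

```python
import os
import tempfile


def looks_like_fasta(path):
    """Return True if path is a file whose first non-blank line starts with '>'."""
    if not os.path.isfile(path):
        return False  # e.g. a BLAST database name, which is not a single file
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False


def blastn_subject_command(query_fasta, subject_fasta):
    """Build a blastn command that searches a FASTA file directly via -subject."""
    if not looks_like_fasta(subject_fasta):
        raise ValueError("-subject expects a FASTA file: " + subject_fasta)
    return ["blastn", "-query", query_fasta, "-subject", subject_fasta,
            "-outfmt", "6"]


# Demo with a throwaway FASTA file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGTACGT\n")
    subject_path = tmp.name
cmd = blastn_subject_command("sequences.fasta", subject_path)
print(" ".join(cmd))
os.remove(subject_path)
```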

You can create virtual blast databases by aliasing together multiple blast databases, or creating a subset based on gi_list (if it was made with the -parse_seqids option)
http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I57_Use_blastdb_aliast
I think that if the database is generated with -taxid_map and -parse_seqids then you can subset on those as well. However, trying to implement this may mean we would have to make NCBI-style deflines in our original files, which may not like our IDs due to the "|" character.

Yes that would be a pain.
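For anyone who does want to try the virtual-database route, the two `blastdb_aliastool` uses described above (aliasing several databases together, and subsetting one database by a GI list) could be assembled roughly as follows. The function names and all database and file names are hypothetical, and the sketch only builds the argument lists; it assumes the underlying databases were built with -parse_seqids, as noted above.

```python
def alias_db_command(db_names, out_name, dbtype="prot", title=None):
    """Build a command that aliases several BLAST databases into one virtual db."""
    return ["blastdb_aliastool",
            "-dblist", " ".join(db_names),  # space-separated, passed as one argument
            "-dbtype", dbtype,
            "-out", out_name,
            "-title", title or out_name]


def gi_subset_command(db_name, gi_list_file, out_name, dbtype="prot", title=None):
    """Build a command that creates a virtual subset of a database from a GI list."""
    return ["blastdb_aliastool",
            "-db", db_name,
            "-gilist", gi_list_file,
            "-dbtype", dbtype,
            "-out", out_name,
            "-title", title or out_name]


# Hypothetical database and file names, for illustration only.
alias_cmd = alias_db_command(["organism_a", "organism_b"], "clade_ab")
subset_cmd = gi_subset_command("all_proteins", "cluster_12.gi", "cluster_12_db")
print(" ".join(alias_cmd))
print(" ".join(subset_cmd))
```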

  • There is an option that actually finds the best hit, instead of trying to find the best hit in a list of outputs (the first-is-often-not-the-best-hit problem): http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I4212_BestHits_filteri (using -best_hit_score_edge 0.05 -best_hit_overhang 0.25)

  • In addition to searching using a Conserved Domain Database query, you can use delta-blast to search using a PSSM or even using an alignment (if you give it an alignment that includes your query and the name of the protein you want to use as a query, it builds the PSSM on the fly). This may be very useful for finding distant homologs to an ortholog/cluster (prevents having to search using every sequence in the cluster), but of course only works for proteins.

Isn't this how PSI-blast works as well? I messed with PSI-blast, quickly got confused / crappy results and didn't go back to it although if we want to be serious about looking at larger genetic distances we will need to do something like this...