Open mattb112885 opened 11 years ago
James - have we decided whether or not we want to do this?
Our rule should be that (except for RPSblast, which is a special case), whenever there is any masking of query or database, we always pass the softmasking option (for query or database).
Since we use the query softmasking option for tblastn (where query masking is on by default) in the wrapper, and since you got rid of the scripts that did not do this, we should be fine. We just need to make sure that it is mentioned in the tblastn wrapper -h text and that we are not masking the databases we build in the main scripts. If you make those changes to documentation, then this can be closed.
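For the -h text, the rule above could be sketched roughly as follows. This is an illustrative Python helper, not the actual ITEP wrapper code: the function name `masking_args`, its parameters, and the file and database names are all made up for the example. Only the `-soft_masking` and `-db_soft_mask` options come from the discussion in this thread.

```python
# Hypothetical helper (not ITEP code): encodes the rule
# "whenever query or database masking is on, also pass the softmasking option".
def masking_args(program, mask_query=False, db_mask_id=None):
    """Return the extra BLAST+ arguments implied by the masking choices."""
    args = []
    if program == "rpsblast":
        # rpsblast is the stated special case: leave its defaults alone.
        return args
    if mask_query:
        # Query masking is in effect, so ask for soft masking of the query.
        args += ["-soft_masking", "true"]
    if db_mask_id is not None:
        # Database masking is in effect, so apply it as a soft mask.
        args += ["-db_soft_mask", str(db_mask_id)]
    return args


# tblastn masks the query by default, so the caller treats it as masked.
cmd = ["tblastn", "-query", "proteins.faa", "-db", "genomes"]
cmd += masking_args("tblastn", mask_query=True)
print(" ".join(cmd))
```

With no masking requested the helper adds nothing, which matches the point that unmasked searches do not need the setting.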
I talked to Rachel about this some time ago, but when I emailed her about it and about what I've learned about the new blast+, she didn't remember our earlier conversation (she also didn't know about the new blast+, which I suppose is like professors being out of the lab long enough that they don't know how to use the new pipettor). I will double-check again, but I think we have the right approach.
James H
On 04/23/2013 09:22 PM, mattb112885 wrote:
James - have we decided whether or not we want to do this?
On 03/21/2013 05:10 PM, James Henriksen wrote:
Should I submit a bug? James H
On 03/21/2013 04:37 PM, Matthew Benedict wrote:
James:
Responses below... thanks for doing all the legwork on this. We still do need to decide whether or not to mask the query DB though... this is probably an argument someone has had with Rachel already and that we don't care to repeat, so next time you talk to Rachel could you ask her about it?
Thanks and Best
Matthew Benedict
Chemical Engineering Graduate Student
University of Illinois
Email: matthew.n.benedict@gmail.com
On Thu, Mar 21, 2013 at 3:03 PM, James Henriksen <jamesrh@illinois.edu> wrote:
Matt, I bit the bullet and read the NCBI documentation for the new blast+. I had to skim the entire online book, as it is horribly organized. Here are my take-home messages for iTEP.

The main one is that when we use masking, or when masking is the default (as with the query masking default in tblastn), we should use soft masking: for masking of queries this is the -soft_masking option, and for database masking it is -db_soft_mask. In searches without database or query masking, we don't need to use that setting. So, by search:

tblastn and tblastx: Default is to mask the query without softmasking! We must use -soft_masking when we run these or alignments will break into pieces. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).
ITEP's wrapper currently uses the -soft_masking flag, so there is nothing to worry about here unless we want to use -db_soft_mask as well.
blastn and blastx: Default IS to use masking of query with dust and to soft-mask the query, which is what we probably want. To be explicit, we COULD use -soft_masking in our code, but I don't think this is necessary. If you do want to, blast should then use the default filter, which right now is -dust 'yes' for the '20 64 1' setting, I think. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).

blastp: Default is NOT to use softmasking of query, but also NOT to mask at all. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below). We could use -soft_masking and -seg 'yes' to get faster searches, but I don't think it is necessary. Low-information queries will just have a ton of hits, and that is OK. It would probably speed up our searches if we were masking queries, but if that is the goal we should probably also softmask the databases. I don't think we should do either.
It's OK as long as the entire query isn't low-information... small low-information hits may be filtered out because they only hit over small parts of the protein... and excluding things in the middle of proteins may cause the HSPs to break up and fubar our scoring metric (I'm not sure if this is really true, but I'd be concerned about it. Could you answer this with your knowledge of how the masking works?).
rpsblast: Default is to mask the query without softmasking, but given what it does, I think this is OK. The default is NOT to use database masking, but if we do use it, we should also use -db_soft_mask (see below).
I could add softmasking the query to this as well if you think it would make the results better.
For all of the above: Default is NOT to mask the query DB. If we wanted to we could use -db_soft_mask ##, where ## is the filter ID applied to the blast db, which we would have to generate for each database. If we wanted to speed up our searches (on the order of 10 seconds vs. 30 min for megablast), we COULD also mask the database and/or the query, but we would miss low information regions, even if we did use softmasking.
If there's low information, we can't necessarily be confident in its correctness anyway, right? Doing the masking on the DB isn't hard now that we have the right programs installed. We just need to talk to Rachel to find out what the lab consensus is (and maybe change the code around a bit to make this an option for the user).
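If we do go the database-masking route, the steps might look something like the sketch below. It only builds the command lines (it does not run anything), and it follows the masking recipe in the BLAST+ manual as I recall it, so the exact flags should be checked against each program's -help before use. The function names, file names, and the `_mask.asnb` suffix are hypothetical. The filter ID passed to -db_soft_mask is deliberately left as a parameter, since it has to be read from the built database (for example from the `blastdbcmd -info` output) rather than guessed.

```python
# Sketch only: builds argv lists for masking a database, nothing is executed.
# Flags follow the BLAST+ manual's masking recipe as recalled; verify with -help.
def db_masking_commands(fasta, db_name, dbtype="nucl"):
    """Return the commands to compute masks and build a masked BLAST database."""
    # dustmasker for nucleotide databases, segmasker for protein databases.
    masker = "dustmasker" if dbtype == "nucl" else "segmasker"
    mask_file = db_name + "_mask.asnb"  # hypothetical file name
    mask_cmd = [masker, "-in", fasta, "-infmt", "fasta", "-parse_seqids",
                "-outfmt", "maskinfo_asn1_bin", "-out", mask_file]
    build_cmd = ["makeblastdb", "-in", fasta, "-dbtype", dbtype,
                 "-parse_seqids", "-mask_data", mask_file, "-out", db_name]
    # The -info output lists the filtering algorithm IDs stored in the database.
    info_cmd = ["blastdbcmd", "-db", db_name, "-info"]
    return [mask_cmd, build_cmd, info_cmd]


def soft_masked_search(program, query, db_name, filter_id):
    """Return a search command that soft-masks both the query and the database."""
    return [program, "-query", query, "-db", db_name,
            "-soft_masking", "true", "-db_soft_mask", str(filter_id)]


for command in db_masking_commands("genomes.fna", "genomes"):
    print(" ".join(command))
```

Keeping the filter ID as an explicit argument would also make it easy to expose database masking as a user option later.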
James H.

PS Other information of use (you may know all this, but it was news to me since I tended to still use the old blast).

You can now search directly against a fasta file, without making a database (it is slower, however), like: blastn -subject bigfile.fasta -query sequences.fasta
I've done this by mistake, calling -subject on a database, and wondered why it segfaulted. But yes, this would be a useful feature, particularly for deciding what to do with pulled-in TBLASTN sequences.
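One way to avoid passing a database to -subject by accident would be to pick the option from what the target actually is. The sketch below is a guess at how that could look, not existing ITEP code: `target_args` is a hypothetical name, and the check (an existing file whose first non-blank line starts with ">") is only a heuristic for "this is a FASTA file".

```python
import os


# Hypothetical helper: choose -subject for a FASTA file, -db for anything else.
def target_args(path):
    """Return ['-subject', path] for a FASTA file, else ['-db', path]."""
    if os.path.isfile(path):
        # errors="replace" keeps binary database files from raising on decode.
        with open(path, errors="replace") as handle:
            for line in handle:
                if line.strip():
                    # Only the first non-blank line is inspected.
                    if line.startswith(">"):
                        return ["-subject", path]
                    break
    # A BLAST database is usually given as a name prefix, not a FASTA file.
    return ["-db", path]


cmd = ["blastn", "-query", "sequences.fasta"] + target_args("bigfile.fasta")
print(" ".join(cmd))
```

If bigfile.fasta does not exist where the script runs, the helper falls back to -db, so a real wrapper would probably want to report a missing file explicitly.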
You can create virtual blast databases by aliasing together multiple blast databases, or by creating a subset based on a gi_list (if the database was made with the -parse_seqids option): http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I57_Use_blastdb_aliast I think that if the database is generated with -taxid_map and -parse_seqids, then you can subset on those as well. However, trying to implement this may mean we would have to make NCBI-style deflines in our original files, which may not like our IDs due to the "|" character.
Yes, that would be a pain.
There is an option that actually finds the best hit, instead of trying to find the best hit in a list of outputs (the first-is-often-not-the-best-hit problem): http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I4212_BestHits_filteri (using -best_hit_score_edge 0.05 -best_hit_overhang 0.25)
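As a rough illustration of how those two settings could be added to a search command, here is a small sketch. The option names and the 0.05 / 0.25 values are the ones quoted above; the helper name `best_hit_args` and the query and database names are made up, and the values are not tuned for our data.

```python
# Hypothetical helper: appends the best-hit filtering options quoted above.
def best_hit_args(score_edge=0.05, overhang=0.25):
    """Return the best-hit filter arguments as a list of strings."""
    return ["-best_hit_score_edge", str(score_edge),
            "-best_hit_overhang", str(overhang)]


cmd = ["blastp", "-query", "cluster.faa", "-db", "allgenomes"] + best_hit_args()
print(" ".join(cmd))
```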
In addition to searching using a Conserved Domain Database query, you can use delta-blast to search using a PSSM or even using an alignment (if you give it an alignment that includes your query and the name of the protein you want to use as a query, it builds the PSSM on the fly). This may be very useful for finding distant homologs to an ortholog/cluster (it avoids having to search using every sequence in the cluster), but of course it only works for proteins.
Isn't this how PSI-blast works as well? I messed with PSI-blast, quickly got confused / crappy results, and didn't go back to it, although if we want to be serious about looking at larger genetic distances we will need to do something like this...
Low-complexity region filtering will be useful particularly for TBLASTN, where repeats in the genome can give a false result. If I understand the NCBI docs correctly, SEG is only applied to the query sequences automatically, not the target sequences...