qunfengdong / BLCA

34 stars 12 forks source link

no assignment for OTU with numerous blast matches, even after relaxing blastn filtering criteria (e.g. % identity, evalue, aln len)? #30

Closed lplough closed 2 years ago

lplough commented 2 years ago

Hi There, I am getting about 1/3 of my OTU assigned taxonomy (output) from BLCA (2/3 "unassigned"). In some cases, those OTU/sequences have poor or no match to the blast database provided and even doing a remote blast against the entire genbank nt database provides no match. However, there are a number of OTU/sequences that return many ( often up to requested # of hits, e.g. 50 by BLCA default) with relatively low e values (<1e-20) 70-80% alignment length, and 80-95% identity). These would seem like usable matches that BLCA could estimate LCA on, though of course I don't expect high confidence in species level assignments when %ID is lower than 95%.

Anyway, have tried to alter (relax substantially) some of the filter settings (arguments) in 2.blca_main.py script, specifically for what I think is the BLASTn part of the script, but still get many many 'unassigned' results for OTU that would appear to have decent blast matches. I am wondering if I am misunderstanding how to set these arguments or where in the pipeline the filters are being applied?

For example, I am running this: 2.blca_main.py -i 97.centroids.fa -r Euk.Tax.BLCA.FINAL -q Euk.Co1.Tax -c .5 --iset 60

Following the description of the options from BLCA (or python 2.blca.main.py --help), I was trying to set the coverage filter pretty low (to 50%) and the minimum identity to 60% , also pretty low, to see if I could get more OTU to have some taxonomy information produced.

-e ESET, --eset ESET  minimum evalue to include for blastn. Default: 0.1
  -b BSET, --bset BSET  minimum bitscore to include for blastn hits. Default: 100
  -c CVRSET, --cvrset CVRSET
                        minimum coverage to include. Default: 0.85 [0-1]
  --iset ISET           minimum identity score to include. Default: 90 [0-100]

Here are my results (top lines of the *.blca.out results file from that BLCA run:

head *.out
OTU2    Unclassified
OTU3    Unclassified
OTU4    Unclassified
OTU5    Unclassified
OTU6    superkingdom:Eukaryota;100.0;phylum:Annelida;100.0;class:Clitellata;100.0;order:Haplotaxida;100.0;family:Naididae;100.0;genus:Vejdovskyella;100.0;species:Vejdovskyella sp._1_AEB-2009;100.0;

This is fine, but when looking at the blastn output used in the BLCA pipeline, I am not understanding why e.g. OTU2 was not assigned ( 90% identity, coverage 71.3 %) but OTU 6, with just 3 hits, was?

OTU2    HM462488.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    GU203729.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    HM408106.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JN281323.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ524682.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ524683.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ524688.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ524689.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ524731.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ572768.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU2    JQ550806.1  90.566  53  5   0   261 313 606 658 2.09e-10    71.3    38  plus    658 313
OTU3    HE800620.1  91.071  56  5   0   203 258 548 603 4.49e-12    76.8    41  plus    658 313
OTU3    HF586205.1  91.071  56  5   0   203 258 548 603 4.49e-12    76.8    41  plus    649 313
OTU3    MG472165.1  92.453  53  3   1   203 255 548 599 1.61e-11    75.0    40  plus    602 313
OTU3    KR134776.1  93.617  47  3   0   206 252 551 597 2.09e-10    71.3    38  plus    658 313
OTU6    GQ355377.1  100.000 313 0   0   1   313 328 640 2.81e-163   579 313 plus    1153    313
OTU6    AF534863.1  90.064  312 31  0   2   313 331 642 5.04e-111   405 219 plus    1177    313
OTU6    KY633411.1  90.064  312 31  0   2   313 347 658 5.04e-111   405 219 plus    658 313
OTU7    JF304060.1  100.000 261 0   0   1   261 304 564 2.26e-134   483 261 plus    564 313

I feel as though I am not understanidng some of the '2.main.blca.py' options or am not implementing them correctly? And, given that OTU6 has just 3 blast hits returned, Im not sure how much Id trust the LCA results here? Can (should we) we set a minimum number of blast hits to use for a given OTU when aligning and estimating the taxonomy?

Thanks for any input you can provide!

LP

qunfengdong commented 2 years ago

Thanks for the report. Let us look into this and get back to you (perhaps in a few days). Re. "a minimum number of blast hits to use for a given OTU when aligning and estimating the taxonomy", it's a tricky thing that we had thought about it but did not figure out how to incorporate into our statistical framework. Imagine this case: your query only matches ONE database hit (just one, so it seems low), but it was a PERFECT match. I would still trust the result because it may simply reflect that there happens to be just one database entry from this taxon.

qunfengdong commented 2 years ago

By the way, just want to confirm: for those database hits of OTU2, do they have assigned taxa information?

lplough commented 2 years ago

Good question. I will check and get back to you.

lplough commented 2 years ago

They appear to be in my taxonomy file (as they should because I made sure that all acc # in the fasta database , the headers, were also present in the taxonomy file before running BLCA, per @YJulyXing 's reccomendation):

cat 97.centroids.blastn | awk '/^OTU2\t/ {print $0}'| cut -f2
HM462488.1
GU203729.1
HM408106.1
JN281323.1
JQ524682.1
JQ524683.1
JQ524688.1
JQ524689.1
JQ524731.1
JQ572768.1
JQ550806.1

searching for those acc#s in the taxonomy database:

cat 97.centroids.blastn | awk '/^OTU2\t/ {print $0}'| cut -f2 | grep -f /dev/stdin/ Euk.Tax.BLCA.FINAL 

JQ524731.1  species:Euglyphis sp._Janzen01;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ550806.1  species:Euglyphis sp._Janzen06;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
GU203729.1  species:Euglyphis sp._Janzen16;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ572768.1  species:Euglyphis sp._Janzen16;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ524682.1  species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ524683.1  species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ524688.1  species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JQ524689.1  species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
HM408106.1  species:Lepidoptera sp._BOLD_AAA2529;genus:NA;family:NA;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
JN281323.1  species:Lepidoptera sp._BOLD_AAA2529;genus:NA;family:NA;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
HM462488.1  species:Anomura sp._LPdivOTU101;genus:NA;family:NA;order:Decapoda;class:Malacostraca;phylum:Arthropoda;superkingdom:Eukaryota;

So, they are there, right?

Louis

YJulyXing commented 2 years ago

Hi, could you try to set "-b BSET" something around 60? In your blastn output, it seems like OUT2's hits all had bitscores = 71.3, which was lower than the default 100. I think this is the reason your OUT2 didn't get assigned.

On Thu, Oct 28, 2021 at 11:36 AM lplough @.***> wrote:

They appear to be in my taxonomy file (as they should because I made sure that all acc # in the fasta database , the headers, were also present in the taxonomy file before running BLCA, per @YJulyXing https://github.com/YJulyXing 's reccomendation):

cat 97.centroids.blastn | awk '/^OTU2\t/ {print $0}'| cut -f2 HM462488.1 GU203729.1 HM408106.1 JN281323.1 JQ524682.1 JQ524683.1 JQ524688.1 JQ524689.1 JQ524731.1 JQ572768.1 JQ550806.1

searching for those acc#s in the taxonomy database:

cat 97.centroids.blastn | awk '/^OTU2\t/ {print $0}'| cut -f2 | grep -f /dev/stdin/ Euk.Tax.BLCA

JQ524731.1 species:Euglyphis sp._Janzen01;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ550806.1 species:Euglyphis sp._Janzen06;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; GU203729.1 species:Euglyphis sp._Janzen16;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ572768.1 species:Euglyphis sp._Janzen16;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ524682.1 species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ524683.1 species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ524688.1 species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JQ524689.1 species:Euglyphis sp._Janzen30;genus:Euglyphis;family:Lasiocampidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; HM408106.1 species:Lepidoptera sp._BOLD_AAA2529;genus:NA;family:NA;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; JN281323.1 species:Lepidoptera sp._BOLD_AAA2529;genus:NA;family:NA;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota; HM462488.1 species:Anomura sp._LPdivOTU101;genus:NA;family:NA;order:Decapoda;class:Malacostraca;phylum:Arthropoda;superkingdom:Eukaryota;

So, they are there, right?

Louis

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/qunfengdong/BLCA/issues/30#issuecomment-953964143, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKABWIVWHAI6RN2VGFO3ZRLUJFUXLANCNFSM5G47FKJQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Yue "July" Xing, Ph.D. Postdoctoral research associate Ph.D. in Genetics and MS in Statistics at Texas A&M University Center for Biomedical Informatics Department of Medicine Stritch School of Medicine Loyola University Chicago

lplough commented 2 years ago

Ah. Thanks. I realized that all alignments for OTU2 were less than 25% (only ~50 bp/313 bp aligned), so it was being filtered out via the (already quite lenient) alignment filter I had set. What to do with OTU that appear 'real' (substantial # of sequences in the dataset) and make it through the previous pipeline steps (chimera removal, denoising, etc) but have such poor alignments via BLAST? Even after BLASTING OTU2 remotely to the entire NCBI 'nt' database, no alignments were over 40% of the subjects.... I guess that is beyond the scope of BLCA :) .

Thanks again for your help and sorry I didn't pick up that issue myself.

Louis