qunfengdong / BLCA


Assignment not making sense #31

Closed shump2 closed 2 years ago

shump2 commented 2 years ago

Hi, I have created a database (nt.ACC.taxonomy, ~82 million records) from an updated version of the entire NCBI nt database. Some query sequences are not assigned to the correct species even though the blastn hits are correct. For example, when I search one sequence against the nt database with the custom taxonomy file, blastn returns 11 hits, but the alignment step returns a very strange result: COVID. I am not sure where to begin in resolving this. Are there any known issues? Any assistance would be much appreciated.

Blastn file:

    haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313
    haplo_51122 MW124469.1 99.361 313 2 0 1 313 343 655 1.57e-157 568 307 plus 655 313
    haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313
    haplo_51122 KU317714.1 99.042 313 3 0 1 313 309 621 7.32e-156 562 304 plus 622 313
    haplo_51122 KU317712.1 99.042 313 3 0 1 313 285 597 7.32e-156 562 304 plus 601 313
    haplo_51122 KM245630.1 99.042 313 3 0 1 313 384 696 7.32e-156 562 304 plus 15461 313
    haplo_51122 DQ525222.1 98.026 304 6 0 1 304 335 638 7.43e-146 529 286 plus 638 313
    haplo_51122 KU317713.1 98.893 271 3 0 1 271 279 549 1.63e-132 484 262 plus 549 313
    haplo_51122 MW124560.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
    haplo_51122 MW124542.1 90.096 313 31 0 1 313 346 658 3.63e-109 407 220 plus 658 313
    haplo_51122 DQ525226.1 90.099 303 30 0 1 303 335 637 2.83e-105 394 213 plus 638 313

Result:

    seq_01 superkingdom:Viruses;96.0;phylum:Pisuviricota;96.0;class:Pisoniviricetes;96.0;order:Nidovirales;96.0;family:Coronaviridae;96.0;genus:Betacoronavirus;96.0;species:Severe acute respiratory syndrome-related coronavirus;96.0;

qunfengdong commented 2 years ago

Thanks for the report. To be clear, are you using the entire nt database? BLCA is designed to work with marker genes rather than a generic database. You will need to use a particular family of marker genes as the database, so that the database entries are of similar length. Otherwise, the subsequent multiple sequence alignment is not reliable, which may be the problem here if you are using the entire nt database.
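As a rough illustration of the "similar length" requirement, here is a minimal Python sketch that keeps only FASTA records whose length falls inside a chosen window before a database is built from them. The function names and the length bounds are hypothetical and not part of BLCA. A length filter alone does not guarantee that the remaining records belong to one gene family, so treat it as a first sanity check rather than a complete recipe.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len: int, max_len: int):
    """Keep records whose sequence length lies within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input: one gene-sized record, one very short one, one much longer one.
example = [">gene_a", "ACGT", "ACGT", ">fragment_b", "AC", ">genome_c", "A" * 20]
kept = filter_by_length(read_fasta(example), min_len=5, max_len=10)
print(kept)
```

With a real file, the same functions can be fed an open file handle instead of the toy list.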

qunfengdong commented 2 years ago

If you can make your database and query available for us to download, we can take a look.

YJulyXing commented 2 years ago

Hi, could you check whether the blastn file you showed is correct? In your blastn output the query ID is "haplo_51122", but in the result file the ID is "seq_01". Do they refer to the same sequence?


-- Yue "July" Xing, Ph.D. Postdoctoral research associate Ph.D. in Genetics and MS in Statistics at Texas A&M University Center for Biomedical Informatics Department of Medicine Stritch School of Medicine Loyola University Chicago

shump2 commented 2 years ago

Yes, it's correct; I changed the ID as I copied it in. I will send the database soon; it's large. The blast searches are quite accurate, and I guess the final assignment will be based on the bit score ranks. I'm not sure how the COVID sequences appear!

YJulyXing commented 2 years ago

That's weird. Could you also send the sequence for this one entry? I'll take a look.

shump2 commented 2 years ago

Hi @YJulyXing @qunfengdong See here the link to the data: https://drive.google.com/drive/folders/1-WGhTt9wesYZbpY80I-A44QZkCtmtxte?usp=sharing

  1. nt.ACC.taxonomy file of the entire nt database ( wget https://ftp.ncbi.nlm.nih.gov/blast/db/nt.{00..47}.tar.gz)
  2. test (the sequence - should be Strombus gigas)
  3. test.blastn (the resulting blastn file generated)
  4. test.blca.out (clustalo)
  5. test1.blca.out (muscle)

Thanks for taking the time to look into this issue.

YJulyXing commented 2 years ago

Also, we don't think that using the entire nt database will work. You need to extract a family of marker genes. If the gene is embedded inside a whole-genome record, it will mess up the multiple sequence alignment.


shump2 commented 2 years ago

OK, I understand that, but the hits are all from the same species (albeit some are whole mitogenomes), so why does BLCA return a different species from the blast hits?

qunfengdong commented 2 years ago

@shump2 @YJulyXing We should have made this clearer in our documentation. For BLCA, we expect the database sequences to be of similar length, that is, to come from one gene family. For example, all 16S gene sequences have more or less similar lengths (not identical, but similar). If the database sequences differ dramatically in length, the multiple sequence alignment may become a problem. In your case, if some of the sequences are whole mitogenomes while others are a particular gene within the mitogenome, they are of very different lengths, which may prevent a reliable multiple sequence alignment.
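The length mismatch can be seen directly in the blastn output posted at the top of this thread. Below is a small Python sketch, not part of BLCA, that flags hits whose subject sequence is much longer than the median hit. It assumes that the 15th whitespace-separated column is the subject length; that is inferred from the sample rows (values such as 621 and 15460) rather than confirmed, and the `max_ratio` threshold is an arbitrary choice for illustration.

```python
def subject_lengths(blast_lines, slen_col=14):
    """Map subject accession -> subject length from tabular blastn lines.

    slen_col is the 0-based column assumed to hold the subject length;
    14 is a guess based on the sample rows in this thread.
    """
    lengths = {}
    for line in blast_lines:
        fields = line.split()
        if len(fields) <= slen_col:
            continue  # skip blank or truncated lines
        lengths[fields[1]] = int(fields[slen_col])
    return lengths


def length_outliers(lengths, max_ratio=2.0):
    """Return accessions whose length exceeds max_ratio times the median."""
    if not lengths:
        return []
    ordered = sorted(lengths.values())
    median = ordered[len(ordered) // 2]
    return sorted(acc for acc, n in lengths.items() if n > max_ratio * median)


# Five of the rows posted above, copied verbatim.
rows = [
    "haplo_51122 KU317715.1 99.361 313 2 0 1 313 309 621 1.57e-157 568 307 plus 621 313",
    "haplo_51122 MW124469.1 99.361 313 2 0 1 313 343 655 1.57e-157 568 307 plus 655 313",
    "haplo_51122 MZ157283.1 99.361 313 2 0 1 313 384 696 1.57e-157 568 307 plus 15460 313",
    "haplo_51122 KU317714.1 99.042 313 3 0 1 313 309 621 7.32e-156 562 304 plus 622 313",
    "haplo_51122 KM245630.1 99.042 313 3 0 1 313 384 696 7.32e-156 562 304 plus 15461 313",
]
outliers = length_outliers(subject_lengths(rows))
print(outliers)  # ['KM245630.1', 'MZ157283.1']
```

The two flagged accessions are the records of roughly 15 kb, which is consistent with them being whole mitogenomes sitting among gene-length records of about 600 bp.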

YJulyXing commented 2 years ago

What is your blast database? Are you using the default blast database (16SMicrobial)?


qunfengdong commented 2 years ago

@shump2, do you mind providing your blast database? @YJulyXing needs to check whether your database entries really have the correct taxon IDs in the NCBI taxonomy database you used.
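One way to start that check locally is to look up every blast hit accession in the taxonomy file and see which lineage, if any, is recorded for it. The Python sketch below assumes the taxonomy file has one record per line with the accession in the first tab-separated column and the lineage string after it; that layout is an assumption, so adjust the parsing if your file differs. The accessions and lineages in the example are made up. The file is streamed line by line, so an 82-million-record file does not have to fit in memory.

```python
def lookup_lineages(taxonomy_lines, wanted):
    """Return {accession: lineage} for the wanted accessions found in the file.

    Assumes each line is '<accession>\t<lineage>'; lines without a tab
    are ignored.
    """
    wanted = set(wanted)
    found = {}
    for line in taxonomy_lines:
        acc, sep, lineage = line.rstrip("\n").partition("\t")
        if sep and acc in wanted:
            found[acc] = lineage
    return found


def missing_accessions(found, wanted):
    """Return the wanted accessions that have no taxonomy record, sorted."""
    return sorted(set(wanted) - set(found))


# Hypothetical taxonomy records and hit accessions, for illustration only.
taxonomy = [
    "ACC0001.1\tspecies:Example species one;genus:Examplegenus;\n",
    "ACC0002.1\tspecies:Example species two;genus:Othergenus;\n",
    "malformed line without a tab\n",
]
hits = ["ACC0001.1", "ACC0003.1"]
found = lookup_lineages(taxonomy, hits)
missing = missing_accessions(found, hits)
print(found)
print(missing)
```

With a real file, pass the open file handle as `taxonomy_lines` and the second column of the blastn output as `wanted`. A hit that is missing, or that maps to an unexpected lineage, would point to the taxonomy file rather than to the alignment step.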