qunfengdong / BLCA

ValueError: max() arg is an empty sequence #29

Closed lplough closed 2 years ago

lplough commented 2 years ago

I am having a similar issue to @shump2 (#19 and #28) using a custom database for CO1 that looks like this (formatted with a tab between sequence ID and taxonomy info, as in your 16S example).

AB000675        species:Paralichthys    olivaceus;genus:Paralichthys;family:Paralichthyidae;order:Pleuronectiformes;class:Actinopteri;phylum:Chordata;superki$
AB002173        species:Henosepilachna  enneasticta;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:E$
AB002175        species:Henosepilachna  boisduvali;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eu$

Blastn runs fine (BLAST+ 2.9 or 2.12), then the script fails on the alignment. I am using Python 3. I'm not sure how to install Biopython for Python 2 in a straightforward way. I have seen you suggest to others that this error may be corrected with Python 2?

sample of Blastn output:

OTU2    HM191376    95.122  41  2   0   257 297 621 661 1.77e-09    65.8    35  plus    1237    313
OTU2    GQ154298    97.297  37  1   0   261 297 586 622 6.36e-09    63.9    34  plus    638 313
OTU2    KT193300    97.143  35  1   0   257 291 598 632 8.23e-08    60.2    32  plus    651 313
OTU2    HE964896    100.000 32  0   0   261 292 605 636 8.23e-08    60.2    32  plus    1143    313

See details here:

clustalo is located in your PATH!
>  > Fasta file read in!
>  > Reading in taxonomy information! ....
blastn is located in your PATH!
> > Running blast!!
> > Blastn Finished!!
>  > read in blast file...
>  > blastn file opened
>  > blast output read in
>  > Start aligning reads...
Traceback (most recent call last):
  File "../BLCA/2.blca_main.py", line 412, in <module>
    outout.write(le + ":" + max(lexsum, key=lexsum.get) + ";" + str(max(lexsum.values())) + ";")
ValueError: max() arg is an empty sequence

The same errors are generated with clustalo or muscle, so this must be a Python error...
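For context, this `ValueError` is what Python's built-in `max()` raises for any empty iterable, so the message on its own only says that the dictionary being summarized had no entries at that point. A minimal sketch of the behavior follows. The name `lexsum` is borrowed from the traceback, but its contents and the helper `best_label` are purely illustrative assumptions, not BLCA's actual code:

```python
def best_label(lexsum):
    """Return (label, score) for the highest-scoring entry, or None if empty.

    `lexsum` mirrors the name in the traceback; this helper is hypothetical
    and is not part of BLCA.
    """
    if not lexsum:
        # max() on an empty dict raises ValueError, so handle that case explicitly.
        return None
    label = max(lexsum, key=lexsum.get)
    return label, lexsum[label]


# A populated dictionary works as expected.
print(best_label({"Tubifex": 80.0, "Dacus": 20.0}))

# An empty dictionary triggers the same kind of error as in the traceback.
try:
    max({}.values())
except ValueError as err:
    print("ValueError:", err)
```

Guarding the call would only hide the symptom. The more useful question is why nothing was collected for the read in the first place, which usually points back at the input files.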

Any insight is much appreciated!

UPDATE - running with python2 and the current or 2.1 version of the program, I get similar errors:

Traceback (most recent call last):
  File "/Users/louisplough/Downloads/BLCA-2.1/2.blca_main.py", line 329, in <module>
    mx=max(tmpdic.values())
ValueError: max() arg is an empty sequence

Thanks, Louis

shump2 commented 2 years ago

The Python version is unlikely to be the issue. Make sure that the ACC.taxonomy file matches the NT database, i.e., restrict the BLAST database to your custom records. If you are getting BLAST hits outside this custom database, you will likely run into those errors.
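One way to test this suggestion is to list every subject ID in the tabular blastn output that has no entry in the taxonomy file. The sketch below assumes the subject ID is the second whitespace-delimited column of the blastn output and that the sequence ID is the first field of each taxonomy line. The function names and the inline sample (including the made-up ID `XX000000`) are hypothetical:

```python
def taxonomy_ids(lines):
    """Collect the first whitespace-delimited field (the sequence ID) of each line."""
    return {line.split()[0] for line in lines if line.strip()}


def missing_hits(blast_lines, tax_lines, subject_col=1):
    """Return subject IDs found in blastn output but absent from the taxonomy."""
    known = taxonomy_ids(tax_lines)
    hits = {line.split()[subject_col] for line in blast_lines if line.strip()}
    return sorted(hits - known)


# Tiny inline sample; for real files, pass open file handles instead of lists.
blast = [
    "OTU2\tHM191376\t95.122\t41",
    "OTU2\tGQ154298\t97.297\t37",
    "OTU2\tXX000000\t90.000\t30",  # made-up ID to demonstrate a miss
]
tax = [
    "HM191376\tspecies:Stachyris ruficeps;genus:Stachyris;",
    "GQ154298\tspecies:Dacus humeralis;genus:Dacus;",
]
print(missing_hits(blast, tax))
```

An empty result means every hit has a taxonomy entry. It does not prove that the taxonomy strings themselves are well formed.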

lplough commented 2 years ago

Thanks @shump2. I can check whether all seq IDs in the taxonomy file (top 3 rows shown above) match the seq IDs from my blastn output. I would imagine that they do, since I blasted my OTUs of interest (FASTA file) against the custom database that corresponds to the taxonomy file. In other words, I didn't do a BLAST search against, e.g., the entire GenBank NT database and then use a different taxonomy file for the downstream steps of BLCA. FYI, I am using the Midori CO1 database formatted for BLCA. The Midori CO1 database I am using has ~530K FASTA records, and the taxonomy file has a very similar number of records (slightly different because some redundant seq IDs in the FASTA database had to be removed in order for blastn to run).

Or am I missing something more fundamental about how BLCA works? I have my FASTA file of interest (~609 OTUs, i.e. 609 different sequences), the BLAST-formatted database for the Midori CO1 seqs (530K sequences), the blastn results of my query (609 seqs) against the Midori CO1 seqs, and the taxonomy file for the Midori CO1 seqs.

qunfengdong commented 2 years ago

Thanks @shump2 for your help! @lplough, if you still have trouble, could you make your dataset temporarily downloadable so that we can give it a try? We will delete your files afterwards, as we are not interested in doing your study :-)

lplough commented 2 years ago

@qunfengdong - can I email you the files directly via dropbox?

lplough commented 2 years ago

FYI - the BLASTn hit sequences appear to be in the taxonomy file.

$cat midori_blca_dedup.fasta | grep -o ">" | wc -l 
  582591
$cat midori_blca_taxonomy.dedup.txt | wc -l
  582591

Same # of FASTA sequences in the database and in the taxonomy file.

head midori_blca_taxonomy.dedup.txt 
AB000675     species:Paralichthys olivaceus;genus:Paralichthys;family:Paralichthyidae;order:Pleuronectiformes;class:Actinopteri;phylum:Chordata;superkingdom:Eukaryota;
AB002173     species:Henosepilachna enneasticta;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002175     species:Henosepilachna boisduvali;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002176     species:Henosepilachna septima;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002177     species:Henosepilachna pusillanima;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002178     species:Epilachna admirabilis;genus:Epilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002180     species:Henosepilachna vigintioctopunctata;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002181     species:Henosepilachna vigintioctopunctata;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002182     species:Henosepilachna vigintioctomaculata;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
AB002183     species:Henosepilachna pustulosa;genus:Henosepilachna;family:Coccinellidae;order:Coleoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
head 97.lulu.centroids.FINAL.blastn
OTU2    HM191376    95.122  41  2   0   257 297 621 661 1.77e-09    65.8    35  plus    1237    313
OTU2    GQ154298    97.297  37  1   0   261 297 586 622 6.36e-09    63.9    34  plus    638 313
OTU2    KT193300    97.143  35  1   0   257 291 598 632 8.23e-08    60.2    32  plus    651 313
OTU2    HE964896    100.000 32  0   0   261 292 605 636 8.23e-08    60.2    32  plus    1143    313
OTU2    GQ341690    100.000 32  0   0   261 292 606 637 8.23e-08    60.2    32  plus    658 313
OTU2    EU311294    100.000 32  0   0   263 294 595 626 8.23e-08    60.2    32  plus    645 313
OTU2    EU311287    100.000 32  0   0   263 294 595 626 8.23e-08    60.2    32  plus    645 313
OTU2    EU311286    100.000 32  0   0   263 294 595 626 8.23e-08    60.2    32  plus    645 313
OTU2    KP113679    100.000 31  0   0   261 291 605 635 2.96e-07    58.4    31  plus    657 313
OTU2    KJ082971    100.000 31  0   0   261 291 605 635 2.96e-07    58.4    31  plus    657 313

first lines of the blastn output and taxonomy file

Also, grepping the first 10 seqs matched for OTU2 in the blastn output against the taxonomy file finds those sequences in the tax file, so I don't think this is an issue of mismatches between the BLAST database and taxonomy files.

$head 97.lulu.centroids.FINAL.blastn | cut -f2 | grep -f /dev/stdin  midori_blca_taxonomy.dedup.txt 
EU311286     species:Tubifex blanchardi;genus:Tubifex;family:Tubificidae;order:Haplotaxida;class:Clitellata;phylum:Annelida;superkingdom:Eukaryota;
EU311287     species:Tubifex blanchardi;genus:Tubifex;family:Tubificidae;order:Haplotaxida;class:Clitellata;phylum:Annelida;superkingdom:Eukaryota;
EU311294     species:Tubifex blanchardi;genus:Tubifex;family:Tubificidae;order:Haplotaxida;class:Clitellata;phylum:Annelida;superkingdom:Eukaryota;
GQ154298     species:Dacus humeralis;genus:Dacus;family:Tephritidae;order:Diptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
GQ341690     species:Cymothoe zenkeri;genus:Cymothoe;family:Nymphalidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
HE964896     species:Cymothoe zenkeri;genus:Cymothoe;family:Nymphalidae;order:Lepidoptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
HM191376     species:Stachyris ruficeps;genus:Stachyris;family:Timaliidae;order:Passeriformes;class:Aves;phylum:Chordata;superkingdom:Eukaryota;
KJ082971     species:Tachydromia luang;genus:Tachydromia;family:Hybotidae;order:Diptera;class:Insecta;phylum:Arthropoda;superkingdom:Eukaryota;
KP113679     species:Metapenaeopsis palmensis;genus:Metapenaeopsis;family:Penaeidae;order:Decapoda;class:Malacostraca;phylum:Arthropoda;superkingdom:Eukaryota;
KT193300     species:Malapterurus microstoma;genus:Malapterurus;family:Malapteruridae;order:Siluriformes;class:Actinopteri;phylum:Chordata;superkingdom:Eukaryota;

YJulyXing commented 2 years ago

Hello,

I am a postdoc in Dr. Dong's lab and I currently maintain the BLCA application. You may just email me the files via Dropbox, and I'll take a look at them and try to figure out the problem. My email is @.***. Thank you!

Best, Yue

On Thu, Sep 30, 2021 at 4:05 PM lplough @.***> wrote:

@qunfengdong https://github.com/qunfengdong - can I email you the files directly via dropbox?

-- Yue "July" Xing, Ph.D. Postdoctoral research associate Ph.D. in Genetics and MS in Statistics at Texas A&M University Center for Biomedical Informatics Department of Medicine Stritch School of Medicine Loyola University Chicago

lplough commented 2 years ago

Hi @YJulyXing, thanks! Your email is hidden (as is mine, I think; standard GitHub). So I opened a private dummy repository and invited you, so that you can access my files without making them fully public. It has Dropbox links to the data. If you have any problems accessing them, just let me know. Thanks!

lplough commented 2 years ago

@YJulyXing @qunfengdong Did you get my Dropbox links to my dataset? Please let me know if you had any problems with them.

YJulyXing commented 2 years ago

Yes, we have downloaded the 3 files you provided. I'll take a look and see what the issue is.

On Tue, Oct 5, 2021 at 1:59 PM lplough @.***> wrote:

@YJulyXing https://github.com/YJulyXing @qunfengdong https://github.com/qunfengdong Did you get my dropbox links to my dataset? Please let me know ifyou had any problems with.

YJulyXing commented 2 years ago

Hello,

I was wondering whether the "xxx.ACC.taxonomy" file you created in the database is empty or not?

On Tue, Oct 5, 2021 at 3:10 PM Yue "July" Xing @.***> wrote:

Yes, we have downloaded the 3 files you provided. I'll take a look and see what the issue is.

YJulyXing commented 2 years ago

Hello,

Please discard my previous message. I was able to run BLCA using your given files (with some modification for the database file) without errors.

Here's what I did:

  1. Your taxonomy database, "midori_blca_taxonomy.final.txt", has a tab instead of a space inside species names. I changed the tab back to a space.
  2. Some rows in "midori_blca_taxonomy.final.txt" don't start with "species:" in column 2, e.g. "x.1 macao;genus:Ara;family:Psittacidae;order:Psittaciformes;class:Aves;phylum:Chordata;superkingdom:Eukaryota;". To run BLCA, this file needs to have "species:" in front of species names, so I removed rows like this.
  3. I used makeblastdb to build a blastn database from the FASTA db file.
  4. I gunzipped the OTU file.

Finally I ran 2.blca_main.py and it worked without errors.
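Steps 1 and 2 above could be automated roughly as follows. This is a sketch under stated assumptions, not the script that was actually used: it assumes the first tab separates the sequence ID from the taxonomy string, treats any further tab as a misplaced space, and drops rows whose taxonomy does not begin with "species:". The function names, the abbreviated taxonomy strings, and the ID `ZZ000001` are hypothetical, and the f-string requires Python 3:

```python
def normalize_taxonomy_line(line):
    """Return a cleaned 'ID<TAB>taxonomy' line, or None if the row should be dropped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # no taxonomy column at all
    seq_id = fields[0].strip()
    # Any tab after the first one sits inside the taxonomy string; use a space instead.
    taxonomy = " ".join(part.strip() for part in fields[1:] if part.strip())
    if not seq_id or not taxonomy.startswith("species:"):
        return None
    return f"{seq_id}\t{taxonomy}"


def clean_taxonomy(lines):
    """Yield repaired lines and skip rows that cannot be repaired."""
    for line in lines:
        fixed = normalize_taxonomy_line(line)
        if fixed is not None:
            yield fixed


sample = [
    "AB000675\tspecies:Paralichthys\tolivaceus;genus:Paralichthys;\n",
    "ZZ000001\tx.1 macao;genus:Ara;family:Psittacidae;\n",  # hypothetical ID
]
for row in clean_taxonomy(sample):
    print(row)
```

Dropping rows silently is convenient for a quick test, but for a real database it may be worth logging the skipped IDs so that the FASTA file and the taxonomy file can be kept in sync.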

Please let me know if you have further questions or if you would like the modified db file.

On Thu, Oct 7, 2021 at 9:46 PM Yue "July" Xing @.***> wrote:

Hello,

I was wondering for the "xxx.ACC.taxonomy" file you created in the database, is it empty or not?

lplough commented 2 years ago

Thanks @YJulyXing! I didn't realize that my taxonomy file was improperly formatted: what looked like a tab was NOT a tab ;). I was able to fix it based on your recommendations (proper spacing and removal of unexpected characters/spaces in the species names) and ran BLCA just fine!
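Since tabs and runs of spaces look identical in most terminals and editors, a quick tally of tab-separated fields per line can expose this kind of problem before BLCA is run. The sketch below is illustrative only: the function name is hypothetical, the sample lines are abbreviated, and it assumes a well-formed line has exactly two tab-separated fields:

```python
from collections import Counter


def tab_field_counts(lines):
    """Count how many lines split into 1, 2, 3, ... tab-separated fields."""
    return Counter(
        len(line.rstrip("\n").split("\t")) for line in lines if line.strip()
    )


sample = [
    "AB002173\tspecies:Henosepilachna enneasticta;genus:Henosepilachna;\n",
    "AB002175\tspecies:Henosepilachna\tboisduvali;genus:Henosepilachna;\n",
    "AB002176    species:Henosepilachna septima;genus:Henosepilachna;\n",
]
counts = tab_field_counts(sample)
print(dict(counts))

# repr() makes tabs (\t) and runs of spaces visible for any suspicious line.
print(repr(sample[1]))
```

Any count other than 2 flags lines worth inspecting: 3 or more suggests a stray tab inside the taxonomy, and 1 suggests spaces where the separating tab should be.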

Thanks again for looking at my data and for your help! Sorry that I didn't pick up on that formatting error earlier.

Cheers! Louis