qunfengdong / BLCA

Error writing output: ValueError: max() arg is an empty sequence #6

Closed Talitrus closed 6 years ago

Talitrus commented 6 years ago

Hi Eddi,

I've been trying to get BLCA to run with a custom database. As far as I can tell, I have everything formatted as specified in the README, but BLCA errors out while trying to write the output files with the following messages:

blastdbcmd is located in your PATH!
muscle is located in your PATH!
>  > Fasta file read in!!
>  > Read in taxonomy information!
blastn is located in your PATH!
> > Running blast!!
> > Blastn Finished!!
>  > Read in blast output!
Traceback (most recent call last):
  File "/groups/cbi/shared/apps/BLCA/BLCA.git/2.blca_main.py", line 352, in <module>
    outout.write(le+":"+max(lexsum,key=lexsum.get)+";"+str(max(lexsum.values()))+";")
ValueError: max() arg is an empty sequence
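
The last line of the traceback suggests that `lexsum` was an empty dictionary when `max()` was called. Below is a minimal sketch that reproduces that failure mode and shows one possible guard. `best_taxon` is a hypothetical helper written for illustration and is not part of BLCA.

```python
def best_taxon(lexsum):
    """Return (name, score) for the highest-scoring taxon, or None if empty.

    `lexsum` is assumed to map taxon names to numeric scores.
    """
    if not lexsum:
        # Nothing to rank: avoid calling max() on an empty dictionary.
        return None
    name = max(lexsum, key=lexsum.get)
    return name, lexsum[name]


# Calling max() on an empty dictionary raises ValueError, as in the traceback.
try:
    max({}, key={}.get)
    raised = False
except ValueError:
    raised = True

print(raised)
print(best_taxon({}))
print(best_taxon({"Tedania": 73.5, "Clathria": 26.5}))
```

A guard like this only hides the symptom; the empty dictionary itself would still point to a lookup problem further upstream.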

I managed to get it to run successfully once or twice with different sequence IDs, but haven't been able to reproduce that. Even then, it kept identifying sequences as "Unclassified," which is probably a separate issue.

I've uploaded my blastdb, the reference FASTA used to generate the blastdb (midori_blca_dedup.fasta), the reference taxonomy file (midori_blca_taxonomy.txt), and two test input FASTA files (uniques10_trim.fasta and other_BLCA_test_input.fasta) to Dropbox: https://www.dropbox.com/sh/m3hie0b9o8ldc29/AABHeBAiS94lwl_skgtEY-qIa?dl=0

Thanks for your help!

Cheers, Bryan

yingeddi2008 commented 6 years ago

Hi Bryan,

Thanks for using our software.

After examining your files, I noticed that your taxonomy file and your database FASTA do not have the same number of entries: midori_blca_taxonomy.txt has 583043 lines of taxonomy information, while midori_blca_dedup.fasta has 582591 sequences. I do not think that caused the error, though.
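
A mismatch like this can be checked by comparing the IDs in the two files. The sketch below works on lists of lines so it stays self-contained. The function names are hypothetical, and it assumes the sequence ID is the first whitespace-separated token of both the FASTA header and the taxonomy line.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the text after '>' up to the first whitespace."""
    return [ln[1:].split()[0] for ln in lines
            if ln.startswith(">") and len(ln.strip()) > 1]


def taxonomy_ids(lines):
    """Collect the first field of every non-empty taxonomy line."""
    return [ln.split(None, 1)[0] for ln in lines if ln.strip()]


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy file)."""
    f_ids = set(fasta_ids(fasta_lines))
    t_ids = set(taxonomy_ids(taxonomy_lines))
    return sorted(f_ids - t_ids), sorted(t_ids - f_ids)


# Tiny made-up example data, not taken from the Midori files.
fasta = [">seqA", "ACGT", ">seqB", "GGCC"]
taxonomy = ["seqA\tspecies:x;", "seqB\tspecies:y;", "seqC\tspecies:z;"]
only_fasta, only_tax = compare_ids(fasta, taxonomy)
print(only_fasta, only_tax)
```

For real files, the line lists could come from `open(path).read().splitlines()`.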

Another thing is that the delimiter in the taxonomy file should be a tab rather than a space. I forgot to emphasize this in the instructions. Each line should be formatted as follows, where [\t] stands for a tab character:

DQ523177[\t]species:Gammarus tigrinus;genus:Gammarus;family:Gammaridae;order:Amphipoda;class:Malacostraca;phylum:Arthropoda;superkingdom:Eukaryota;
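
To illustrate the format, here is a rough sketch of how such a line could be split into an ID and a rank-to-name mapping. `parse_taxonomy_line` is a hypothetical function and is not how BLCA itself reads the file. It rejects lines that have no tab between the ID and the lineage.

```python
def parse_taxonomy_line(line):
    """Split 'ID<TAB>rank:name;rank:name;...' into (ID, {rank: name})."""
    seq_id, sep, lineage = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError("no tab between ID and lineage: %r" % line)
    ranks = {}
    for field in lineage.split(";"):
        if not field:
            continue  # skip the empty piece after the trailing ';'
        rank, _, name = field.partition(":")
        ranks[rank] = name
    return seq_id, ranks


example = ("DQ523177\tspecies:Gammarus tigrinus;genus:Gammarus;"
           "family:Gammaridae;order:Amphipoda;class:Malacostraca;"
           "phylum:Arthropoda;superkingdom:Eukaryota;")
seq_id, ranks = parse_taxonomy_line(example)
print(seq_id, ranks["species"])
```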

I fixed the error by replacing the space with a tab. The outputs for the two test FASTA files are as follows:

AY937375.1 superkingdom:Eukaryota;100.0;phylum:Cnidaria;100.0;class:Hydrozoa;100.0;order:Siphonophorae;100.0;family:Prayidae;100.0;genus:Praya;100.0;species:Praya dubia;100.0;

DQ133904.1 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Tedaniidae;100.0;genus:Tedania;100.0;species:Tedania ignis;100.0;

1 Unclassified

3 superkingdom:Eukaryota;100.0;phylum:Cnidaria;100.0;class:Anthozoa;100.0;order:Actiniaria;100.0;family:Aiptasiidae;100.0;genus:Aiptasia;100.0;species:Aiptasia pulchella;100.0;

2 Unclassified

5 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Tedaniidae;100.0;genus:Tedania;100.0;species:Tedania ignis;100.0;

4 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Tedaniidae;100.0;genus:Tedania;100.0;species:Tedania ignis;100.0;

other_BLCA_test_input.fasta.blca.out (END)

sq1abc123 Unclassified

sq10abc123 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Microcionidae;100.0;genus:Clathria;100.0;species:Clathria abietina;73.5;

sq3abc123 superkingdom:Eukaryota;100.0;phylum:Cnidaria;100.0;class:Anthozoa;100.0;order:Actiniaria;100.0;family:Aiptasiidae;100.0;genus:Aiptasia;100.0;species:Aiptasia pulchella;100.0;

sq2abc123 Unclassified

sq5abc123 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Tedaniidae;100.0;genus:Tedania;100.0;species:Tedania ignis;100.0;

sq7abc123 superkingdom:Eukaryota;100.0;phylum:Cnidaria;100.0;class:Anthozoa;100.0;order:Scleractinia;100.0;family:Agariciidae;100.0;genus:Agaricia;100.0;species:Agaricia agaricites;65.75;

sq6abc123 Unclassified

sq9abc123 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Microcionidae;100.0;genus:Clathria;100.0;species:Clathria abietina;56.5;

sq4abc123 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;class:Demospongiae;100.0;order:Poecilosclerida;100.0;family:Tedaniidae;100.0;genus:Tedania;100.0;species:Tedania ignis;100.0;

sq8abc123 superkingdom:Eukaryota;100.0;phylum:Cnidaria;100.0;class:Anthozoa;100.0;order:Scleractinia;100.0;family:Agariciidae;100.0;genus:Agaricia;100.0;species:Agaricia agaricites;66.5;

uniques10_trim.fasta.blca.out (END)
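
Output lines like the ones above can be read back into a structured form. The following is only a sketch based on the lines shown here: it assumes a read ID, then whitespace, then either `Unclassified` or alternating `rank:name;score;` fields. `parse_blca_line` is a hypothetical name.

```python
def parse_blca_line(line):
    """Return (read_id, [(rank, name, score), ...]); empty list if unclassified."""
    parts = line.strip().split(None, 1)
    read_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if not rest or rest == "Unclassified":
        return read_id, []
    fields = [f for f in rest.split(";") if f]
    calls = []
    # Fields alternate between 'rank:name' and a numeric confidence score.
    for label, score in zip(fields[0::2], fields[1::2]):
        rank, _, name = label.partition(":")
        calls.append((rank, name, float(score)))
    return read_id, calls


line = ("sq10abc123 superkingdom:Eukaryota;100.0;phylum:Porifera;100.0;"
        "class:Demospongiae;100.0;order:Poecilosclerida;100.0;"
        "family:Microcionidae;100.0;genus:Clathria;100.0;"
        "species:Clathria abietina;73.5;")
read_id, calls = parse_blca_line(line)
print(read_id, calls[-1])
```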

I believe if you do the same, the error message will disappear, and BLCA will output the taxonomy of your reads.
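
One way to make that change is sketched below. It replaces only the first space on each line, because species names such as "Gammarus tigrinus" contain spaces of their own. `space_to_tab` is a hypothetical helper, and it assumes the sequence ID itself contains no spaces.

```python
def space_to_tab(line):
    """Replace the first space with a tab unless the line already has a tab."""
    if "\t" in line:
        return line
    seq_id, sep, rest = line.partition(" ")
    return seq_id + "\t" + rest if sep else line


before = "DQ523177 species:Gammarus tigrinus;genus:Gammarus;"
after = space_to_tab(before)
print(repr(after))
```

Applied line by line to the taxonomy file, this leaves lines that are already tab-delimited unchanged.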

I will add a check to the function that reads in the taxonomy file, and update the README file on GitHub.

Please let me know if you have any other problems,

Eddi

On Mon, Feb 12, 2018 at 11:41 AM, Bryan Nguyen wrote: [quoted copy of the original post above]

Talitrus commented 6 years ago

Fantastic! Yes, the differing number of lines/reads is because I generated the taxonomy file before deduplication, but I was hoping it wouldn't make a difference (just being lazy). The Midori database has some duplicate sequence IDs, which I didn't realize until I tried to generate the BLAST db.
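
Keeping the two files in sync after deduplication could look roughly like the sketch below. Both functions are hypothetical, operate on lists of lines, and assume the sequence ID is the first whitespace-separated token. Only the first record for each ID is kept.

```python
def dedup_fasta(lines):
    """Keep the first record for each sequence ID; drop later duplicates."""
    seen, out, keep = set(), [], False
    for ln in lines:
        if ln.startswith(">"):
            fields = ln[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            out.append(ln)
    return out


def filter_taxonomy(lines, keep_ids):
    """Keep one taxonomy line per ID, only for IDs present in keep_ids."""
    seen, out = set(), []
    for ln in lines:
        fields = ln.split(None, 1)
        if not fields:
            continue
        seq_id = fields[0]
        if seq_id in keep_ids and seq_id not in seen:
            seen.add(seq_id)
            out.append(ln)
    return out


# Made-up example records, not taken from the Midori database.
fasta = [">a", "AC", ">b", "GG", ">a", "TT"]
deduped = dedup_fasta(fasta)
kept = filter_taxonomy(["a\tx", "b\ty", "a\tx", "c\tz"], {"a", "b"})
print(deduped, kept)
```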

Thanks for the clarification. I'm looking forward to giving BLCA a try. The results you posted look promising.

Cheers, Bryan