qunfengdong / BLCA


ValueError: max() arg is an empty sequence #19

Closed dswan closed 5 years ago

dswan commented 5 years ago

Hi, I know this has been discussed in #5, #6, #10 and #13, but it is still biting me under certain conditions:

Git repo is up to date:

(base) ubuntu@155:/data/databases/BLCA$ git pull
Already up to date.

Running with python3 and clustalo:

(base) ubuntu@155:/data/databases/BLCA$ python3 2.blca_main.py -i ambig.fasta -a clustalo -r ncbi/16SMicrobial_BLCAparsed.taxonomy -q ncbi/16SMicrobial_BLCAparsed.fasta
clustalo is located in your PATH!
>  > Fasta file read in!
>  > Reading in taxonomy information! ....
blastn is located in your PATH!
> > Running blast!!
Warning: [blastn] Number of threads was reduced to 1 to match the number of available CPUs
> > Blastn Finished!!
>  > read in blast file...
>  > blastn file opened
>  > blast output read in
>  > Start aligning reads...
Traceback (most recent call last):
  File "2.blca_main.py", line 411, in <module>
    outout.write(le + ":" + max(lexsum, key=lexsum.get) + ";" + str(max(lexsum.values())) + ";")
ValueError: max() arg is an empty sequence

Running with python3 and muscle:

(base) ubuntu@155:/data/databases/BLCA$ python3 2.blca_main.py -i ambig.fasta -a muscle -r ncbi/16SMicrobial_BLCAparsed.taxonomy -q ncbi/16SMicrobial_BLCAparsed.fasta
muscle is located in your PATH!
>  > Fasta file read in!
>  > Reading in taxonomy information! ....
blastn is located in your PATH!
> > Running blast!!
Warning: [blastn] Number of threads was reduced to 1 to match the number of available CPUs
> > Blastn Finished!!
>  > read in blast file...
>  > blastn file opened
>  > blast output read in
>  > Start aligning reads...
>  > Taxonomy file generated!!
Time elapsed: 0 minutes

I get the same result with the NCBI 16S or a custom DB, so it's not DB formatting. Is there an issue with clustalo result parsing (I'm on v1.2.4)?

Interestingly, on another machine (python2.7, different arch) I get the same error with both clustalo and muscle. I'm looking into that now.
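A minimal sketch of why line 411 can fail, assuming `lexsum` is a dict of taxon name to summed score that ends up empty when no usable hits are collected for a read. The helper `best_taxon` and its default value are hypothetical and are not part of BLCA; they only illustrate guarding the `max()` call.

```python
def best_taxon(lexsum, default=("Not Available", 0.0)):
    """Return (name, score) for the top-scoring entry, or `default` if lexsum is empty."""
    if not lexsum:
        # Nothing was accumulated for this level, so max() would raise ValueError.
        return default
    name = max(lexsum, key=lexsum.get)
    return name, lexsum[name]


# max() on an empty dict raises ValueError, as in the traceback above.
try:
    max({}, key={}.get)
    raised = False
except ValueError:
    raised = True

print(raised)
print(best_taxon({}))
print(best_taxon({"Escherichia": 62.5, "Shigella": 37.5}))
```

A guard like this only hides the symptom. If `lexsum` is empty for every read, the more likely problem is upstream, in how the BLAST hits are matched to the taxonomy.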

qunfengdong commented 5 years ago

Hi Daniel: Thanks for your report. We are in the middle of the July 4th holiday long weekend. If you don't mind sending your input data to my personal email qunfengd@gmail.com, we will work on it on Monday, July 8th, after we return to our office. Of course, we will keep your dataset confidential and delete it after the investigation. Qunfeng


dswan commented 5 years ago

I will share shortly via email, enjoy the long weekend!

Did a bit more digging this morning. Same script, same dataset, different versions of python:

(py27_checkm) ubuntu@155:/data/databases/BLCA$ python2 2.blca_main.py -i ambig.fasta -a clustalo -r ncbi/16SMicrobial_BLCAparsed.taxonomy -q ncbi/16SMicrobial_BLCAparsed.fasta
clustalo is located in your PATH!
>  > Fasta file read in!
>  > Reading in taxonomy information! ....
blastn is located in your PATH!
> > Running blast!!
> > Blastn Finished!!
>  > read in blast file...
>  > blastn file opened
>  > blast output read in
>  > Start aligning reads...
>  > Taxonomy file generated!!
Time elapsed: 0 minutes
(py27_checkm) ubuntu@155:/data/databases/BLCA$ python --version; clustalo --version
Python 2.7.15
1.2.4

Same script, dataset, different python:

(base) ubuntu@i155:/data/databases/BLCA$ python3 2.blca_main.py -i ambig.fasta -a clustalo -r ncbi/16SMicrobial_BLCAparsed.taxonomy -q ncbi/16SMicrobial_BLCAparsed.fasta
clustalo is located in your PATH!
>  > Fasta file read in!
>  > Reading in taxonomy information! ....
blastn is located in your PATH!
> > Running blast!!
Warning: [blastn] Number of threads was reduced to 1 to match the number of available CPUs
> > Blastn Finished!!
>  > read in blast file...
>  > blastn file opened
>  > blast output read in
>  > Start aligning reads...
Traceback (most recent call last):
  File "2.blca_main.py", line 411, in <module>
    outout.write(le + ":" + max(lexsum, key=lexsum.get) + ";" + str(max(lexsum.values())) + ";")
ValueError: max() arg is an empty sequence
(base) ubuntu@155:/data/databases/BLCA$ python --version; clustalo --version
Python 3.7.3
1.2.4

I notice that when the script is called with Python 3, but not with Python 2, blastn also prints a warning about reducing the number of threads to 1. For some reason that warning doesn't appear under Python 2.

I'll drop the query fasta, and a copy of my BLCA formatted database somewhere (it's just the NCBI database, formatted outside of BLCA). It's not an issue with clustalo - I've tried multiple versions of it, with the same result.
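Since the behaviour differs between environments, it can help to record the tool versions alongside each run. The sketch below is one way to do that, assuming `blastn -version` and `clustalo --version` print a dotted version number somewhere in their output; `parse_version` and `tool_version` are hypothetical helper names, not part of BLCA.

```python
import re
import subprocess


def parse_version(text):
    """Return the first dotted version number found in `text`, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


def tool_version(command):
    """Run `command` (a list of arguments) and return its parsed version.

    Returns None if the executable cannot be started.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    for cmd in (["blastn", "-version"], ["clustalo", "--version"]):
        print(cmd[0], tool_version(cmd))
```

Running this in each conda environment gives a quick side-by-side of what actually differs between a working and a failing setup.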

dswan commented 5 years ago

Email sent :)

yingeddi2008 commented 5 years ago

Hi Daniel,

Thanks for bringing these potential bugs to our attention. I ran a simple test on the example you provided, but I couldn't reproduce the error with either clustalo or muscle under Python 3.7.3. The error could also be caused by the blastn version. Which blastn version are you using? I was using BLAST 2.7.1.

When I ran the test with blastn 2.9.0, I saw the error. It is a sequence name formatting issue with the new version of the BLAST suite. I have updated the Python 3 code, so it should work now.
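The thread does not show the actual patch, so the following is only a sketch of the general kind of mismatch, under the assumption that different BLAST versions may report subject IDs either as bar-delimited strings or as bare accessions, while the taxonomy file is keyed by one form only. The function `normalize_seqid`, the example accession and the taxonomy entry are all hypothetical.

```python
def normalize_seqid(raw_id):
    """Reduce a bar-delimited BLAST subject ID to its last non-empty field.

    'gi|123|ref|NR_0001.1|' and 'ref|NR_0001.1|' both become 'NR_0001.1';
    a bare accession is returned unchanged.
    """
    cleaned = raw_id.strip()
    fields = [field for field in cleaned.split("|") if field]
    return fields[-1] if fields else cleaned


# Hypothetical taxonomy table keyed by bare accession.
taxonomy = {"NR_0001.1": "species:Example species;genus:Example;"}

for hit_id in ("gi|123|ref|NR_0001.1|", "ref|NR_0001.1|", "NR_0001.1"):
    direct = taxonomy.get(hit_id)                       # fails for bar-delimited IDs
    normalized = taxonomy.get(normalize_seqid(hit_id))  # succeeds for all three forms
    print(hit_id, direct is not None, normalized is not None)
```

If every lookup misses because of an ID format change, no taxonomy is collected for any hit, which is consistent with an empty `lexsum` at line 411.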

Please let me know if you run into other issues.

dswan commented 5 years ago

Ah - that might be the issue, as I've been pegged to blastn 2.9.0. Is -parse_seqids used when the database is built with makeblastdb? It's caused issues with other packages with blastn 2.8+.

I can confirm that the latest BLCA release works fine with muscle and clustalo (and blastn 2.9.0). Thank you!
