motu-tool / mOTUs

motus - a tool for marker gene-based OTU (mOTU) profiling
GNU General Public License v3.0
144 stars 24 forks source link

Error extending mOTU 2.5.1 database #91

Closed Xentrics closed 2 years ago

Xentrics commented 2 years ago

I try to follow the instruction on Extending_the_mOTUs_database. I proceeded creating the conda environment and downloaded the mOTU 2.5.1 database (and moved it into the mOTU folder).

I am trying to run the test genomes, but I experience a crash. The pipeline starts successfully until it tries to create the GFF using runMakePaddedMG_DB. The errors says:

File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 413, in runMakePaddedMG_DB getPaddedSequences(extractLocationsFileName, contigsFasta, paddedFastaFileName, dictExtractName2writeName) File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 312, in getPaddedSequences if (line.startswith('>')): TypeError: startswith first arg must be bytes or a tuple of bytes, not str

I noticed that in the extend_mOTUs_addGenome.sh script, the python code is started using python2, which is indeed not present in the conda environment (it uses python 3.6.5).

Is this a python version error? Or is this a setup issue?

AlessioMilanese commented 2 years ago

It seems it's opening the file in bytes mode, and so it's calling bytes.startswith() and not str.startswith().

I noticed that in the extend_mOTUs_addGenome.sh script, the python code is started using python2, which is indeed not present in the conda environment (it uses python 3.6.5). Is this a python version error? Or is this a setup issue?

We don't support python 3 anymore. If you are using python 2, that might be the issue.

@hjruscheweyh did you see this before?

Xentrics commented 2 years ago

Just for completeness sake: no, I used python 3.6.5. It also appear with any of the test genomes (and the 3 additional ones that I try to add).

Could this be an issue with the reference database? I got the 2.5.1 database from https://zenodo.org/record/3366460#.YnzBY2pBxB8. I will try 2.5.0 in the meantime, just to check.

AlessioMilanese commented 2 years ago

Can you post here the call that you do?

Xentrics commented 2 years ago

Yes. After setting up the directories, installing conda environment & installing mOTUs, I set the motus directory & loop through the genomes:

source activate update_mOTUs_v2
MOTUS_DIR=/sbidata/seelbind/Projects/MS_ICU_Bauer/db/extend_mOTUs_DB/mOTUs_v2/
for i in $(cat extend_mOTUs_DB/TEST/genomes.list); do ./extend_mOTUs_DB/SCRIPTS/extend_mOTUs_addGenome.sh extend_mOTUs_DB/TEST/genomes/$i.fasta $i newdbfolder extend_mOTUs_DB/SCRIPTS/ $MOTUS_DIR; done

Or, a shorter example with one test genome:

i=DJUC00000000
./extend_mOTUs_DB/SCRIPTS/extend_mOTUs_addGenome.sh extend_mOTUs_DB/TEST/genomes/$i.fasta $i newdbfolder extend_mOTUs_DB/SCRIPTS/ $MOTUS_DIR

The full output log is this:

++ sequence_file=extend_mOTUs_DB/TEST/genomes/DJUC00000000.fasta
++ genome_ID=DJUC00000000
++ new_database_folder=newdbfolder
++ scriptDir=extend_mOTUs_DB/SCRIPTS/
++ mOTU_folder=/sbidata/seelbind/Projects/MS_ICU_Bauer/db/extend_mOTUs_DB/mOTUs_v2/
++ cutoffsFile=extend_mOTUs_DB/SCRIPTS//cutoffs_fscore_specIAsRef.csv
++ mOTU_MG_file=extend_mOTUs_DB/SCRIPTS//10RefMGs.IDs
++ unpadded_mOTUseqsFile=/sbidata/seelbind/Projects/MS_ICU_Bauer/db/extend_mOTUs_DB/mOTUs_v2//db_mOTU/db_mOTU_DB_NR.fasta
++ '[' '!' -f /sbidata/seelbind/Projects/MS_ICU_Bauer/db/extend_mOTUs_DB/mOTUs_v2//db_mOTU/db_mOTU_DB_NR.fasta.udb ']'
++ mkdir -p newdbfolder/dbs/
++ echo 'python extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py extend_mOTUs_DB/TEST/genomes/DJUC00000000.fasta newdbfolder/dbs/DJUC00000000 DJUC00000000 -f extend_mOTUs_DB/SCRIPTS//fetchMG/'
python extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py extend_mOTUs_DB/TEST/genomes/DJUC00000000.fasta newdbfolder/dbs/DJUC00000000 DJUC00000000 -f extend_mOTUs_DB/SCRIPTS//fetchMG/
++ python extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py extend_mOTUs_DB/TEST/genomes/DJUC00000000.fasta newdbfolder/dbs/DJUC00000000 DJUC00000000 -f extend_mOTUs_DB/SCRIPTS//fetchMG/
Running prodigal...
prodigal -a newdbfolder/dbs/DJUC00000000/DJUC00000000.all.proteins.fasta -d newdbfolder/dbs/DJUC00000000/DJUC00000000.all.genes.fasta -f gff -o newdbfolder/dbs/DJUC00000000/DJUC00000000.all.genes.gff -c -q -m  -p meta -i extend_mOTUs_DB/TEST/genomes/DJUC00000000.fasta
...done
Running fetchMGs...
extend_mOTUs_DB/SCRIPTS//fetchMG//fetchMG.pl -p -m extraction -t 1 -o newdbfolder/dbs/DJUC00000000/DJUC00000000_markerGenes newdbfolder/dbs/DJUC00000000/DJUC00000000.all.proteins.fasta

===================================================================================
         FetchMGs v1.0 - extraction of marker genes from protein sequences
          Copyright (c) 2012 Shinichi Sunagawa, Daniel R Mende, and EMBL
===================================================================================

Parsing cutoffs... done
Running HMMsearch... done
Processing results...
2330 entries from file newdbfolder/dbs/DJUC00000000/DJUC00000000.all.proteins.fasta were indexed in file newdbfolder/dbs/DJUC00000000/DJUC00000000.all.proteins.fasta.cidx
Processing completed!

Extracted sequences are saved in the directory: newdbfolder/dbs/DJUC00000000/DJUC00000000_markerGenes
...done
Parsing GFF file.
Traceback (most recent call last):
  File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 451, in <module>
    status = main()
  File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 446, in main
    runMakePaddedMG_DB(args.contigsFasta, args.outfolder, args.runPrefix, args.gffFileName, args.contigCoordsFileName, args.contigLengthsFileName,args.padLength, args.geneNamesFileName, args.proteinFasta, transTableNumber, args.metagenomicMode, numThreads, args.OGType, args.fetchMGfolder)
  File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 413, in runMakePaddedMG_DB
    getPaddedSequences(extractLocationsFileName, contigsFasta, paddedFastaFileName, dictExtractName2writeName)
  File "extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py", line 312, in getPaddedSequences
    if (line.startswith('>')):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
AlessioMilanese commented 2 years ago

From the issue you get, it seems you are working on a gzip file. The problematic function is this:

def getPaddedSequences(extractLocationsFileName, contigsFasta, paddedFastaFileName, dictExtractName2writeName): 

    paddedFastaFile = open(paddedFastaFileName, "w")

    cat = subprocess.Popen(['cat', extractLocationsFileName], 
                            stdout=subprocess.PIPE,
                            )

    samtools = subprocess.Popen(['xargs', 'samtools', 'faidx', contigsFasta],
                            stdin=cat.stdout,
                            stdout=subprocess.PIPE,
                            )

    output = samtools.stdout

    for line in output:
        if (line.startswith('>')):  #  <------------------------------------------- ISSUE HERE
            line = line.strip()
            recordName = line.replace('>', '')
            newRecordName = dictExtractName2writeName[recordName].pop()
            newRecordName = newRecordName.strip()
            paddedFastaFile.write(">" + newRecordName + "\n")
        else:
            paddedFastaFile.write(line)

You could try to change the code in extend_mOTUs_DB/SCRIPTS//makePaddedMGdb.py at line 312, and add line = line.decode("utf-8"). Like:

    for line in output:
                line = line.decode("utf-8")  #  <--------------------------------- ADDED LINE
        if (line.startswith('>')):  #  <------------------------------------------- ISSUE HERE

But it might be that there is a another problem underneath. Maybe with samtools? (since output = samtools.stdout)

Xentrics commented 2 years ago

YES! That worked. I will check the rest of the test script, for completeness sake. But the script finished without further issue :)