mmatschiner / tutorials

Tutorials on phylogenetic and phylogenomic inference

ERROR: Alignment ids have changed after mafft refinement! #6

Closed Yedomon closed 4 years ago

Yedomon commented 4 years ago

Hello Dr,

I am using find_orthologs.py (https://github.com/mmatschiner/continental/blob/154d2ef4afa74ffb0a2122d56ae8af79b33461ef/ortholog_identification/src/find_orthologs.py) with the following commands:

```
# 1st step: prepare BLAST databases for all assembly files.
for i in *.fasta; do
    makeblastdb -in "${i}" -dbtype nucl
done

# 2nd step: list the subject assemblies, ensuring that the fos lycopersici
# assembly, which will serve as a reference, is listed first.
ls lycopersici_4287.fasta > subjects.txt
ls *.fasta | grep -v lycopersici_4287 >> subjects.txt

# 3rd step: run find_orthologs.py for each protein query.
for i in *_protein.fasta; do
    python3 find_orthologs.py -t -s 1 --refine "${i}" subjects.txt
done
```

Here is an example of the error message at the end.

find_orthologs.py

Settings:
  query: S288C_YDL090C_RAM1_protein.fasta (1 sequence)
  subject: lycopersici_4287.fasta
  subject: cepae_MRCU01.1.fasta
  subject: cubense_VMNF01.1.fasta
  subject: human_32931.fasta
  subject: lyco_D11_RBXW01.1.fasta
  subject: matthiodae_WJXY01.1.fasta
  subject: melonis_NJCY01.1.fasta
  subject: nicotinae_NJBY01.1.fasta
  subject: pisi_illumina_AGBI01.1.fasta
  subject: radicis_cucumerinum_MABQ02.1.fasta
  subject: radicis_lycoAGNB01.1.fasta
  subject: vasinfectum_pacbio_VINM01.1.fasta
  subject: verticiloides_ASM14955v1_genomic.fasta
  strictness: 1
  evalue: -
  bitscore: -
  translate: yes
  genetic_code: standard
  window_length: -
  window_shift: -
  refine: yes
  minimum_completeness: -
  write_empty_files: no
  output_format: fasta
  overwrite: no

Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject lycopersici_4287.fasta... done in 1.403176 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject lycopersici_4287.fasta to retrieve nucleotide sequence... done in 0.401539 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject cepae_MRCU01.1.fasta... done in 1.443767 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject cepae_MRCU01.1.fasta to retrieve nucleotide sequence... done in 0.382243 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject cubense_VMNF01.1.fasta... done in 1.27805 seconds.
Found hit(s) with sufficient bitscore (96.3).
Reading subject cubense_VMNF01.1.fasta to retrieve nucleotide sequence... done in 0.264573 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject human_32931.fasta... done in 1.2524 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject human_32931.fasta to retrieve nucleotide sequence... done in 0.275367 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject lyco_D11_RBXW01.1.fasta... done in 1.493819 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject lyco_D11_RBXW01.1.fasta to retrieve nucleotide sequence... done in 0.460043 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject matthiodae_WJXY01.1.fasta... done in 1.480662 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject matthiodae_WJXY01.1.fasta to retrieve nucleotide sequence... done in 0.182673 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject melonis_NJCY01.1.fasta... done in 1.557427 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject melonis_NJCY01.1.fasta to retrieve nucleotide sequence... done in 0.377305 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject nicotinae_NJBY01.1.fasta... done in 1.350107 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject nicotinae_NJBY01.1.fasta to retrieve nucleotide sequence... done in 0.125133 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject pisi_illumina_AGBI01.1.fasta... done in 1.435069 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject pisi_illumina_AGBI01.1.fasta to retrieve nucleotide sequence... done in 0.199704 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject radicis_cucumerinum_MABQ02.1.fasta... done in 1.377043 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject radicis_cucumerinum_MABQ02.1.fasta to retrieve nucleotide sequence... done in 0.461537 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject radicis_lycoAGNB01.1.fasta... done in 1.319515 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject radicis_lycoAGNB01.1.fasta to retrieve nucleotide sequence... done in 0.12569 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject vasinfectum_pacbio_VINM01.1.fasta... done in 1.308939 seconds.
Found hit(s) with sufficient bitscore (97.1).
Reading subject vasinfectum_pacbio_VINM01.1.fasta to retrieve nucleotide sequence... done in 0.371869 seconds.
Running blast searches for query RAM1 YDL090C SGDID:S000002248 in subject verticiloides_ASM14955v1_genomic.fasta... done in 1.125272 seconds.
Found hit(s) with sufficient bitscore (97.8).
Reading subject verticiloides_ASM14955v1_genomic.fasta to retrieve nucleotide sequence... done in 0.250498 seconds.
Producing alignment... done in 0.001559 seconds.
Refining the alignment with mafft...ERROR: Alignment ids have changed after mafft refinement!
(before: RAM1 YDL090C SGDID:S000002248, lycopersici_4287[&sseqid=QESU01000016.1,bitscore=97.1,nhits=1], cepae_MRCU01.1[&sseqid=MRCU01000003.1,bitscore=97.1,nhits=1]...; after:RAM1_YDL090C_SGDID:S000002248, lycopersici_4287[&sseqid=QESU01000016.1,bitscore=97.1,nhits=1], cepae_MRCU01.1[&sseqid=MRCU01000003.1,bitscore=97.1,nhits=1]...)

How can I fix this issue since the alignment file is not written?

Thanks in advance for your help.

Sincerely Yours.

mmatschiner commented 4 years ago

Hi,

I think the script does not expect spaces in sequence IDs. Can you remove them before running the script? I'm also not sure whether the colons and commas in the sequence IDs might lead to problems sooner or later, so perhaps remove these as well.
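For readers hitting the same error, here is a minimal sketch of one way to clean the query IDs before running the script. It is not part of find_orthologs.py. The helper names `sanitize_id` and `sanitize_fasta` are hypothetical, the amino acid sequence in the example is a placeholder, and replacing the offending characters with underscores (rather than deleting them) is an assumption.

```python
import re


def sanitize_id(header: str) -> str:
    """Replace runs of whitespace, colons and commas with one underscore.

    Hypothetical helper; using underscores as the replacement is an assumption.
    """
    cleaned = re.sub(r"[\s:,]+", "_", header.strip())
    return cleaned.strip("_")


def sanitize_fasta(text: str) -> str:
    """Rewrite every header line (starting with '>') of a FASTA string."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + sanitize_id(line[1:]))
        else:
            out.append(line)  # sequence lines are left untouched
    return "\n".join(out) + "\n"


# The ID is taken from the error message above; the sequence is a placeholder.
example = ">RAM1 YDL090C SGDID:S000002248\nMRQRVGRSIA\n"
print(sanitize_fasta(example), end="")
```

With IDs cleaned this way, the names going into the mafft refinement should no longer contain the spaces that the error message shows being turned into underscores.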

Yedomon commented 4 years ago

After removing the commas, spaces, and colons, I shortened the IDs to 6 characters. That works perfectly. Thank you very much, Dr. Sincerely yours.
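The thread does not say how the IDs were shortened. As a rough sketch, assuming simple truncation to the first six characters, it may be worth checking that the shortened IDs stay unique, since two long IDs can share a prefix. The function name `shorten_ids` and the `length` parameter are hypothetical.

```python
def shorten_ids(ids, length=6):
    """Truncate each ID to `length` characters and fail on collisions.

    Hypothetical helper; truncation is only one possible shortening scheme.
    """
    seen = {}  # short ID -> full ID it came from
    for full in ids:
        key = full[:length]
        if key in seen:
            raise ValueError(
                f"{full!r} and {seen[key]!r} both shorten to {key!r}"
            )
        seen[key] = full
    # Return a mapping from each full ID to its shortened form.
    return {full: key for key, full in seen.items()}


mapping = shorten_ids(["RAM1_YDL090C_SGDID_S000002248"])
print(mapping)
```

Raising an error on a collision, instead of silently overwriting, keeps two different sequences from ending up under the same ID in the alignment.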