lbobay / CoreCruncher

Fast tool to build the core genome of prokaryotic genomes. Can handle large datasets. Dependencies: Usearch & Numpy

alignment error with muscle #4

Closed bharat1912 closed 1 year ago

bharat1912 commented 1 year ago

Hi, thank you for CoreCruncher.

I have been using CoreCruncher with protein sequences translated automatically by Prokka. Alignment with muscle (v3.8.31) fails, but mafft runs to completion, so the muscle version may be the issue, at least for my data.

I used the following parameters:

$ python CoreCruncher/corecruncher_master.py -in proteins_A/ -out corecruncher_analysis/corecruncher_output_80/ -score 80 -align muscle -ref Cal_Anoxybacillus_rupensis_GAB.faa

The analysis terminates with the following:

######################################
Final core genome= 232 genes
ouput written in corecruncher_analysis/corecruncher_output_80/
The file 'families_core.txt' contains the list of orthologous genes
The directory 'core' contains the sequences each orthologous gene
Reference genome used for the analysis= Cal_Anoxybacillus_rupensis_GAB.faa
######################################
Launch alignments with muscle
Traceback (most recent call last):
  File "/home/bharat/opt/CoreCruncher/align.py", line 19, in <module>
    sub = int(version.split(".")[0][-1])
TypeError: a bytes-like object is required, not 'str'

Subsequently, the concat script also fails because no alignment is found:

Traceback (most recent call last):
  File "/home/bharat/opt/CoreCruncher/concat.py", line 126, in <module>
    while i < len(concat[sp][st]): # MODIF
KeyError: 'Cal_Anoxybacillus_rupensis_GAB.faa'

I also tried another reference sequence, but the same error occurs.

Thanks

lbobay commented 1 year ago

Hi, Thanks for reporting the issue. The problem is due to the new version of muscle, which now uses different arguments. I thought I fixed the issue but apparently not! Could you tell me which operating system you are working with and which version of muscle you have on your computer? In addition, could you type muscle --version in your terminal and send me exactly what it returns to you. Thanks a lot.
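For illustration, the difference in arguments between the two muscle generations can be sketched as below. This is a minimal sketch and not the actual code of align.py: the function name `build_muscle_command` and its parameters are hypothetical, and the only flags used are the ones shown in the usage messages quoted later in this thread (`-in`/`-out` for muscle 3.8.31, `-align`/`-output` for muscle 5.1). Treating every major version below 5 like 3.x is an assumption.

```python
def build_muscle_command(major_version, infile, outfile, exe="muscle"):
    """Return the argument list for a basic alignment run.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if major_version >= 5:
        # muscle 5 usage: muscle -align input.fa -output aln.afa
        return [exe, "-align", infile, "-output", outfile]
    # muscle 3 usage: muscle -in <inputfile> -out <outputfile>
    # (assumption: anything older than 5 accepts the 3.x flags)
    return [exe, "-in", infile, "-out", outfile]


if __name__ == "__main__":
    print(build_muscle_command(3, "family1.fa", "family1.aln"))
    print(build_muscle_command(5, "family1.fa", "family1.aln"))
```

Returning a list rather than a single string lets the command be passed directly to `subprocess.run` without shell quoting problems.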

bharat1912 commented 1 year ago

Thanks

  1. I use a desktop with Ubuntu 18.04, conda 23.3.1 and Python 3.10.10. I run CoreCruncher from the conda base environment.

  2. Since your response, I have tested two muscle versions on my test datasets. Both run without any issues on their own, but both give the same error at align.py, as shown below:

Launch alignments with muscle
Traceback (most recent call last):
  File "/home/bharat/opt/CoreCruncher/align.py", line 19, in <module>
    sub = int(version.split(".")[0][-1])
TypeError: a bytes-like object is required, not 'str'

  3. The output from muscle 3.8.31

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

Basic usage

    muscle -in <inputfile> -out <outputfile>

Common options (for a complete list please see the User Guide):

    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags

  4. The output from muscle 5.1

muscle 5.1.linux64 [12f0e2]  65.9Gb RAM, 32 cores
Built Jan 13 2022 23:17:13
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Align FASTA input, write aligned FASTA (AFA) output:
    muscle -align input.fa -output aln.afa

Align large input using Super5 algorithm if -align is too expensive,
typically needed with more than a few hundred sequences:
    muscle -super5 input.fa -output aln.afa

Single replicate alignment:
    muscle -align input.fa -perm PERM -perturb SEED -output aln.afa
    muscle -super5 input.fa -perm PERM -perturb SEED -output aln.afa
PERM is guide tree permutation none, abc, acb, bca (default none).
SEED is perturbation seed 0, 1, 2... (default 0 = don't perturb).

Ensemble of replicate alignments, output in Ensemble FASTA (EFA) format,
EFA has one aligned FASTA for each replicate with header line "<PERM.SEED":
    muscle -align input.fa -stratified -output stratified_ensemble.efa
    muscle -align input.fa -diversified -output diversified_ensemble.afa

-replicates N
    Number of replicates, defaults 4, 100, 100 for stratified,
      diversified, resampled. With -stratified there is one
      replicate per guide tree permutation, total is 4 x N.

Generate resampled ensemble from existing ensemble by sampling columns with replacement:
    muscle -resample ensemble.efa -output resampled.efa

-maxgapfract F
    Maximum fraction of gaps in a column (F=0..1, default 0.5).

-minconf CC
    Minimum column confidence (CC=0..1, default 0.5).

If ensemble output filename has @, then one FASTA file is generated for each
replicate where @ is replaced by perm.s, otherwise all replicates are written
to one EFA file.

Calculate disperson of an ensemble:
    muscle -disperse ensemble.efa

Extract replicate with highest total CC (diversified input recommended):
    muscle -maxcc ensemble.efa -output maxcc.afa

Extract aligned FASTA files from EFA file:
    muscle -efa_explode ensemble.efa

Convert FASTA to EFA, input has one filename per line:
    muscle -fa2efa filenames.txt -output ensemble.efa

Update ensemble by adding two sequences of digits to each replicate, digits
are column confidence (CC) values, e.g. "73" means CC=0.73, "++" is CC=1.0:
    muscle -addconfseqs ensemble.efa -output ensemble_cc.efa

Calculate letter confidence (LC) values, -ref specifies the alignment to
compare against the ensemble (e.g. from -maxcc), output is in aligned FASTA
format with LC values 0, 1 ... 9 instead of letters:
    muscle -letterconf ensemble.efa -ref aln.afa -output letterconf.afa

-html aln.html
    Alignment colored by LC in HTML format.

-jalview aln.features
    Jalview feature file with LC values and colors.

More documentation at:
    https://drive5.com/muscle
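The two banners above start differently ("MUSCLE v3.8.31 ..." versus "muscle 5.1.linux64 ..."), so a script that needs the major version has to cope with both. The following is a rough sketch of one way to do that with a regular expression. The function name `muscle_major_version` is hypothetical, and this is not how align.py is known to do it; the sketch only relies on the first banner line as quoted in this thread.

```python
import re


def muscle_major_version(banner):
    """Extract the major version number from a muscle banner string.

    Hypothetical helper: handles 'MUSCLE v3.8.31 ...' and
    'muscle 5.1.linux64 ...' as quoted in this thread.
    """
    # Case-insensitive program name, optional 'v', then the major number.
    match = re.search(r"muscle\s+v?(\d+)\.", banner, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("could not find a muscle version in the banner")
    return int(match.group(1))


if __name__ == "__main__":
    print(muscle_major_version("MUSCLE v3.8.31 by Robert C. Edgar"))
    print(muscle_major_version("muscle 5.1.linux64 [12f0e2]  65.9Gb RAM, 32 cores"))
```

Raising an error when no version is found makes an unexpected banner visible immediately instead of silently choosing the wrong set of arguments.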

Thanks

lbobay commented 1 year ago

Hi, I have updated the file align.py, which I think will fix the issue. Please replace the old file with the new one in your CoreCruncher folder and let me know whether it works. Thanks a lot

bharat1912 commented 1 year ago

Hi, Thanks for the continued fast response.

I have used two different versions of muscle.

  1. muscle ver 5.1.linux64 [12f0e2] 65.9Gb RAM, 32 cores, Built Jan 13 2022 23:17:13
  2. MUSCLE v3.8.31

I still get an error with both versions, but on a different line:

######################################
Final core genome= 340 genes
ouput written in core_cruncer_muscle_70/
The file 'families_core.txt' contains the list of orthologous genes
The directory 'core' contains the sequences each orthologous gene
Reference genome used for the analysis= Anoxybacillus_flavithermus_strain_AF14.faa
######################################
Launch alignments with muscle
Traceback (most recent call last):
  File "/home/bharat/opt/CoreCruncher/align.py", line 19, in <module>
    sub = int(version.split(".")[0][-1].strip("v"))
TypeError: a bytes-like object is required, not 'str'
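This TypeError is what Python 3 raises when a `bytes` object is split with a `str` separator, which would explain why the message is identical for both muscle versions. The sketch below reproduces it and shows one possible workaround, decoding before splitting. It assumes that `version` in align.py holds the raw bytes captured from the muscle process; the thread does not show how that variable is obtained, and the helper name `major_from_raw` is hypothetical.

```python
# Assumption: the version banner arrives as bytes, e.g. from
# subprocess.check_output(), which returns bytes unless text=True is set.
raw_v3 = b"MUSCLE v3.8.31 by Robert C. Edgar\n"
raw_v5 = b"muscle 5.1.linux64 [12f0e2]  65.9Gb RAM, 32 cores\n"

# Splitting bytes with a str separator reproduces the reported error.
try:
    raw_v3.split(".")
    message = None
except TypeError as err:
    message = str(err)


def major_from_raw(raw):
    """Decode the banner first, then apply the same split as in the traceback."""
    text = raw.decode("utf-8", errors="replace")
    # Text before the first dot ends with the major digit: 'MUSCLE v3' or 'muscle 5'.
    return int(text.split(".")[0][-1])


if __name__ == "__main__":
    print(message)
    print(major_from_raw(raw_v3), major_from_raw(raw_v5))
```

Passing `text=True` to the subprocess call would be an alternative to decoding by hand.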

thanks

bharat1912 commented 1 year ago

This error has been resolved with the revised align.py module.

Thanks