metagenlab / zAMP

zAMP is a bioinformatic pipeline designed for convenient, reproducible and scalable amplicon-based metagenomics
https://zamp.readthedocs.io/en/latest/
MIT License

Database processing with latest SILVA version #42

Open farchaab opened 1 month ago

farchaab commented 1 month ago

When using SILVA v138.1 (wSpecies_train_set) I get an error in Derep_and_merge_taxonomy.

>1
AACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG
TCGAGCGGCAGCACGGGTACTTGTACCTGGTGGCGAGCGGCGGACGGGTGAGTAATGCCT
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;amygdali;
>Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Pectobacteriaceae;Dickeya;phage;
>Bacteria;Actinobacteriota;Actinobacteria;Actinomycetales;Actinomycetaceae;F0332;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;equi;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;porcinus;
>Bacteria;Actinobacteriota;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharomonospora;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;
>Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;[Eubacterium] nodatum group;
>Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Porticoccaceae;Porticoccus;hydrocarbonoclasticus;

Inspecting the log:

[1] ‘1.0.2’

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] ‘0.8.5’
[1] ‘0.20.41’
[1] ‘1.4.0’
[1] ‘0.5.0’
Error in `$<-.data.frame`(`*tmp*`, V2, value = character(0)) :
  replacement has 0 rows, data has 452064
Calls: $<- -> $<-.data.frame
Execution halted

valscherz commented 1 month ago

You are probably already aware, but just in case: I think two differences from EzBioCloud explain the errors:

  1. The genus is not repeated at the species level (to be verified, but EzBioCloud would report: >Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali;)
  2. The number of ranks varies (the species is sometimes missing).
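
A quick way to check the second point is to tally how many ranks each header carries. The sketch below is only an illustration in Python, not the zAMP script; `rank_count` is a hypothetical helper, and the three headers are copied from the excerpt in this issue.

```python
from collections import Counter


def rank_count(header: str) -> int:
    """Count the non-empty ';'-separated ranks in a taxonomy header."""
    return len([r for r in header.lstrip(">").split(";") if r.strip()])


# Headers copied from the excerpt above.
headers = [
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;amygdali;",
    ">Bacteria;Actinobacteriota;Actinobacteria;Actinomycetales;Actinomycetaceae;F0332;",
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;",
]

# Maps "number of ranks" to "number of headers with that many ranks".
counts = Counter(rank_count(h) for h in headers)
print(dict(counts))
```

More than one key in the tally means the headers do not all have the same depth, which would be consistent with the variable number of ranks described above.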

farchaab commented 1 month ago

Hello @valscherz, I am aware of this. I am updating the script to repeat the genus in the species name and to handle missing ranks.
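
For illustration, the two adjustments could look roughly like the Python sketch below. It is not the actual zAMP script: `normalize_taxonomy`, the seven-rank assumption (kingdom to species), and the `"unclassified"` placeholder are all hypothetical choices made for the example. It also assumes that only trailing ranks are missing, as in the headers shown in this issue.

```python
N_RANKS = 7  # assumption: kingdom, phylum, class, order, family, genus, species


def normalize_taxonomy(header: str, n_ranks: int = N_RANKS,
                       placeholder: str = "unclassified") -> str:
    """Repeat the genus in the species name and pad missing trailing ranks.

    `placeholder` is a hypothetical filler value, not something zAMP defines.
    """
    ranks = [r.strip() for r in header.lstrip(">").split(";") if r.strip()]
    if len(ranks) >= n_ranks:
        genus, species = ranks[n_ranks - 2], ranks[n_ranks - 1]
        # Prefix the genus only if the species name does not already carry it.
        if not species.startswith(genus + " "):
            ranks[n_ranks - 1] = f"{genus} {species}"
    ranks = ranks[:n_ranks]
    # Pad headers that stop before the species level.
    ranks += [placeholder] * (n_ranks - len(ranks))
    return ">" + ";".join(ranks) + ";"


full = normalize_taxonomy(
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;"
    "Pseudomonadaceae;Pseudomonas;amygdali;"
)
short = normalize_taxonomy(
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;"
    "Pseudomonadaceae;Pseudomonas;"
)
print(full)
print(short)
```

With these assumptions, the first header ends in `Pseudomonas;Pseudomonas amygdali;`, matching the EzBioCloud-style form quoted earlier in the thread, and the second is padded to seven ranks. A header that lacks an intermediate rank rather than a trailing one would be misaligned by this approach and would need separate handling.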