nlapier2 / Metalign

Metalign: efficient alignment-based metagenomic profiling via containment min hash
MIT License

Non-microbial organisms in training database #30

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

There are a number of non-microbial organisms in the training database. This is significantly slowing down the training step, as CMash was designed with small microbial organisms in mind. For example, I find a lot of Eukaryota (plants and the like):

| taxonomy | file name | compressed file size |
|---|---|---|
| Eukaryota,Viridiplantae,Streptophyta | taxid_69332_genomic.fna.gz | 457M |
| Eukaryota,Sar,Alveolata | taxid_1563115_genomic.fna.gz | 234M |
| Eukaryota,Sar,Alveolata | taxid_2951_genomic.fna.gz | 229M |
| Eukaryota,Sar,Alveolata | taxid_1563116_genomic.fna.gz | 207M |
| Eukaryota,Sar,Alveolata | taxid_1280413_genomic.fna.gz | 190M |
| Eukaryota,Sar,Stramenopiles | taxid_88149_genomic.fna.gz | 169M |
| Eukaryota,Opisthokonta,Fungi | taxid_44941_genomic.fna.gz | 163M |
| Eukaryota,Sar,Alveolata | taxid_1172189_2_genomic.fna.gz | 145M |
| Eukaryota,Sar,Stramenopiles | taxid_4781_0_genomic.fna.gz | 106M |
| Eukaryota,Rhodophyta,Florideophyceae | taxid_38544_genomic.fna.gz | 104M |
| Eukaryota,Sar,Stramenopiles | taxid_162140_1_genomic.fna.gz | 93M |
| Eukaryota,Opisthokonta,Fungi | taxid_462795_0_genomic.fna.gz | 92M |
| Eukaryota,Sar,Stramenopiles | taxid_162130_1_genomic.fna.gz | 91M |
| Eukaryota,Viridiplantae,Chlorophyta | taxid_3046_genomic.fna.gz | 89M |
| Eukaryota,Viridiplantae,Chlorophyta | taxid_36881_genomic.fna.gz | 83M |

In case you're interested in reproducing, this was done with ETE3 via:

paste -d'|'  <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15 | cut -d'_' -f2 | xargs -I{} sh -c "ete3 ncbiquery --search {} --info | cut -d',' -f3-5 | sed -n 2p") <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15) <(ls -S /data/dmk333/repos/Metalign/data/organism_files | head -n 15 | xargs -I{} sh -c "du -sh /data/dmk333/repos/Metalign/data/organism_files/{} | cut -f1") | sed 's/^/|/g' | sed 's/$/|/g'

Given that the median compressed file size in organism_files is 1.012MB, these are definitely outliers.

Check median via:

find /data/dmk333/repos/Metalign/data/organism_files -name "*.gz" | xargs -I{} du -s {} | sort -n | awk -f median.awk

with median.awk:

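#!/usr/bin/awk -f
# Print the median of the first column.
# Input must already be sorted numerically (e.g. via `sort -n`).
{
    # Store each value by its (sorted) line number.
    count[NR] = $1;
}
END {
    if (NR % 2) {
        # Odd number of values: take the middle one.
        print count[(NR + 1) / 2];
    } else {
        # Even number of values: average the two middle ones.
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}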
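
The same median idea can be used to list the outliers directly. The following is only a rough sketch, not part of Metalign: `list_size_outliers` is a hypothetical helper name, the default factor of 50 is an arbitrary assumption, and it measures sizes in bytes with `wc -c` rather than in `du` blocks, so the numbers will differ slightly from the ones above. It also assumes file names contain no tabs or newlines.

```shell
# Hypothetical helper (not part of Metalign): print "size<TAB>path" for every
# *.gz file under a directory whose size in bytes exceeds FACTOR times the
# median size. FACTOR defaults to 50, which is an arbitrary assumption.
list_size_outliers() {
    outlier_dir=$1
    outlier_factor=${2:-50}
    # Emit "bytes<TAB>path" per file, then sort numerically by size.
    find "$outlier_dir" -type f -name '*.gz' \
        -exec sh -c 'for f; do printf "%s\t%s\n" "$(wc -c < "$f")" "$f"; done' sh {} + \
        | sort -n \
        | awk -F'\t' -v factor="$outlier_factor" '
            { size[NR] = $1 + 0; name[NR] = $2 }
            END {
                if (NR == 0) exit
                # Median of the sorted sizes, as in median.awk.
                median = (NR % 2) ? size[(NR + 1) / 2] \
                                  : (size[NR / 2] + size[NR / 2 + 1]) / 2
                for (i = 1; i <= NR; i++)
                    if (size[i] > factor * median)
                        print size[i] "\t" name[i]
            }'
}
```

Usage would look like `list_size_outliers /data/dmk333/repos/Metalign/data/organism_files 50`, with the factor adjusted to taste.
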
nlapier2 commented 4 years ago

I thought I had excluded Viridiplantae, but I kept non-animal/plant eukaryote clades (e.g. SAR and other Protists) intentionally. Do we not want those? I'm a bit reluctant to throw out all Eukaryotes, which would mean excluding common clades like Protozoa and yeasts.
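
For illustration, a clade-level exclusion of this kind could be sketched against the pipe-delimited rows (`|lineage|file|size|`) that the `paste`/`sed` command earlier in the thread produces. This is an assumption-laden sketch rather than how Metalign builds its database: `drop_clades` is a hypothetical name, and it only sees the lineage fields present in each row (fields 3 to 5 of the `ete3 ncbiquery --info` output in that command), so a clade sitting at another depth would not be caught.

```shell
# Hypothetical filter (not part of Metalign): read rows of the form
# "|lineage|file|size|" on stdin and drop every row whose comma-separated
# lineage contains one of the clade names given as a comma-separated list.
drop_clades() {
    awk -F'|' -v excluded="$1" '
        BEGIN {
            # Build a lookup set of the clade names to exclude.
            n = split(excluded, names, ",")
            for (i = 1; i <= n; i++) skip[names[i]] = 1
        }
        {
            # $1 is empty because each row starts with "|"; $2 is the lineage.
            m = split($2, lineage, ",")
            for (i = 1; i <= m; i++) {
                gsub(/^ +| +$/, "", lineage[i])   # tolerate padded cells
                if (lineage[i] in skip) next      # excluded clade: drop row
            }
            print
        }'
}
```

For example, piping the table rows through `drop_clades Viridiplantae` would remove the plant entries while keeping Sar, Fungi and Rhodophyta.
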