Open dkoslicki opened 4 years ago
I thought I had excluded Viridiplantae, but I kept non-animal/plant eukaryote clades (e.g. SAR and other Protists) intentionally. Do we not want those? I'm a bit reluctant to throw out all Eukaryotes, which would mean excluding common clades like Protozoa and yeasts.
There are a number of non-microbial organisms in the training database. This is significantly slowing down the training step, as CMash was designed with small microbial organisms in mind. For example, I find a lot of Eukaryota (plants and the like):
In case you're interested in reproducing, this was done with ETE3 via:
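The exact snippet is not preserved in this thread. As a rough idea of what such a check can look like, here is a minimal sketch that flags taxids whose NCBI lineage passes through a clade we may want to exclude. The function names (`flag_non_microbial`, `ete3_lineage_lookup`) and the default choice of excluded clades are assumptions for illustration, not the code that was actually run. The lineage lookup is passed in as a callable so the logic can be tried without downloading the NCBI taxonomy.

```python
# NCBI taxonomy IDs for the clades discussed in this thread.
EUKARYOTA = 2759
VIRIDIPLANTAE = 33090
METAZOA = 33208


def flag_non_microbial(taxids, get_lineage, excluded=(VIRIDIPLANTAE, METAZOA)):
    """Return the taxids whose lineage passes through any excluded clade.

    `get_lineage` is any callable mapping a taxid to a list of ancestor
    taxids (hypothetical interface; ETE3's NCBITaxa.get_lineage fits it).
    """
    excluded = set(excluded)
    flagged = []
    for taxid in taxids:
        lineage = get_lineage(taxid) or []
        if excluded.intersection(lineage):
            flagged.append(taxid)
    return flagged


def ete3_lineage_lookup():
    """Build a lineage lookup backed by ETE3 (needs ete3 and its local NCBI taxonomy DB)."""
    from ete3 import NCBITaxa  # imported lazily so the function above has no hard dependency

    ncbi = NCBITaxa()
    return ncbi.get_lineage
```

With ETE3 installed this could be called as `flag_non_microbial(taxids, ete3_lineage_lookup())`. Passing `excluded=(EUKARYOTA,)` would flag every eukaryote instead, including protists and yeasts.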
Given that the median compressed file in organism_files is 1.012MB, these are definitely outliers. Check the median via:
with median.awk:
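The shell command and the awk script are not preserved in this thread. A roughly equivalent check can be sketched in Python, assuming the compressed genomes sit directly inside one directory. The function names and the `factor` threshold are assumptions for illustration, not part of CMash.

```python
import os
import statistics


def file_sizes(directory):
    """Map each regular file directly inside `directory` to its size in bytes."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def median_and_outliers(directory, factor=10):
    """Return (median size in bytes, sorted names of files larger than factor * median).

    `factor` is an arbitrary cutoff chosen for this sketch.
    """
    sizes = file_sizes(directory)
    if not sizes:
        raise ValueError(f"no files found in {directory!r}")
    median = statistics.median(sizes.values())
    outliers = sorted(name for name, size in sizes.items() if size > factor * median)
    return median, outliers
```

Calling `median_and_outliers("organism_files")` returns the median in bytes together with the file names that stand out. Dividing by 1e6 or by 2**20 gives MB, depending on which convention the 1.012MB figure used.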