Open lkalmar opened 1 year ago
Hi,
I would first filter the genomes by completeness and contamination (as estimated by CheckM). You can find this information here: https://zenodo.org/record/7146984#.ZBa4qbTMIbk
Then you could either choose the genome with the best parameters (highest completeness and lowest contamination), or choose the genome in the centroid position, that is, the genome with the lowest total distance to all other genomes in the cluster. You could calculate the distances with fastANI or Mash.
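A minimal sketch of both options in Python, in case it helps. Everything here is illustrative: the function names, the toy genome IDs and values, and the 90% / 5% cutoffs are assumptions, not something taken from the mOTUs database or the Zenodo table. It assumes you have already parsed the CheckM values into a dict and computed pairwise distances yourself (Mash distance, or 100 - ANI if you use fastANI).

```python
# Hypothetical helpers: names, thresholds, and toy data are assumptions for illustration.

def filter_by_quality(quality, min_completeness=90.0, max_contamination=5.0):
    """Keep genomes passing the (assumed) completeness/contamination cutoffs.

    quality maps genome ID -> (completeness %, contamination %).
    """
    return [
        genome
        for genome, (completeness, contamination) in quality.items()
        if completeness >= min_completeness and contamination <= max_contamination
    ]


def best_quality_genome(genomes, quality):
    """Option 1: highest completeness, ties broken by lowest contamination."""
    return max(genomes, key=lambda g: (quality[g][0], -quality[g][1]))


def centroid_genome(genomes, distances):
    """Option 2: genome with the lowest summed distance to all other genomes.

    distances maps frozenset({a, b}) -> distance between genomes a and b,
    e.g. a Mash distance, or 100 - ANI when using fastANI output.
    """
    def total_distance(genome):
        return sum(
            distances[frozenset((genome, other))]
            for other in genomes
            if other != genome
        )

    return min(genomes, key=total_distance)


# Toy example with made-up values.
quality = {
    "g1": (98.5, 0.8),
    "g2": (95.0, 1.2),
    "g3": (72.0, 9.5),  # fails both cutoffs
    "g4": (96.1, 0.5),
}
distances = {
    frozenset(("g1", "g2")): 0.04,
    frozenset(("g1", "g4")): 0.03,
    frozenset(("g2", "g4")): 0.01,
}

kept = filter_by_quality(quality)
best = best_quality_genome(kept, quality)
centroid = centroid_genome(kept, distances)
print("kept:", kept)
print("best quality:", best)
print("centroid:", centroid)
```

Note that the two criteria can pick different genomes, as they do in this toy data, so you would have to decide which one matters more for your annotation. For a cluster with thousands of genomes, the all-vs-all distance step is the expensive part; the selection itself is cheap.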
Thanks. I was thinking of a solution that doesn't require that much re-processing. One would think that something like this was already done when these clusters / mOTUs were originally formed. It would be nice to have access to that data.
Hi,
What would you suggest for extracting a representative genome for meta and ext mOTUs?
E.g., if we download meta_mOTU_v3_12240, we get 4361 genomes (even if we select only one of the genomes, it downloads all of them, but I saw this is already on your to-do list), and these genomes range from ~800 KB to ~4.8 MB.
Our plan is to annotate the genomes we found in our metagenomics samples and use the list of genes for further analysis. We have a list of about 2000 mOTUs (1/3 are ref, 2/3 are meta and ext), and ideally we would like to end up with the same number of genomes to annotate (with Prokka).
Should we use the genome whose size is closest to the median / mean genome size in the mOTU?
Thanks in advance for your help.