Extract representative genome from a mOTU

lkalmar commented 1 year ago

Hi,

What would you suggest for extracting a representative genome for meta and ext mOTUs?

E.g., if we download meta_mOTU_v3_12240, we get 4361 genomes (choosing only one of the genomes still downloads all of them, but I saw this is already on your to-do list), and these genomes range from ~800KB to ~4.8MB.

Our plan is to annotate the genomes we found in our metagenomics samples and use the list of genes for further analysis. We have a list of about 2000 mOTUs (1/3 are ref, 2/3 are meta and ext); ideally, we would like to end up with the same number of genomes to annotate (with Prokka).

Should we use the genome whose size is closest to the median / mean genome size in the mOTU?

Thanks in advance for your help

AlessioMilanese commented 1 year ago

Hi,

I would filter genomes based on completeness and contamination (as estimated by CheckM). You can find this information here: https://zenodo.org/record/7146984#.ZBa4qbTMIbk
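A minimal sketch of such a filter, in case it helps. The column names (`genome`, `completeness`, `contamination`) and the toy table are hypothetical; the actual layout of the Zenodo file may differ. The default thresholds (at least 90% complete, at most 5% contaminated) are an assumption borrowed from common "high-quality draft" criteria, not something prescribed here.

```python
import csv
import io


def filter_by_quality(rows, min_completeness=90.0, max_contamination=5.0):
    """Keep only genomes that pass both quality thresholds.

    `rows` is an iterable of dicts with (hypothetical) keys
    'completeness' and 'contamination', both given in percent.
    """
    kept = []
    for row in rows:
        completeness = float(row["completeness"])
        contamination = float(row["contamination"])
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row)
    return kept


# Toy table standing in for the real quality file (assumed layout).
example_tsv = (
    "genome\tcompleteness\tcontamination\n"
    "genome_A\t98.5\t0.4\n"
    "genome_B\t72.0\t1.1\n"
    "genome_C\t95.2\t7.9\n"
)

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
passing = filter_by_quality(rows)
print([row["genome"] for row in passing])  # ['genome_A']
```

For a mOTU where no genome passes, the thresholds could be relaxed per cluster rather than dropping the mOTU entirely.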

Then you could either choose the genome with the best parameters (highest completeness and lowest contamination), or you could choose the genome in the centroid position, that is, the genome with the lowest total distance to all other genomes in the cluster. You could calculate the distances with fastANI or Mash.
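The centroid option can be sketched as below, assuming the pairwise distances have already been computed and parsed into `(genome_a, genome_b, distance)` tuples. The function name `find_medoid` and the toy distances are hypothetical. Since the result is always one of the existing genomes, this is strictly a medoid. If the values come from fastANI, which reports percent identity rather than a distance, they would first need converting (for example `100 - ANI`).

```python
def find_medoid(pairs):
    """Return the genome with the smallest summed distance to all others.

    `pairs` is an iterable of (genome_a, genome_b, distance) tuples.
    Each pair may appear once or in both directions; self-comparisons
    are ignored. Ties are broken alphabetically by genome name.
    """
    distances = {}
    names = set()
    for a, b, d in pairs:
        names.update((a, b))
        if a != b:
            # Key on the unordered pair so A-B and B-A are counted once.
            distances[frozenset((a, b))] = float(d)

    totals = {name: 0.0 for name in names}
    for pair, d in distances.items():
        for name in pair:
            totals[name] += d

    return min(sorted(totals), key=totals.get)


# Toy distances for a three-genome cluster (made-up values).
pairs = [
    ("genome_A", "genome_B", 0.02),
    ("genome_A", "genome_C", 0.03),
    ("genome_B", "genome_C", 0.06),
]
print(find_medoid(pairs))  # genome_A
```

For a cluster with thousands of genomes the all-vs-all comparison is the expensive part, so running it only on the genomes that pass the quality filter keeps the cost down.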

lkalmar commented 1 year ago

Thanks. I was hoping for a solution that doesn't require that much re-processing. One would think that something like this was already done when these clusters / mOTUs were originally formed. It would be nice to have access to that data.

motu-tool / motus_v3_genomes