Detect if multiple genomes are present

LeeBergstrand commented 9 months ago

@jmtsuji Is there any quick read-level way of detecting if there is a high likelihood of multiple genomes being present before assembly? If my employers commercialize this, there is a high probability that people will send in samples they may not know are not pure cultures. I've seen this multiple times before in my career. Clients claim they have a pure culture, but it consists of multiple organisms. You could do something like a Kracken taxonomy. This isn't necessarily a feature we should add to the pipeline right now (e.g. Kracken takes a lot of RAM). I just wanted to discuss options with you. Is there a more straightforward way to perform this task? If it has multiple genomes, the user gets a warning and can then swap to running the data using a metagenome assembler like ATLAS.

LeeBergstrand commented 9 months ago

I guess checkm or gtdbtk would also give you this after assembly?

jmtsuji commented 9 months ago

@LeeBergstrand Nice idea! I've already run into this problem once, ha ha. I think the CheckM contamination statistic is probably the most straightforward way to identify mixed populations in the sample.

On a related note: it should be possible to add a simple genome binning step to rotary as an optional config param, if we want to. Rotary can technically already handle metagenome data until the genome annotation step (if flye is set to 'meta' mode in the config). We would just need to add simple binning rules (e.g., that map reads from the single input sample, not doing multi-sample mapping, to keep things simple) and then modify the annotation rules so they include an extra wildcard for the genome ID. Do you think an optional genome binning step might be a helpful addition to rotary in the longer term? If so, we could add it as a long-term feature request. I agree that, for now, it would be better to just refer users to a different tool if it looks like their sample contains mixed populations.

LeeBergstrand commented 9 months ago

@jmtsuji, I'm mostly thinking about this from a high-throughput or sequencing-center perspective. I'm talking about tens or hundreds of genomes per month, genomes of unknown origin, genomes of unknown quality level, and full automation from BaseSpace cloud storage up. I'm thinking about doing things completely hands-off. From that perspective, detecting a sample containing multiple genomes before wasting all the computing time on the assembly and annotation is beneficial. Having multiple tools could potentially lead to errors in annotation pipelines, etc. One way to check for multiple genomes beforehand would be to run the pre-assembled reads through Kracken or Kaiju before assembly. These tools can scan through all the reads in only a few minutes (but require lots of RAM), build you an abundance table, and then you can use abundance cutoffs to determine whether the sample contains a "metagenome" or not. You could also do some kmer spectrum or GC content spectrum stuff (if you plot GC content curves, sometimes you will get two peaks if there are two organisms). There are also wet lab ways, such as doing a 16S run for quality assurance before extracting the genomic DNA. Some of these ideas might be out of scope for this pipeline.

Overall, genome binning is a potential long-term feature. I would wait until the high-accuracy nanopore kits are widely circulated. Let us focus on doing single genomes for now with a checkm warning.

jmtsuji commented 8 months ago

@LeeBergstrand Thanks for the additional context! Yes, I agree that a quick pre-screen could be done with Kracken or Kaiju. My one concern is that I think the pre-screen will be ineffective for highly novel organisms (e.g., that belong to a novel phylogenetic order, class, etc.). For novel organisms, the sample might look "contaminated" just because the reads don't have good taxonomic assignment (and so are mapped to many different taxa). I wonder if you could use good scoring thresholds with Kracken/Kaiju to detect if the organism is novel, though. If so, then you could let samples with novel organisms pass through the check. Even if you do this initial check, I think it would still be good to do a second check downstream after assembly (e.g., using CheckM) for a higher-quality view of whether you have contamination.

Re: genome binning: sounds good to keep this in mind as a potential long-term feature, but no need to work on it for now.

Also, sounds good to add a contamination warning based on CheckM stats to rotary for now. Kracken/Kaiju (and the other ideas you mentioned like k-mer frequency and GC content) could be interesting to consider for the future or could be a separate pipeline run before rotary, when doing things high-throughput. Sounds good?

LeeBergstrand commented 8 months ago

@LeeBergstrand Thanks for the additional context! Yes, I agree that a quick pre-screen could be done with Kracken or Kaiju. My one concern is that I think the pre-screen will be ineffective for highly novel organisms (e.g., that belong to a novel phylogenetic order, class, etc.).

You could small brain this even more than what you suggested. I have code that converts kaiju output to an observation counts table (e.g. an ASV table). For a passing sample, you are looking for over 90% of the reads to belong to a single taxonomic clade rather than 50% from one clade and 50% from another clade or 30%/30%/30% for three organisms. How likely is it that the contaminating species will be in the same species or genus as the primary species? I also assume that contaminating species are more likely to have been sequenced before and are thus easily detectable and differentiable from the primary species. So my approach would be to run Kaiju, do some taxonomy collapsing and then throw a warning if a second clade is at an abundance of more than 10% of the reads or some other cutoff. If the reads from the second organism are less than 10%, then they will fail to assemble and will likely get filtered out by the contig filter. My biggest concern is when you get a culture with a 50%/50% abundance of two organisms because that will cause a lot of misassemblies and potentially a chimeric genome without you knowing.

LeeBergstrand commented 8 months ago

Even if you do this initial check, I think it would still be good to do a second check downstream after assembly (e.g., using CheckM) for a higher-quality view of whether you have contamination.

Agreed

Also, it sounds good to add a contamination warning based on CheckM stats to rotary for now.

I agree; this is an excellent first step.

could be a separate pipeline run before rotary

This is something I am considering. Run a check and then run a metagenomics pipeline (or Rotary with binning) if it fails the check.

LeeBergstrand commented 8 months ago

I agree that, for now, it would be better to just refer users to a different tool if it looks like their sample contains mixed populations.

@jmtsuji I agree with this I think its import to keep the scope narrow for now.

LeeBergstrand commented 8 months ago

@jmtsuji My main concern with the short-read classification approach is that it needs a lot of RAM. There might be simpler ways to do this.

jmtsuji commented 8 months ago

I have code that converts kaiju output to an observation counts table (e.g. an ASV table). For a passing sample, you are looking for over 90% of the reads to belong to a single taxonomic clade rather than 50% from one clade and 50% from another clade or 30%/30%/30% for three organisms. How likely is it that the contaminating species will be in the same species or genus as the primary species? I also assume that contaminating species are more likely to have been sequenced before and are thus easily detectable and differentiable from the primary species. So my approach would be to run Kaiju, do some taxonomy collapsing and then throw a warning if a second clade is at an abundance of more than 10% of the reads or some other cutoff. If the reads from the second organism are less than 10%, then they will fail to assemble and will likely get filtered out by the contig filter. My biggest concern is when you get a culture with a 50%/50% abundance of two organisms because that will cause a lot of misassemblies and potentially a chimeric genome without you knowing.

Nice idea to check by relative abundances using an observation table format. On my end, one key question is how to define the "clades" that we would collapse the short read taxonomies to. For some of the novel organisms I grow, if I remember correctly, you see a spread of different taxonomies that are assigned to the short reads. Most of these taxonomic classifications might be from the same phylum, but they might have contrasting taxonomy names at lower ranks. (This is just from memory, though, and I don't remember what tool I used...) If the "clades" are just collapsed at the lowest taxonomic ranks, then highly novel organisms might look like highly contaminated cultures. To address this issue, I wonder if you could filter the short read hits to some scoring (e.g., e-value) cutoff beforehand and say that samples with a certain % if reads with poor e-values can bypass the short read purity check? (I can send you a genome if you'd like to try this out with a test sample sometime)

Agreed, 50-50 could potentially cause issues with assembly, especially if some DNA regions are highly similar between the two organisms in the culture.

Re:

My main concern with the short-read classification approach is that it needs a lot of RAM. There might be simpler ways to do this.

I wonder if read classification tools based on long reads might not have the same high RAM requirements as the short read tools? I imagine the long read versions might not be kmer-based in the same way. But also not sure about the runtimes of these tools -- I haven't tested them much before. (I guess you could take a subset of your read data to shorten runtimes)

LeeBergstrand commented 8 months ago

Most of these taxonomic classifications might be from the same phylum, but they might have contrasting taxonomy names at lower ranks.

Kaiju works at the protein level. In my experience, when Kaiju fails to find a good match, it will bump up the individual classification of a read to a higher taxonomic level. I assume that with a novel organism at 50% abundance and a contaminant organism at 50% abundance, you would get 50% of the reads only classified to phylum or class level (the novel organism), and the others would be classified down to family or genus level (the contaminant). I assume the contaminants will have more data about them in NCBI databases. You could still use that split, highly classified vs less classified, to determine if more than one organisms is present. But yes, you might need to be very aggressive with the kaiju cut-offs for this to ensure it doesn't eagerly misclassify. I'll check in with you on this in the next call.

jmtsuji commented 8 months ago

Like we discussed today:

Let's add a CheckM2-based warning about contamination to the rotary pipeline
Pre-assembly checks for multiple genomes (e.g., using Kaiju) is out of scope, at least for the rotary MVP. This might work better as a standalone pre-run QC pipeline.

I will close this issue and open one specific for the CheckM2 warning. Thanks for this discussion!

rotary-genomics / rotary

Detect if multiple genomes are present #85