peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Mock community: simulating error #98

Open widdowquinn opened 5 years ago

widdowquinn commented 5 years ago

When testing classifier performance against a mock community, several strategies are open to us, including:

  1. Identify a set of sequenced isolates; identify the proportions you want for your mock community (e.g. 1:1:1:1, 1:10:100:1000); subsample reads randomly (with replacement) from the "real" sequence data in the appropriate proportions. This emulates sequencing a "real" sample containing the selected isolates only. With random read sampling, this should contain an appropriate amount of realistic "read noise". (A sketch of this subsampling is given below the figure.)

  2. Define a threshold detection abundance t; identify two sets of sequenced isolates P and Q; let P be the "positives" in our data, and Q the "negatives"; sample n_Pi > t reads from the "real" sequence data for each isolate i in P; sample n_Qi < t reads from the "real" sequence data for each isolate i in Q. Combine these reads. Then we have "positives" and "negatives" at a detection threshold t. This emulates isolates that are present, and isolates that would be considered not detectable at t.

  3. Proceed as in (1); introduce additional "sequence noise" by random substitution/deletions to mimic the extent of noise we see in the synthetic control sequences (see figure). This overlays random noise on top of the already existing experimental noise from the "real" sequence data, and could be considered to represent a particularly bad dataset.

[Figure: distribution of unique sequence abundance against edit distance for the synthetic control reads]
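
For illustration, here's a minimal sketch (in Python, not anything in THAPBI PICT) of the kind of proportional subsampling meant in (1) - the function names and example filenames are made up, and sampling is without replacement within any one community:

```python
import random


def read_fastq(path):
    """Yield (title, sequence, quality) tuples from a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            title = handle.readline().rstrip()
            if not title:
                break
            seq = handle.readline().rstrip()
            handle.readline()  # the "+" separator line
            qual = handle.readline().rstrip()
            yield title, seq, qual


def make_mock_community(isolate_fastqs, proportions, total_reads, out_path, seed=None):
    """Subsample reads from per-isolate FASTQ files in the given proportions.

    Uses random.sample, so no read is drawn twice within one community,
    but the same read may reappear when building further communities.
    """
    rng = random.Random(seed)
    scale = float(sum(proportions))
    with open(out_path, "w") as out:
        for path, weight in zip(isolate_fastqs, proportions):
            reads = list(read_fastq(path))
            wanted = round(total_reads * weight / scale)
            for title, seq, qual in rng.sample(reads, min(wanted, len(reads))):
                out.write(f"{title}\n{seq}\n+\n{qual}\n")


# e.g. a 1:10:100:1000 community of ~100,000 reads from four isolates:
# make_mock_community(["isolate_A.fastq", "isolate_B.fastq",
#                      "isolate_C.fastq", "isolate_D.fastq"],
#                     [1, 10, 100, 1000], 100000,
#                     "mock_1_10_100_1000.fastq", seed=1)
```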

A disadvantage with (2) is that, if we want to see the effect on classifier performance of modifying the abundance detection threshold, our test set has been tuned to t, and a lower detection threshold will make the classifier appear to perform better. A motivated person who believed that lowering the detection threshold would give more positive results may be fooled by this.

The main advantage of (1) is that it's real data with no modifications - but as the samples used for our known isolates were carefully grown up in the lab and had abundant gDNA, this is very much a "best case scenario", and may not reflect the random mankiness and low-abundance performance of environmental samples.

While (3) is an artificial situation, it might - if the error profile was suitably harsh - reasonably reflect additional error over and above that seen for the single-isolate controls that could occur in a "real" environmental sample.

peterjc commented 5 years ago

In case (1), all four species are expected to be reported by the classifier, regardless of whether the ratio is 1:1:1:1 or 1:10:100:1000. This is quite easy with seqtk, although that does not do sampling with replacement (since duplicate read names in FASTQ files tend to break things). However, across multiple communities it would effectively be sampling with replacement (which is probably what you meant).
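
e.g. something along these lines (just a sketch wrapping seqtk's sample subcommand from Python; the wrapper name and defaults are made up):

```python
import subprocess


def seqtk_subsample(in_fastq, out_fastq, n_reads, seed=100):
    """Subsample n_reads reads (without replacement) using seqtk sample.

    Re-using the same seed on the same input reproduces the same subset;
    changing the seed gives a different subset for another community.
    """
    with open(out_fastq, "w") as out:
        subprocess.run(
            ["seqtk", "sample", "-s", str(seed), in_fastq, str(n_reads)],
            stdout=out,
            check=True,
        )
```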

I've been exploring a variant of (2), and as soon as you change the intended threshold t you have a problem defining the expected outcome, because the P vs Q boundary also moves. Note that read count and unique-sequence abundance are only correlated, which can be problematic for interpretation if the mock community is generated using fixed read counts.

Applying (3) does seem to offer a good worst case scenario (and can be run with varying levels of added noise).

widdowquinn commented 5 years ago

Yes, you're right about me messing up "with replacement": your interpretation of replacing between communities is correct. The same read should not be selected twice in a single mock community, but it may appear in more than one mock community.

Agreed, re: (2). I think that's a flaw in that approach, and we should de-emphasise it in favour of (1) and (3).

widdowquinn commented 5 years ago

As we discussed earlier, there are many options for measuring classifier performance here, and it's nowhere near as clear-cut as evaluating a binary classifier. It's a difficult problem and we're probably going to refine how we do this a few times before we're finished.

I think our first approach should be to estimate the classifier's performance at its main intended purpose: detecting which species are present in a sample. In the first instance, I think we can look at the ability to correctly recover the input species from mock communities built with approaches (1) and (3).

Suppose we have a "real" (post-sequencing) environmental sample to present to the classifier. This is a heavily processed representation of what was present in the biological material - so what influences have there been?

Clearly, with a mock community, we can't accurately represent everything that could influence an environmental sample, so it will never be an exact test.

However, we can apply a lot of what we've learned from the synthetic controls. These went in as (we assume) homogeneous collections of sequence, and by the time they're ready to meet the classifier (post-sequencing) they are proxies for many of the processes listed above. We have estimates of:

Now, when we make up mock communities from single-isolate controls, our input sequences to the mock communities have already been through this process (though perhaps less severely than "real" samples, due to the quality of the DNA prep and the amount of DNA available). Moreover, some of those sequences have also been included as references in the database, as they are our best reference for the source species. Even though they carry some of the signatures of PCR bias and sequence variation, those single-isolate sequences must (I think) be treated as though they are "real" biological sequence inputs at the start of the fictional sequencing process of creating a mock community.

We know that there will be reads belonging to those isolates that are not represented in the database - where these are assigned to other species by the classifier they may be false positives, but we should anticipate that they will mostly be "unclassified". When considered as part of the mock community, however, the type (1) reads will not carry the same level of "noisy" sequence variation that we would expect to mask the biological signal in a "real" sample.

This leaves us in a position where we will still need to put "extra" sequence variation in the mock community to get the type (3) community. So how should we do this?

I think we can take the edit distance information from the synthetic control read figure above, and use it to add proportionately the same level of edit distance to randomly-chosen reads from the mock community. For each of the control sequences, we see a distribution of low-abundance (<1000) reads at each edit distance up to about 30-35. The maximum (and presumably central) abundance count falls as the edit distance increases. If we introduce into each mock community some "sequence noise" with the same distribution, I think we will have gone a long way towards making the mock community's sequence variation for a known input/set of inputs quite representative of what we'd expect from a "real" sample.
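
As a rough sketch of what I mean (Python; the edit-distance values and weights are placeholders that would be read off the synthetic control figure, and for brevity this ignores adjusting the FASTQ quality strings after deletions):

```python
import random

BASES = "ACGT"


def mutate_read(seq, edit_distance, rng):
    """Apply roughly edit_distance random substitutions/deletions to one read."""
    seq = list(seq)
    for _ in range(edit_distance):
        if not seq:
            break
        i = rng.randrange(len(seq))
        if rng.random() < 0.5:
            del seq[i]  # deletion
        else:
            seq[i] = rng.choice([b for b in BASES if b != seq[i]])  # substitution
    return "".join(seq)


def add_sequence_noise(reads, edit_distances, weights, noise_fraction, seed=None):
    """Mutate a random subset of reads, drawing each mutated read's edit
    distance from the distribution observed for the synthetic controls."""
    rng = random.Random(seed)
    noisy = []
    for seq in reads:
        if rng.random() < noise_fraction:
            d = rng.choices(edit_distances, weights=weights, k=1)[0]
            seq = mutate_read(seq, d, rng)
        noisy.append(seq)
    return noisy


# Placeholder distribution: edit distances 1..35 with weights falling off as
# the abundance does in the figure (to be replaced with the actual counts).
# noisy = add_sequence_noise(reads, list(range(1, 36)),
#                            [1.0 / d for d in range(1, 36)], 0.05, seed=1)
```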

widdowquinn commented 5 years ago

The type (1) and (3) communities in the comment above contain no true negative examples to cause confusion. Any off-target identifications would be the result of ambiguous/mutated sequences that derive from a known positive input.

This doesn't reflect the potential for confusion by cross-contamination during sampling and processing.

To simulate this, we could - with a chosen probability - select one or more off-target sequenced single isolates, and sample reads for each such isolate up to an abundance similar to that seen for the synthetic controls in the figure above. These reads would be "negatives" (in sample terms) whose identification by the classifier would be a false positive.
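
Something like this sketch is what I have in mind (the probability and abundance cap are illustrative placeholders that would be chosen to mirror the figure above):

```python
import random


def maybe_spike_contaminant(community_reads, off_target_reads, probability,
                            max_abundance, seed=None):
    """With the given probability, add a small number of reads from an
    off-target isolate, at an abundance up to max_abundance, to mimic
    cross-contamination during sampling/processing."""
    rng = random.Random(seed)
    if rng.random() < probability:
        n = rng.randint(1, min(max_abundance, len(off_target_reads)))
        community_reads = community_reads + rng.sample(off_target_reads, n)
    return community_reads
```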

This might be taking verisimilitude too far. I'm still undecided as to whether we really need to emulate the mechanics of the sampling/experimental process to this extent.

Should we just be happy with measuring the overall ability to correctly classify reads, together with the species-level performance estimates that we have now?

Is your/(our) time better spent making the classifier more user-friendly, and getting the output formats in shape?