test output for presence and correctness of single nucleotide errors

dlaehnemann commented 6 years ago

Once we have the single nucleotide error log from issue #1, we should try to create some kind of sanity check workflow for the error introduction. I think this will have to use external tools, so in this repo I would keep it to instructions on how to do the sanity check and an example dataset we can do it on (probably what already exists in the examples/ folder).

To start with, we could align the mdasim-amplified fasta sequences to the original sequence, first visually inspect them and then generate some pileup data with the following tools, all of which should easily install through bioconda:

For alignment, I would just go for bwa mem as in the 1st example here (with a fasta file of the "amplified" DNA fragments instead of a fastq file for the "reads"): https://github.com/lh3/bwa#1-what-types-of-data-does-bwa-work-with
You can visually inspect such an alignment with samtools tview (this used to not work in the bioconda version, but should in the most recent version of samtools 1.6 -- but if it doesn't, let me know, and we can check for alternatives), see the user manual on the command line or here: http://www.htslib.org/doc/samtools.html
To compare it against the single nucleotide error log, samtools mpileup should generate all the necessary output (in the non-VCF/non-BCF plain text format), also check out the samtools docs: http://www.htslib.org/doc/samtools.html

dlaehnemann commented 6 years ago

Even though this generates some external dependencies, I think a very small Snakemake pipeline in the examples folder is reasonably easy to set up and to reproduce for other people. So this will not become an automated test, but something that users can easily reproduce to test their output.

VSack commented 6 years ago

Concerning BWA-MEM, the README on their github pages states:

Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment. It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment.

So maybe we should settle for minimap2? It is also available on Bioconda here.

abremges commented 6 years ago

So maybe we should settle for minimap2? It is also available on Bioconda here.

Only if you actually simulate PacBio or ONT long read data. For Illumina reads, BWA-MEM (or another classical short read mapper) is still the way to go.

VSack commented 6 years ago

Mkay actually this pipeline we're about to build is just for checking the new single nucleotide error function we implemented in mdasim 2.0 (presence and distribution of errors), but I guess you're right anyway, even if we're not dealing with Illumina or any kind of reads at this point.

dlaehnemann commented 6 years ago

I agree with what Andi said about actual read data. In this case, it's actually a mix of both read types: long sequences generated by MDA that are more in the range of PacBio and ONT, while the error rate is even below that of Illumina. As minimap2 also offers full genome alignments as a possible use case, this sounds like a good option to me. I'd check out the minimap2 man page for options that apply to our setting, the general setting -x asm5 looks like it's made for our use case...

VSack commented 6 years ago

We have a first approach to an analysis pipeline now which can be found in examples/testing. It runs minimap -> samtools view -> samtools sort -> samtools mpileup -> summary.py and prints a file that summarizes all positions where the alignment finds a SNP, sorted by how often that SNP occurs:

#pos    ref alt cnt
35678   C   T   167
...
205115  C   A   1

I did a few sample checks (cnt in the summary <=> IGV output), to make sure the code works properly so far.

However I have some remaining issues/questions:

There are also lines like 196294 T K 1. I'm wondering where the K is coming from and whether I should just ignore anything but A,T,C,G.
Also, how are we actually going to compare these errors to the ones in the log, where we know only the chronology of the errors but not their position? My idea would be a statistical distribution to check the expected number of a type of substitution (A to G for example) based on that chronology (the earlier, the more expected) compared to the sum of substitutions of this type that we get when analysing the output.

dlaehnemann commented 6 years ago

Regarding 1., looking at the samtools mpileup output, it looks like there is mapping quality information in ASCII code after the read start symbol ^, so we should try to either get rid of this via (the absence of) the command-line option, and if that doesn't work, we need to skip that in parsing.

Regarding 2.: That is pretty much how I would do it: Count the substitutions by type (e.g. A->G) in the direct mdasim output and also get these counts from the checkup pipeline -- they should match up, with hopefully no differences. If there are differences, we need to investigate those.

For the checkup pipeline output: I really like the inclusion of the counts, these will be important for amplification bias plots and calculations later on. To that end, we should also include the count of reference nucleotides covering that site.

VSack commented 6 years ago

Concerning 1: I could not find a cmd-line option to get rid of the ^s, so I changed the Python script to skip them and the quality char that follows them. Concerning 2: I implemented another rule which generates a comparison.txt, which, for my testing sequence, looks like:

#substitution   log_cnt     sum_cnt
A to T          42          46
A to C          46          43
A to G          50          52
T to A          43          39
T to C          51          49
T to G          45          46
C to A          42          44
C to T          42          51
C to G          63          72
G to A          47          38
G to T          51          48
G to C          51          42

As you can see, the rates are not quite equal. I can imagine, that some errors get lost in the mapping process, but I don't know for sure.

If you want to run the pipeline yourself, I also added a README for easier usage and also for potential future users.

Oh and the count of reference nucleotides can also be found in summary.txt now of course:

#pos    ref alt r_cnt   a_cnt
35678   C   T   890     167
103604  T   C   359     166

dlaehnemann commented 6 years ago

You are right, the differences between the counts generated by mdasim itself (log_cnt) and those generated from the mapping (sum_cnt) are strange and probably come down to mapping / alignment issues. But for a proper ground truth generation, we will probably have to sort this out in more detail. So here are our ideas from the videocall earlier:

For a substitution type, where mdasim counts and mapping counts are rather far apart, look at the alignment at the positions given in summary.txt. I.e. look at aln_sorted.bam with samtools tview -p chr:pos or some interactive bam viewer, e.g. IGV (installable via bioconda). It's probably a good idea to check this for both directions of deviations at least once (i.e. both for more substitutions found via alignment and for less substitutions found via alignment).
Check out stats for a larger genome region and see if the differences remain at the same relative level (20%) or if there might be another reason for them, so that they will remain at similar absolute count levels but will disappear relatively.
Plot relative frequencies of amplification errors recorded in summary.txt, i.e. a density plot or a histogram of a_cnt / (a_cnt + r_cnt) over all substitutions.

VSack commented 6 years ago

As I also wrote in the email earlier, it seems that the absolute error rates remain the same, while the relative rate drops to max. 8%: (Index := line in comparison.txt, a type of substitution)
Plotting the relative frequencies of amplification errors recorded in summary.txt resulted in the following rates of a_cnt / (a_cnt + r_cnt), sorted by position on the genome:

dlaehnemann commented 6 years ago

For 3, I think I didn't elaborate enough: I was thinking of a histogram where the x-axis is the relative frequency (all possible frequencies from 0 to 1) and the y-axis is the amount of positions with errors that have the respective frequency. Does this make sense?

And I'm curios if 1. will give us any more insight in the exact reason for the discrepancies that seem to stay similar in absolute numbers...

VSack commented 6 years ago

For 3, I think I understood now what you wanted: frequencies_all frequencies_lower The second histogram shows everything from the second bar on, since they were so small in the first diagram.

For 1, that'll be the next thing to investigate then.

dlaehnemann commented 6 years ago

Yep, that looks like what I was thinking about and looks like what I'd expect (the vast majority being low-frequent errors, but with a long tail up to substantial frequencies). Only the axis labels seem swapped?

VSack commented 6 years ago

Oh yes, I missed that one, sorry. Should I reupload the corrected images for documentation purposes?

dlaehnemann commented 6 years ago

Nah, don't worry -- it's documented now... ;)

dlaehnemann commented 6 years ago

Just a quick recap of our earlier discussion: We'll try to get mdasim to not only report how many substitutions it creates, but also where exactly (as in: position on the starting fasta sequence). This should give us a "perfect ground truth" instead of the mapping workaround. However, if this does get too complicated, we can always turn back to figuring things out in this issue...

VSack commented 6 years ago

I think I found a way to figure out the positions, however, if I compare the log with the positions and the substitutions to the mapping, I get rather bad rates of logSNPS==alnSNPs - too good to say, I just print the wrong positions, but out of 375 SNPs in the 210kB sequence, the alignment process only finds

168 correct ones
48 where the reference and the alternative seem to be swapped
160 that are not in the log-file.
Finally, 159 positions are not found by the alignment at all.

I'll upload the current output into the nextCloud storage for you to investigate if you want to.

dlaehnemann commented 6 years ago

Regarding the sites reported in comparison_exp.txt, I would say:

the 168 correct ones are great and already document that the log output of mdasim reports the correct positions
the 48 swaps sound like the reporting of mdasim might be the wrong way around, here: It does not necessarily report the original base in the reference as the starting base, but the base that it is changing, which might be the complementary strand. E.g. at position 6345, the reference base is a G, but mdasim reports a C -> G change -- this points to the substitution happening in the strand complementary to the reference strand. So we would want to adjust the log file to always report the old and the new reference strand nucleotide and maybe also the actual old and new base in the substitution in further columns.
the 113 where the positions agree, but nucleotides do not match, are actually all just complementary nucleotides (I tested with awk) and point to the strand problem as the bullet point right above (nucleotide 18068 in the reference sequence is a T, as reported by the alignment, not an A as mdasim reports)

The rest of the sites are either only reported by mdasim (46), or only found in the mapping / alignment (47). While the positions don't match, the similar numbers of sites got me going on a little hunt, and here's the pattern I got: If you take the distances between the sites within each of these sets, they agree almost fully, but with the order of the distances different. So while the first three sites reported only in mdasim have the distances:

The distances for the last three sites reported via the alignment are:

The only discrepancy creeps in, where one distance reported only for mdasim (36888) is broken down into two consecutive distances reported by the alignment (16699, 20189). This probably points to a genuine problem with the alignment (also, the extra site reported has an N as the substituted nucleotide?). However, I haven't fully understood this same distances but reversed order phenomenon, but I think this might also be an artefact of strandedness, as above -- so could you double-check that sites are counted correctly with regard to the reference, even when looking at the complementary strand? If that doesn't fix it, we'll have to dig deeper.

But all in all, this looks very promising, already!

This testing pipeline seems to be paying off, here! ;)

dlaehnemann commented 6 years ago

Quick addition to the above: The actual substitutions in the 46 sites reported only by either mdasim or the mapping/alignment also match -- with the complementarity caveat from above. So this does sound like some counting direction problem in the mdasim logging output to me.

VSack commented 6 years ago

By considering reverse order of positions on the reverse strand and complementary bases on the original strand when substitutions happen on the reverse one, I managed to get a coverage of 321 substitutions in the alignment that are the same as in the log for the 210kB-sequence. Only few errors remain:

In this test, I had two positions, that seemed to experience two substitutions during the amplification. The alignment pipeline does not consider these cases at the moment, it only looks for ONE alternative base. Checking the positions in IGV revealed that actually both substitutions were found by the alignment process.
Three substitutions in the log were not found in the alignment
The alignment found a substitution "C to N", and I'm not yet sure where that comes from, but I'm investigating it.

The problematic positions are listed below:

#ref_pos    ref_ref     alt_ref     aln_pos     aln_ref     aln_alt     ref=aln
44946       A       G       44946       A       C        - 
44946       A       C       -       -       - 
48265       G       C       -       -       - 
160470      C       T       160470      C       T        true 
160470      C       G       -       -       - 
178722      T       G       -       -       - 
-       -       -       181055      C       N
191866      T       C       -       -       -

dlaehnemann commented 6 years ago

positions hit twice

The positions that were hit twice do sound possible, even though still quite unlikely -- let's look at it as a Bernoulli process, where we draw the particular nucleotide as many times as we see coverage and every time we can either incorporate an error or not:

the probability that a certain nucleotide incorporation introduces an error is: 2.95^-6 (current default in mdasim 2, derived from an issue in our phylogenetic single cell pipeline )
site 44946 above has a ref coverage of 1513, which I'll use to approximate total coverage

so the probability that two errors occur at that site will be:

(cov over n_err) * p_err^n_err * (1 - p_err)^(cov - n_err)
= 1513! / ( 2! * (1513 - 2)! ) * ( 2.95^(-6) )^2 * ( 1 - 2.96^(-6) )^1511
= 0.2655

Interestingly, this probability is of two errors at a site with coverage 1513 is both higher than that for one error (0.2311) and that for three errors (0.2032). For the lower coverage of 950 at site 160470 this does not hold true -- here one error has the maximum probability of 0.3411, while two errors only have a probability of 0.2460. But all in all, this sounds like it is reasonable that this happens in the process, so we shouldn't worry about the random number generator, here.

TODO: Make the alignment / mapping pipeline identify more than one error allele if present.

change to N detected in mapping pipeline

TODOs:

look at site with IGV
look at mpileup output at that position (some mpileup symbols we were not aware of?)

positions not detected in mapping pipeline

The three positions not detected in the mapping pipeline are probably due to alignment artefacts.

TODO:

look at mpileup output at these positions, to check if the error nucleotide is at least seen there

One more systematic way to know a little more about them, would be to add a second run of mpileup to the mapping pipeline, where we explicitly feed in the positions that mdasim reported that it has altered--this way we should at least get coverage information for these sites. But I'm not yet sure if this is worth the effort.

VSack commented 6 years ago

detected Ns in mapping pipeline

At position 181055, we get ...CCCCccCCc-6nnnnnnCcCCccCCcCC... so a 6 bases long deletion. This error arises from mpileup:

A pattern \\+[0-9]+[ACGTNacgtn]+ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern -[0-9]+[ACGTNacgtn]+ represents a deletion from the reference.

positions not detected in mapping pipeline

An investigation of the mpileup output does not show any error nucleotides, just the reference one.

TODOs:

[x] Make the alignment / mapping pipeline identify more than one error allele if present.
[x] Work around deletion(insertion) patterns or modify output to consider insertions/deletions?
[ ] Re-run mpileup by explicitly feeding in the positions that mdasim reported that it has altered.

dlaehnemann commented 6 years ago

@VSack: With these past two commits, the testing pipeline confirms that your implementation makes sense -- so I am closing this issue. Thanks!

hzi-bifo / mdasim