Closed dlaehnemann closed 6 years ago
Even though this generates some external dependencies, I think a very small Snakemake pipeline in the examples folder is reasonably easy to set up and to reproduce for other people. So this will not become an automated test, but something that users can easily reproduce to test their output.
Concerning BWA-MEM, the README on their github pages states:
Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment. It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment.
So maybe we should settle for minimap2? It is also available on Bioconda here.
So maybe we should settle for minimap2? It is also available on Bioconda here.
Only if you actually simulate PacBio or ONT long read data. For Illumina reads, BWA-MEM (or another classical short read mapper) is still the way to go.
Mkay actually this pipeline we're about to build is just for checking the new single nucleotide error function we implemented in mdasim 2.0 (presence and distribution of errors), but I guess you're right anyway, even if we're not dealing with Illumina or any kind of reads at this point.
I agree with what Andi said about actual read data. In this case, it's actually a mix of both read types: long sequences generated by MDA that are more in the range of PacBio and ONT, while the error rate is even below that of Illumina. As minimap2 also offers full genome alignments as a possible use case, this sounds like a good option to me. I'd check out the minimap2 man page for options that apply to our setting, the general setting -x asm5
looks like it's made for our use case...
We have a first approach to an analysis pipeline now which can be found in examples/testing
. It runs minimap
-> samtools view
-> samtools sort
-> samtools mpileup
-> summary.py
and prints a file that summarizes all positions where the alignment finds a SNP, sorted by how often that SNP occurs:
#pos ref alt cnt
35678 C T 167
...
205115 C A 1
I did a few sample checks (cnt in the summary <=> IGV output), to make sure the code works properly so far.
However I have some remaining issues/questions:
196294 T K 1
. I'm wondering where the K
is coming from and whether I should just ignore anything but A,T,C,G
.Regarding 1.
, looking at the samtools mpileup
output, it looks like there is mapping quality information in ASCII code after the read start symbol ^
, so we should try to either get rid of this via (the absence of) the command-line option, and if that doesn't work, we need to skip that in parsing.
Regarding 2.
: That is pretty much how I would do it: Count the substitutions by type (e.g. A->G
) in the direct mdasim output and also get these counts from the checkup pipeline -- they should match up, with hopefully no differences. If there are differences, we need to investigate those.
For the checkup pipeline output: I really like the inclusion of the counts, these will be important for amplification bias plots and calculations later on. To that end, we should also include the count of reference nucleotides covering that site.
Concerning 1: I could not find a cmd-line option to get rid of the ^
s, so I changed the Python script to skip them and the quality char that follows them.
Concerning 2: I implemented another rule which generates a comparison.txt
, which, for my testing sequence, looks like:
#substitution log_cnt sum_cnt
A to T 42 46
A to C 46 43
A to G 50 52
T to A 43 39
T to C 51 49
T to G 45 46
C to A 42 44
C to T 42 51
C to G 63 72
G to A 47 38
G to T 51 48
G to C 51 42
As you can see, the rates are not quite equal. I can imagine, that some errors get lost in the mapping process, but I don't know for sure.
If you want to run the pipeline yourself, I also added a README for easier usage and also for potential future users.
Oh and the count of reference nucleotides can also be found in summary.txt
now of course:
#pos ref alt r_cnt a_cnt
35678 C T 890 167
103604 T C 359 166
You are right, the differences between the counts generated by mdasim itself (log_cnt
) and those generated from the mapping (sum_cnt
) are strange and probably come down to mapping / alignment issues. But for a proper ground truth generation, we will probably have to sort this out in more detail. So here are our ideas from the videocall earlier:
summary.txt
. I.e. look at aln_sorted.bam with samtools tview -p chr:pos
or some interactive bam viewer, e.g. IGV (installable via bioconda). It's probably a good idea to check this for both directions of deviations at least once (i.e. both for more substitutions found via alignment and for less substitutions found via alignment).a_cnt / (a_cnt + r_cnt)
over all substitutions.As I also wrote in the email earlier, it seems that the absolute error rates remain the same, while the relative rate drops to max. 8%: (Index := line in comparison.txt, a type of substitution)
Plotting the relative frequencies of amplification errors recorded in summary.txt resulted in the following rates of a_cnt / (a_cnt + r_cnt)
, sorted by position on the genome:
For 3, I think I didn't elaborate enough: I was thinking of a histogram where the x-axis is the relative frequency (all possible frequencies from 0 to 1) and the y-axis is the amount of positions with errors that have the respective frequency. Does this make sense?
And I'm curios if 1.
will give us any more insight in the exact reason for the discrepancies that seem to stay similar in absolute numbers...
For 3, I think I understood now what you wanted: The second histogram shows everything from the second bar on, since they were so small in the first diagram.
For 1, that'll be the next thing to investigate then.
Yep, that looks like what I was thinking about and looks like what I'd expect (the vast majority being low-frequent errors, but with a long tail up to substantial frequencies). Only the axis labels seem swapped?
Oh yes, I missed that one, sorry. Should I reupload the corrected images for documentation purposes?
Nah, don't worry -- it's documented now... ;)
Just a quick recap of our earlier discussion: We'll try to get mdasim to not only report how many substitutions it creates, but also where exactly (as in: position on the starting fasta sequence). This should give us a "perfect ground truth" instead of the mapping workaround. However, if this does get too complicated, we can always turn back to figuring things out in this issue...
I think I found a way to figure out the positions, however, if I compare the log with the positions and the substitutions to the mapping, I get rather bad rates of logSNPS==alnSNPs - too good to say, I just print the wrong positions, but out of 375 SNPs in the 210kB sequence, the alignment process only finds
I'll upload the current output into the nextCloud storage for you to investigate if you want to.
Regarding the sites reported in comparison_exp.txt
, I would say:
6345
, the reference base is a G
, but mdasim reports a C -> G
change -- this points to the substitution happening in the strand complementary to the reference strand. So we would want to adjust the log file to always report the old and the new reference strand nucleotide and maybe also the actual old and new base in the substitution in further columns.awk
) and point to the strand problem as the bullet point right above (nucleotide 18068
in the reference sequence is a T
, as reported by the alignment, not an A
as mdasim
reports)The rest of the sites are either only reported by mdasim (46
), or only found in the mapping / alignment (47
). While the positions don't match, the similar numbers of sites got me going on a little hunt, and here's the pattern I got:
If you take the distances between the sites within each of these sets, they agree almost fully, but with the order of the distances different. So while the first three sites reported only in mdasim have the distances:
1948
3580
2229
...
The distances for the last three sites reported via the alignment are:
...
2229
3580
1948
The only discrepancy creeps in, where one distance reported only for mdasim (36888
) is broken down into two consecutive distances reported by the alignment (16699, 20189
). This probably points to a genuine problem with the alignment (also, the extra site reported has an N
as the substituted nucleotide?).
However, I haven't fully understood this same distances but reversed order phenomenon, but I think this might also be an artefact of strandedness, as above -- so could you double-check that sites are counted correctly with regard to the reference, even when looking at the complementary strand? If that doesn't fix it, we'll have to dig deeper.
But all in all, this looks very promising, already!
This testing pipeline seems to be paying off, here! ;)
Quick addition to the above: The actual substitutions in the 46
sites reported only by either mdasim
or the mapping/alignment also match -- with the complementarity caveat from above. So this does sound like some counting direction problem in the mdasim
logging output to me.
By considering reverse order of positions on the reverse strand and complementary bases on the original strand when substitutions happen on the reverse one, I managed to get a coverage of 321 substitutions in the alignment that are the same as in the log for the 210kB-sequence. Only few errors remain:
The problematic positions are listed below:
#ref_pos ref_ref alt_ref aln_pos aln_ref aln_alt ref=aln
44946 A G 44946 A C -
44946 A C - - -
48265 G C - - -
160470 C T 160470 C T true
160470 C G - - -
178722 T G - - -
- - - 181055 C N
191866 T C - - -
The positions that were hit twice do sound possible, even though still quite unlikely -- let's look at it as a Bernoulli process, where we draw the particular nucleotide as many times as we see coverage and every time we can either incorporate an error or not:
2.95^-6
(current default in mdasim 2, derived from an issue in our phylogenetic single cell pipeline )44946
above has a ref coverage of 1513
, which I'll use to approximate total coverage(cov over n_err) * p_err^n_err * (1 - p_err)^(cov - n_err)
= 1513! / ( 2! * (1513 - 2)! ) * ( 2.95^(-6) )^2 * ( 1 - 2.96^(-6) )^1511
= 0.2655
Interestingly, this probability is of two errors at a site with coverage 1513
is both higher than that for one error (0.2311
) and that for three errors (0.2032
). For the lower coverage of 950
at site 160470
this does not hold true -- here one error has the maximum probability of 0.3411
, while two errors only have a probability of 0.2460
. But all in all, this sounds like it is reasonable that this happens in the process, so we shouldn't worry about the random number generator, here.
TODO: Make the alignment / mapping pipeline identify more than one error allele if present.
TODOs:
The three positions not detected in the mapping pipeline are probably due to alignment artefacts.
TODO:
One more systematic way to know a little more about them, would be to add a second run of mpileup to the mapping pipeline, where we explicitly feed in the positions that mdasim reported that it has altered--this way we should at least get coverage information for these sites. But I'm not yet sure if this is worth the effort.
At position 181055
, we get ...CCCCccCCc-6nnnnnnCcCCccCCcCC...
so a 6 bases long deletion. This error arises from mpileup
:
A pattern
\\+[0-9]+[ACGTNacgtn]+
indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern-[0-9]+[ACGTNacgtn]+
represents a deletion from the reference.
An investigation of the mpileup output does not show any error nucleotides, just the reference one.
mpileup
by explicitly feeding in the positions that mdasim reported that it has altered.@VSack: With these past two commits, the testing pipeline confirms that your implementation makes sense -- so I am closing this issue. Thanks!
Once we have the single nucleotide error log from issue #1, we should try to create some kind of sanity check workflow for the error introduction. I think this will have to use external tools, so in this repo I would keep it to instructions on how to do the sanity check and an example dataset we can do it on (probably what already exists in the
examples/
folder).To start with, we could align the mdasim-amplified fasta sequences to the original sequence, first visually inspect them and then generate some pileup data with the following tools, all of which should easily install through bioconda:
bwa mem
as in the 1st example here (with a fasta file of the "amplified" DNA fragments instead of a fastq file for the "reads"): https://github.com/lh3/bwa#1-what-types-of-data-does-bwa-work-withsamtools tview
(this used to not work in the bioconda version, but should in the most recent version ofsamtools 1.6
-- but if it doesn't, let me know, and we can check for alternatives), see the user manual on the command line or here: http://www.htslib.org/doc/samtools.htmlsamtools mpileup
should generate all the necessary output (in the non-VCF/non-BCF plain text format), also check out the samtools docs: http://www.htslib.org/doc/samtools.html