da-i opened this issue 1 year ago
Hi! Apologies for the cryptic error and the limitation on input chromosome lengths. We have not tried any amplicon sequencing with our method so far, mainly because of the limited potential gain with such input data: for adaptive sampling you will get ~400-500 nt of each fragment anyway, since some sequence is needed to identify the origin of a fragment, and this length might not be too different from the full-length amplicons in the first place. How long are the fragments in your library?
Our use case might indeed be a little bit different. We've developed CyclomicsSeq, a technique to boost the quality of Nanopore reads by generating repeating DNA sequences prior to sequencing. The reads have an N50 of 8 kb+; the repeated sequence is 150 bases from the human genome, and slightly over 300 bases for the repeating unit.
Ok, let's see if I got the composition of the fragments right: it's rolling circle amplification of XY, where X is 150 b of human sequence, Y is some other repeating sequence of 300 b, and the sequenced fragments are then XYXY...XYXY, alternating to make fragments with an N50 of 8 kb? And for different fragments X is unique, but Y might not be, I assume? Is it chance whether X or Y is at the end of sequenced fragments, or do they always terminate with the same one? I'm wondering what the best design for references would be with these points in mind. There are two mapping steps to consider:
Can you share some more info about the references that you wanted to use?
Yes, in essence that is correct. however X can be other sizes as well. The sequence of Y in your example is known, and usually added to the reference genome. The sequence has been designed to not map to the human reference genome. The reference used in the test was a set of exons of a particular human gene, plus the sequence of Y (which we call backbones internally).
1. The alignment of the initial anchoring bases to find a read's origin and make the adaptive sampling decision (usually ~400-500 b). These would consist of a mixed sequence from X and Y in possibly unknown order? We have no control over the start or end position of the reads, so they start randomly in the sequence of the XY product.
2. The alignment of the full-length reads for tracking coverage information. This could either be 300+150 b references for each unique X, or some number of XY repeats to map against? If it's a single repeat, it would necessitate somehow chopping up reads at X-Y junctions before mapping. And if it were multiple repeats in the reference, the anchoring bases might be mapped to any one of them by chance, rendering decision making and coverage tracking very difficult.

This is an interesting point that I did not take into consideration yet. For a proof of concept we can assume that there is no bias in product length, as we are not looking for perfect balancing, just a slight improvement. Thus all coverage would be overestimated by roughly 10x.
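If reads did have to be chopped at the X-Y junctions and the backbone (Y) sequence is known, a minimal sketch could split on exact backbone matches. This is a hypothetical helper for illustration only; real nanopore reads contain errors, so in practice the junctions would need alignment-based detection rather than exact string matching:

```python
def split_at_backbone(read, backbone):
    """Split a concatemeric read (XYXY...XY) into the X segments that
    sit between copies of the known backbone sequence Y.

    NOTE: exact string matching is a simplification; sequencing errors
    would require alignment-based junction detection in practice."""
    segments = []
    pos = 0
    while True:
        hit = read.find(backbone, pos)
        if hit == -1:
            if pos < len(read):           # trailing partial segment
                segments.append(read[pos:])
            return segments
        if hit > pos:                     # sequence before this backbone copy
            segments.append(read[pos:hit])
        pos = hit + len(backbone)         # skip over the backbone itself

# Toy example: three repeats of an 8 b "X" and an 8 b backbone
read = ("ACGTACGT" + "TTTTGGGG") * 3
print(split_at_backbone(read, "TTTTGGGG"))  # ['ACGTACGT', 'ACGTACGT', 'ACGTACGT']
```

The number of segments per read also gives the repeat count needed to correct the roughly 10x coverage overestimation mentioned above.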
> The reference used in the test was a set of exons of a particular human gene, plus the sequence of Y (which we call backbones internally).
Does that mean each exon would be equivalent to a separate X from my nomenclature above? So the reads you get from an experiment are
E1YE1Y...E1YE1Y,
E2YE2Y...E2YE2Y,
etc.?
And your initial tests were to use references something along the lines of E1Y, E2Y, etc. or E1, E2, Y?
To be honest, I don't know how read mappers will behave in this situation, when the reads are longer than the reference. I would assume that either one of the repeats in a read maps to its short reference and the rest of the read is clipped, leaving you with an underestimation of coverage. Is that what you see? Or do you get mappings of all repeats in a read reported as supplementary alignments to the same reference? Just want to make sure I understand your experiments before giving any advice about how to use BOSS-RUNS in your case. And even then I'm not sure it will be straightforward in this scenario.
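To make the two reference layouts under discussion concrete, here is a small sketch (the helper and record names are hypothetical, chosen just for illustration):

```python
def build_references(exons, backbone, fuse=True):
    """Build a dict of reference records from per-exon sequences.

    fuse=True  -> one fused record per exon with the backbone appended
                  (the E1Y, E2Y, ... layout)
    fuse=False -> exons and the backbone kept as separate records
                  (the E1, E2, Y layout)
    """
    if fuse:
        return {f"{name}_Y": seq + backbone for name, seq in exons.items()}
    refs = dict(exons)
    refs["Y"] = backbone
    return refs

# Toy sequences standing in for two exons and the backbone
exons = {"E1": "ACGTACGT", "E2": "GGCCGGCC"}
print(build_references(exons, "TTTTAAAA"))
# {'E1_Y': 'ACGTACGTTTTTAAAA', 'E2_Y': 'GGCCGGCCTTTTAAAA'}
```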
> Does that mean each exon would be equivalent to a separate X from my nomenclature above? So the reads you get from an experiment are E1YE1Y...E1YE1Y, E2YE2Y...E2YE2Y, etc.?
Yes, that's correct!
> And your initial tests were to use references something along the lines of E1Y, E2Y, etc. or E1, E2, Y?
We've created a reference without Y, since I was afraid that it would lead to rejection of all reads.
With respect to mapping, aligning these reads with minimap is not an issue, since the length is long enough for minimap2/mappy. We usually get mappability rates (bases mapped / bases total) around 0.9-1 when we do include the backbone sequence (called Y in the above conversation) in the reference genome. This is what made us believe that a BOSS-RUNS-like approach should work for our data.
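A mappability rate like that can be computed from the query intervals of a read's alignments (e.g. the q_st/q_en coordinates that mappy reports per hit); a minimal sketch with hypothetical names:

```python
def mapped_fraction(read_len, intervals):
    """Fraction of a read's bases covered by at least one alignment.

    intervals: list of (q_st, q_en) query coordinates, e.g. taken from
    mappy alignment hits; overlapping intervals are merged so bases
    covered by several supplementary alignments count only once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous interval
        else:
            merged.append([start, end])              # start a new interval
    covered = sum(end - start for start, end in merged)
    return covered / read_len

# Two overlapping hits on a 100 b read -> 90 of 100 bases mapped
print(mapped_fraction(100, [(0, 50), (40, 90)]))  # 0.9
```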
Hi Author(s),
I think it is super cool that you've implemented this dynamic strategy!
We were trying to normalize some amplicons on a given chromosome, so we made sequences that were unique to these amplicons and provided them as a reference file. However, the minimum genome size of 10 kb made this impossible.
There is an unclear error when no sequence in the reference genome is long enough.
After some digging it appears that the mmi property at line 604 is None. If we hardcode the mmi and set min_len in BR_reference to 100, we get slightly further into the code:
If we print ref here, it is None.
If we check the values at line 1426 in BR_core we get:
If I hardcode ref here, I get slightly further, as it is actually aligning reads:
This is where I decided to stop digging through the codebase and hardcoding "fixes".
Is there a way to make this work? How would you proceed? An alternative is to add 5 kb of noise to the start and end of the amplicon sequences, but that feels very hacky.
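For completeness, the padding workaround mentioned above could look like the sketch below (a hypothetical helper; whether random flanking sequence interferes with the decision making would need testing):

```python
import random

def pad_to_min_length(seq, min_len=10_000, seed=42):
    """Pad an amplicon sequence with random flanking bases so it
    clears a minimum reference length (e.g. a 10 kb limit).
    The original sequence is kept intact in the middle; note that
    random flanks may still attract spurious mappings."""
    if len(seq) >= min_len:
        return seq
    rng = random.Random(seed)              # fixed seed: reproducible padding
    deficit = min_len - len(seq)
    left = deficit // 2
    pad = lambda n: "".join(rng.choice("ACGT") for _ in range(n))
    return pad(left) + seq + pad(deficit - left)

amplicon = "ACGT" * 100                    # 400 b toy amplicon
padded = pad_to_min_length(amplicon)
print(len(padded))                         # 10000
```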