Open elsherbini opened 8 years ago
working on a pull request to make merge
work in this case and the original use case in hopefully some elegant way.
Like, they're staggered?
--FORWARD REVERSE-- On Fri, Dec 11, 2015 at 12:17 PM Joseph Elsherbini notifications@github.com wrote:
working on a pull request to make merge work in this case and the original use case in hopefully some elegant way.
— Reply to this email directly or view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-163994771.
like this:
forward -------------CATCGACAAAGCGGTTATCGCGGCTGTTGAAGAGCTGAAGAACCTTTCTGTTCCTTGTTCAGACACGAAAGCTATCGCGCAAGTAGGTACTATCTCTGCGAACTCTGATTCGACAGTAGGTAACATCATTGCTGAAGCGATGGAAAAAGTAGGCCGTGATGGCGTTATCACTGTTGAAGAAGGTCAGGCTCTACAAGACGAGCTAGACGTAGTTGAAGGCATG reverse ATCTTAAGCGCGGCATCGACAAAGCGGTTATCGCGGCTGTTGAAGAGCTGAAGAACCTTTCTGTTCCTTGTTCAGACACGAAAGCTATCGCGCAAGTAGGTACTATCTCTGCGAACTCTGATTCGACAGTAGGTAACATCATTGCTGAAGCGATGGAAAAAGTAGGCCGTGATGGCGTTATCACTGTTGAAGAAGGTCAGGCTCTACAAGACGAGCTAGACGTAGTTG--------
The primers are designed to capture a 215 bp fragment, and the reads were longer than that, into the primer on the other side.
Yes, exactly like you said. I didn't notice I had to click to expand your blockquote.
Interesting!
A question: Is "size" the useful way to talk about your expectations for staggered reads? I'm used to saying that I should have a 253 bp amplicon. If there's some other way that's useful to look at staggered reads (e.g., by overhang? idk) then maybe it's worth breaking out into another function (merge_stagger or something) that will have a different interface.
A request: If you know how to write tests, could you put an example of what staggered reads looks like and how they should merge into a new test/merge.py
(or test/merge_stagger.py
or whatever)? That would help me know what we're looking at.
A thought: I'm a little wary of this setup because it requires that all the possible product sizes be staggered or not. Maybe this is always the case in practical settings. I'm just nervous that, if I ever have two 150 bp reads and I ask for a 150 +/- 5 bp merged sequence, this setup will break because the 145-149 size products are overlapped and the 151-155 products are staggered.
fixed by acc221b ?
Yes, I think so.
Also, you should check out my merge sizer https://github.com/swo/caravan/blob/master/tools/merge-sizer.py. It's rough around the edges, but I find this a really easy way to figure out if my reads are going to merge the way I expect, in the not/staggered position I expect, at the size I expect.
On Thu, May 19, 2016 at 9:42 PM Joseph Elsherbini notifications@github.com wrote:
fixed by acc221b https://github.com/swo/caravan/commit/acc221b9bcdc7af6135155cc854bf3d72dadb601 ?
— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-220429914
Here is the problem at hand, reads can overlap in lots of ways:
# size=14
xxxOVERLAP1---
---OVERLAP2yyy
# expected output: xxx + consensus(OVERLAP1, OVERLAP2) + yyy
# size=8
---OVERLAP1xxx
yyyOVERLAP2---
# expected output: consensus(OVERLAP1,OVERLAP2)
# size=8
OVERLAP1
OVERLAP2
# expected output: consensus(OVERLAP1, OVERLAP2)
# size=11
xxxOVERLAP1
---OVERLAP2
# expected output: xxx + consensus(OVERLAP1, OVERLAP2)
# size=11
OVERLAP1---
OVERLAP2yyy
# expected output: consensus(OVERLAP1, OVERLAP2) + yyy
# no overlap, produce error:
# size > len(x) + len(y)
xxx---
---yyy
# expected output: error
I'm working on a solution (potentially with the help of stack overflow) to deal with all those cases, hopefully without devolving into the dozen if statement approach.
I'm confused re: case 2. I would expect that, if staggering is off, you get
consensus(OVERLAP1, OVERLAP2)
, but if staggering is on, yyy
+
consensus
+ xxx
.
On Fri, Jul 1, 2016 at 1:45 PM Joseph Elsherbini notifications@github.com wrote:
Here is the problem at hand, reads can overlap in lots of ways:
size=11
xxxOVERLAP--- ---OVERLAPyyy
expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy
size=7
---OVERLAPxxx yyyOVERLAP---
expected output: consensus(xOVERLAP, yOVERLAP)
size=7
OVERLAP OVERLAP
expected output: consensus(xOVERLAP, yOVERLAP)
size=7
xxxOVERLAP ---OVERLAP
expected output: xxx + consensus(xOVERLAP, yOVERLAP)
size=10
OVERLAP--- OVERLAPyyy
expected output: consensus(xOVERLAP, yOVERLAP) + yyy
no overlap, produce error:
xxx--- ---yyy
expected output: error
I'm working on a solution (potentially with the help of stack overflow) to deal with all those cases, hopefully without devolving into the dozen if statement approach.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230007623, or mute the thread https://github.com/notifications/unsubscribe/ACbmk24V9r4qO0eTSq7LuSZJ5edg0EY0ks5qRVJUgaJpZM4Gzvhe .
The resulting merged sequence should always start with the beginning of x and end with the end of y.
OK, so the overhang on staggered reads should always be deleted? (This is what usearch does: http://drive5.com/usearch/manual/cmd_fastq_mergepairs.html) Is there ever a case when you would want to keep the overhangs for staggered reads? (I suppose I had guessed that, because you often want to keep the overhangs on non-staggered reads, you might also want to keep them for staggered reads.)
On Fri, Jul 1, 2016 at 2:11 PM Joseph Elsherbini notifications@github.com wrote:
The resulting merged sequence should always start with the beginning of x and end with the end of y.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230013235, or mute the thread https://github.com/notifications/unsubscribe/ACbmk4GROTRxjMztq8bAhUL644FASNvwks5qRVhegaJpZM4Gzvhe .
Yeah, always want to delete the overhang since it is the primer and adapter of the other direction. Didn't realize usearch already implemented this, very nice. No source available though?
Nope, usearch is emphatically closed source: the full version costs like $10k.
I guess "overhang" doesn't make sense for non-staggered reads: "staggered" means "it has an overhang". Then agreed, we should just always toss it!
On Fri, Jul 1, 2016 at 2:18 PM Joseph Elsherbini notifications@github.com wrote:
Yeah, always want to delete the overhang since it is the primer and adapter of the other direction. Didn't realize usearch already implemented this, very nice. No source available though?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230014763, or mute the thread https://github.com/notifications/unsubscribe/ACbmk8QmkM8W0R5BEvCURGQ8SOUqMlKbks5qRVoIgaJpZM4Gzvhe .
I have a bunch of ~250bp paired reads, and the region between the primers is only 215 bases. (This is for a Vibrio HSP60 amplicon). Currently the
best_merge_position
function inmerge.py
doesn't allow for this.