swo / caravan

a 16S pipeline
0 stars 3 forks source link

dealing with all edgecases of merging reads #1

Open elsherbini opened 8 years ago

elsherbini commented 8 years ago

I have a bunch of ~250bp paired reads, and the region between the primers is only 215 bases. (This is for a Vibrio HSP60 amplicon). Currently the best_merge_position function in merge.py doesn't allow for this.

elsherbini commented 8 years ago

working on a pull request to make merge work in this case and the original use case in hopefully some elegant way.

swo commented 8 years ago

Like, they're staggered?

--FORWARD REVERSE-- On Fri, Dec 11, 2015 at 12:17 PM Joseph Elsherbini notifications@github.com wrote:

working on a pull request to make merge work in this case and the original use case in hopefully some elegant way.

— Reply to this email directly or view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-163994771.

elsherbini commented 8 years ago

like this:

forward -------------CATCGACAAAGCGGTTATCGCGGCTGTTGAAGAGCTGAAGAACCTTTCTGTTCCTTGTTCAGACACGAAAGCTATCGCGCAAGTAGGTACTATCTCTGCGAACTCTGATTCGACAGTAGGTAACATCATTGCTGAAGCGATGGAAAAAGTAGGCCGTGATGGCGTTATCACTGTTGAAGAAGGTCAGGCTCTACAAGACGAGCTAGACGTAGTTGAAGGCATG
reverse ATCTTAAGCGCGGCATCGACAAAGCGGTTATCGCGGCTGTTGAAGAGCTGAAGAACCTTTCTGTTCCTTGTTCAGACACGAAAGCTATCGCGCAAGTAGGTACTATCTCTGCGAACTCTGATTCGACAGTAGGTAACATCATTGCTGAAGCGATGGAAAAAGTAGGCCGTGATGGCGTTATCACTGTTGAAGAAGGTCAGGCTCTACAAGACGAGCTAGACGTAGTTG--------

The primers are designed to capture a 215 bp fragment, and the reads were longer than that, into the primer on the other side.

elsherbini commented 8 years ago

Yes, exactly like you said. I didn't notice I had to click to expand your blockquote.

swo commented 8 years ago

Interesting!

A question: Is "size" the useful way to talk about your expectations for staggered reads? I'm used to saying that I should have a 253 bp amplicon. If there's some other way that's useful to look at staggered reads (e.g., by overhang? idk) then maybe it's worth breaking out into another function (merge_stagger or something) that will have a different interface.

A request: If you know how to write tests, could you put an example of what staggered reads looks like and how they should merge into a new test/merge.py (or test/merge_stagger.py or whatever)? That would help me know what we're looking at.

A thought: I'm a little wary of this setup because it requires that all the possible product sizes be staggered or not. Maybe this is always the case in practical settings. I'm just nervous that, if I ever have two 150 bp reads and I ask for a 150 +/- 5 bp merged sequence, this setup will break because the 145-149 size products are overlapped and the 151-155 products are staggered.

elsherbini commented 8 years ago

fixed by acc221b ?

swo commented 8 years ago

Yes, I think so.

Also, you should check out my merge sizer https://github.com/swo/caravan/blob/master/tools/merge-sizer.py. It's rough around the edges, but I find this a really easy way to figure out if my reads are going to merge the way I expect, in the not/staggered position I expect, at the size I expect.

On Thu, May 19, 2016 at 9:42 PM Joseph Elsherbini notifications@github.com wrote:

fixed by acc221b https://github.com/swo/caravan/commit/acc221b9bcdc7af6135155cc854bf3d72dadb601 ?

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-220429914

elsherbini commented 8 years ago

Here is the problem at hand, reads can overlap in lots of ways:

# size=14
xxxOVERLAP1---
---OVERLAP2yyy
# expected output: xxx + consensus(OVERLAP1, OVERLAP2) + yyy

# size=8
---OVERLAP1xxx
yyyOVERLAP2---
# expected output: consensus(OVERLAP1,OVERLAP2)

# size=8
OVERLAP1
OVERLAP2
# expected output: consensus(OVERLAP1, OVERLAP2)

# size=11
xxxOVERLAP1
---OVERLAP2
# expected output: xxx + consensus(OVERLAP1, OVERLAP2)

# size=11
OVERLAP1---
OVERLAP2yyy
# expected output: consensus(OVERLAP1, OVERLAP2) + yyy

# no overlap, produce error:
# size > len(x) + len(y)
xxx---
---yyy
# expected output: error

I'm working on a solution (potentially with the help of stack overflow) to deal with all those cases, hopefully without devolving into the dozen if statement approach.

swo commented 8 years ago

I'm confused re: case 2. I would expect that, if staggering is off, you get consensus(OVERLAP1, OVERLAP2), but if staggering is on, yyy + consensus + xxx.

On Fri, Jul 1, 2016 at 1:45 PM Joseph Elsherbini notifications@github.com wrote:

Here is the problem at hand, reads can overlap in lots of ways:

size=11

xxxOVERLAP--- ---OVERLAPyyy

expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy

size=7

---OVERLAPxxx yyyOVERLAP---

expected output: consensus(xOVERLAP, yOVERLAP)

size=7

OVERLAP OVERLAP

expected output: consensus(xOVERLAP, yOVERLAP)

size=7

xxxOVERLAP ---OVERLAP

expected output: xxx + consensus(xOVERLAP, yOVERLAP)

size=10

OVERLAP--- OVERLAPyyy

expected output: consensus(xOVERLAP, yOVERLAP) + yyy

no overlap, produce error:

xxx--- ---yyy

expected output: error

I'm working on a solution (potentially with the help of stack overflow) to deal with all those cases, hopefully without devolving into the dozen if statement approach.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230007623, or mute the thread https://github.com/notifications/unsubscribe/ACbmk24V9r4qO0eTSq7LuSZJ5edg0EY0ks5qRVJUgaJpZM4Gzvhe .

elsherbini commented 8 years ago

The resulting merged sequence should always start with the beginning of x and end with the end of y.

swo commented 8 years ago

OK, so the overhang on staggered reads should always be deleted? (This is what usearch does: http://drive5.com/usearch/manual/cmd_fastq_mergepairs.html) Is there ever a case when you would want to keep the overhangs for staggered reads? (I suppose I had guessed that, because you often want to keep the overhangs on non-staggered reads, you might also want to keep them for staggered reads.)

On Fri, Jul 1, 2016 at 2:11 PM Joseph Elsherbini notifications@github.com wrote:

The resulting merged sequence should always start with the beginning of x and end with the end of y.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230013235, or mute the thread https://github.com/notifications/unsubscribe/ACbmk4GROTRxjMztq8bAhUL644FASNvwks5qRVhegaJpZM4Gzvhe .

elsherbini commented 8 years ago

Yeah, always want to delete the overhang since it is the primer and adapter of the other direction. Didn't realize usearch already implemented this, very nice. No source available though?

swo commented 8 years ago

Nope, usearch is emphatically closed source: the full version costs like $10k.

I guess "overhang" doesn't make sense for non-staggered reads: "staggered" means "it has an overhang". Then agreed, we should just always toss it!

On Fri, Jul 1, 2016 at 2:18 PM Joseph Elsherbini notifications@github.com wrote:

Yeah, always want to delete the overhang since it is the primer and adapter of the other direction. Didn't realize usearch already implemented this, very nice. No source available though?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/swo/caravan/issues/1#issuecomment-230014763, or mute the thread https://github.com/notifications/unsubscribe/ACbmk8QmkM8W0R5BEvCURGQ8SOUqMlKbks5qRVoIgaJpZM4Gzvhe .