statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
89 stars 30 forks source link

Cut middle of reads #53

Open shimbalama opened 5 years ago

shimbalama commented 5 years ago

Hi - I'd like to use clipoverlap on really bad somatic WGS FFPE data. A lot of my reads overlap completely and nearly all have some overlap. Many have insert size <100bp. (I didn't design the experiment...)

The way I understand it - if I have 2 reads of 100bp each that overlap by 80bp (insert size 120bp) then clipoverlap will give me 1 read of 20bp and 1 of 100bp. Is that right? What I need is 2 reads of 60bp as output. So trim both reads in the middle of the overlap. is that something you would be interested in doing? Just thought I'd check before re-inventing the wheel.

Regards,Liam

mktrost commented 5 years ago

Your understanding of how ClipOverlap works is correct. I do not have time at this moment to make that change. I would be able to offer advice and assist someone with the changes via feedback/answering questions. If I am understanding your request correctly, it shouldn't be too difficult of an update for someone familiar with C++. I'd hate for you to reinvent the wheel when I don't think it would actually be too difficult to update ClipOverlap. I designed ClipOverlap with the intent that someone may want to use different Clipping Logic.

Is updating C++ something you think you could do? I'm including all my basic thoughts below in case you or someone else is able to make these updates to ClipOverlap. Let me know what you think. Otherwise, maybe in the next few weeks I can throw a rough something together. It wouldn't be as thoroughly tested or as clean as I like. I would probably leave some testing up to you too.

I believe all the logic in ClipOverlap itself could remain the same, and it is only the actually clipping logic that needs to be modified. All of that logic should be contained in the Overlap Handler: OverlapClipLowerBaseQual.cpp (found in bamUtil/src: https://github.com/statgen/bamUtil/blob/master/src/OverlapClipLowerBaseQual.cpp).

There are 2 options 1: support either type of clipping (original clipOverlap or your new version); 2: support just your new method.

2) To support just your new method, you should be able to just modify OverlapClipLowerBaseQual to contain the clipping logic you want. Rather than figuring out which strand has lower quality, you would just calculate the clip positions you want based on the overlap and clip based on that.

1) To support your new method as well as the original ClipOverlap operation, it should just involve adding a new class OverlapSplitClip (or whatever you would want to call it). It would inherit from OverlapHandler, and be used instead of OverlapClipLowerBaseQual (those classes are both found in bamUtil/src). Then ClipOverlap.cpp would need to be updated to use a flag or something to tell it which overlap handling logic to use. I would make the current operation default and yours via a flag.

Let me know if you have any questions.

shimbalama commented 5 years ago

Thanks for your prompt and detailed response. I have no experience with C++, however, given its just one or two classes I might give it a go.

If you do have time yourself in the next few weeks I'm sure you're rough version would be better than mine and we'd be more than happy to test it (and give it a few decent citations).

Regards, Liam