s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

UMI Support for hts_SuperDeduper! #261

Closed bnjenner closed 4 months ago

bnjenner commented 4 months ago

Hello again,

I have added support for UMI-based PCR deduplication. It works by extracting the UMI from the read ID, converting it to bits, and appending it to the sequence key used for PCR duplicate removal, which essentially just extends the sequence. I also added an extra test function in hts_SuperDeduper_test that was useful during development. Additionally, I have run this new version of the tool on a variety of datasets and it appears to work as intended. I originally wanted to provide some data for this PR, but I realized that this testing was more benchmarking than validation of the algorithm. That being said, if you would like to see some of these results, I would be happy to add what I have found. It is worth noting, though, that on single-end reads, particularly TAGseq experiments with lower complexity, this method is comparable to umi_tools but ultimately worse, and for some reason it is very sensitive to whether or not hts_CutTrim is run before it. I am currently working on finding the best parameter settings for its application to TAGseq.
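To make the approach described above concrete, here is a minimal Python sketch of the idea (not the actual C++ code in this PR). Everything in it is an assumption for illustration: the function names, the 2-bit base encoding, the fixed key window, and the convention that the UMI is the last `_`-delimited field of the read name.

```python
# Illustrative sketch only: names, encoding, and read-ID layout are assumptions,
# not HTStream's implementation.
BASE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def extract_umi(read_id, sep="_"):
    """Return the UMI, assumed to be the last `sep`-delimited field of the read name."""
    name = read_id.split()[0]  # drop any comment after whitespace
    return name.rsplit(sep, 1)[1] if sep in name else ""


def to_bits(seq):
    """2-bit encode a sequence; return None if it contains a non-ACGT base."""
    try:
        return "".join(BASE_BITS[base] for base in seq.upper())
    except KeyError:
        return None


def dedup_key(read_id, sequence, start=0, length=10):
    """Build the duplicate key: bits of a sequence window followed by the UMI bits."""
    seq_bits = to_bits(sequence[start:start + length])
    umi_bits = to_bits(extract_umi(read_id))
    if seq_bits is None or umi_bits is None:
        return None
    return seq_bits + umi_bits  # the UMI simply extends the sequence key


def deduplicate(reads):
    """Keep the first read seen for each key; reads without a usable key are kept."""
    seen = set()
    kept = []
    for read_id, sequence in reads:
        key = dedup_key(read_id, sequence)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append((read_id, sequence))
    return kept
```

With this key, two reads with identical sequence windows are treated as duplicates only if their UMIs also match exactly.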

Anyway, let me know what you guys think. I am excited to finally have some eyes on this one.

bnjenner commented 4 months ago

In hindsight, umi_tools takes edit distance into account when comparing UMIs during deduplication. Let me know if you guys would prefer that the implementation include this.
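For reference, a distance-tolerant UMI comparison could look roughly like the Python sketch below. It uses Hamming distance (substitutions only) on equal-length UMIs as a simple stand-in; the function names and the one-mismatch threshold are assumptions, and this is not a reproduction of how umi_tools groups UMIs.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def same_umi(a, b, max_mismatches=1):
    """Treat two UMIs as equivalent if they differ at no more than `max_mismatches` positions."""
    return len(a) == len(b) and hamming(a, b) <= max_mismatches
```

Note that a tolerance like this does not fit directly into an exact-match key lookup, since near-identical UMIs produce different keys; it would need a separate comparison step among reads that share the same sequence key.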

msettles commented 3 months ago

Somewhere, however, incorrect QUAL characters are now getting introduced. Brad, have you seen this anywhere else on this branch? This is in hts_Overlapper.

bnjenner commented 3 months ago

I have not. Is this issue present in the test dataset? Also, what exactly is the issue? Are characters somehow being introduced or switched?

On Thu, Jun 6, 2024 at 9:36 AM Matt Settles wrote:

somewhere however, now incorrect QUAL characters are getting introduced. Brad have you seen this anywhere else on this branch?


msettles commented 3 months ago

No, this is in a large dataset. samtools fails with "invalid QUAL character found". Looking at the read in the BAM file, it shows a bad character in the middle of the read. This happens only in overlapped SE reads, though; pairs work just fine.
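One way to narrow this down before the reads reach samtools is to scan the FASTQ output for quality characters outside the printable range that SAM allows for QUAL (`!` through `~`, ASCII 33 to 126). The Python sketch below is a hypothetical helper, not part of HTStream, and it assumes plain four-line FASTQ records.

```python
def find_bad_qual(handle, lo=33, hi=126):
    """Yield (read_header, position, character) for each QUAL character outside [lo, hi]."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        handle.readline()  # sequence line (unused here)
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        for pos, ch in enumerate(qual):
            if not lo <= ord(ch) <= hi:
                yield header.strip(), pos, ch
```

Running it over the output of each pipeline stage in turn (for example `for hit in find_bad_qual(open("reads.fastq")): print(hit)`, with a file name of your choosing) would show which stage first emits the bad character.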

I’m backing up to before your changes now to see if the error is repeated there, and will report back.

Matt
