magicDGS / ReadTools

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
https://magicdgs.github.io/ReadTools/
MIT License

How to extract the UMI info in Illumina read names into a separate tag #533

Open TendoLiu opened 4 years ago

TendoLiu commented 4 years ago

Hi, I have been working on UMI collapsing of Illumina DNA-seq data. The FASTQ header looks like the one below. Is there a way to move the UMI (e.g. "TATGTNC+NNGAGCA") into a separate tag that duplicate markers could use?

@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA

Thanks.

magicDGS commented 4 years ago

Hello @TendoLiu - the name of your read looks a bit weird to me, as it contains a Casava barcode (1:N:0:TCCGGAGA) and the UMI appended to the read name (TATGTNC+NNGAGCA). Is this a FASTQ or a BAM file?

ReadTools is a bit "picky" with read names, as it only understands 2 formats that are common:

ReadTools can handle only one of the problems that you are facing: the barcode separator can be overridden with the Java property barcode_index_delimiter (so providing -Dbarcode_index_delimiter=+ in your case), although the overridden separator will then also be used for all the output files. Nevertheless, I am not sure that your use case matches AssignReadGroupByBarcode, as it is designed for barcodes (like the one after the space) and not for UMIs (I am not familiar with them, but maybe appending them to the read name with : as the separator is a standard there...)

Could you please clarify these points? Thanks!
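
For reference, the read name from the question can be split outside ReadTools with a few lines of Python. This is only a minimal sketch, not a ReadTools feature: it assumes the UMI is always the last colon-separated field of the read name, that it contains only A/C/G/T/N plus the + joining the two halves, and that everything after the first space is the Casava-style comment. The function name split_umi_from_header is hypothetical. The + is replaced with a hyphen because the SAM specification recommends joining multiple UMIs with a hyphen in the RX tag.

```python
def split_umi_from_header(header, umi_sep=":"):
    """Split a FASTQ header into (read_name, umi, comment).

    Hypothetical helper: assumes the UMI is the last `umi_sep`-delimited
    field of the read name, as in the header shown in this issue.
    """
    line = header.rstrip("\n")
    if line.startswith("@"):
        line = line[1:]
    # The Casava-style comment (e.g. "1:N:0:TCCGGAGA") follows the first space.
    name, _, comment = line.partition(" ")
    base, sep, umi = name.rpartition(umi_sep)
    # Only accept a suffix that looks like bases (optionally two halves joined by "+").
    if not sep or not umi or set(umi) - set("ACGTN+"):
        return name, None, comment
    # Join dual UMIs with "-" as recommended for the SAM RX tag.
    return base, umi.replace("+", "-"), comment


header = "@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA"
name, umi, comment = split_umi_from_header(header)
print(name)            # NS500211:808:HW27KAFXY:1:11101:12228:1057
print(f"RX:Z:{umi}")   # RX:Z:TATGTNC-NNGAGCA
print(comment)         # 1:N:0:TCCGGAGA
```

Whether a given duplicate marker reads the UMI from the RX tag depends on the tool, so its documentation should be checked before relying on this layout.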