Open TendoLiu opened 4 years ago
Hello @TendoLiu - the name of your read looks a bit weird to me, as it contains a Casava barcode (`1:N:0:TCCGGAGA`) and the UMI appended to the read name (`TATGTNC+NNGAGCA`). Is this a FASTQ or a BAM file?
`ReadTools` is a bit "picky" with read names, as it only understands 2 common formats:

- `@NS500211:808:HW27KAFXY:1:11101:12228:1057 1:N:0:TCCGGAGA`, where the identified barcode will be `TCCGGAGA`
- `@NS500211:808:HW27KAFXY:1:11101:12228:1057#TATGTNC+NNGAGCA`, where the identified barcode will be `TATGTNC+NNGAGCA`. Note that, contrary to your case, the barcodes are separated from the read name by `#` instead of `:`, and that only one barcode is detected, as `+` is used for concatenation instead of the standard separator (in the specs), which is `-`.

`ReadTools` can handle only one of the problems that you are facing: the barcode separator can be overridden (although it will still be used for all the output files) with the Java property `barcode_index_delimiter` (so providing `-Dbarcode_index_delimiter=+` in your case). Nevertheless, I am not sure if your use case matches `AssignReadGroupByBarcode`, as it is designed for barcodes (like the one after the space) and not for UMIs (I am not familiar with them, but maybe appending them to the read name with `:` as separator is a standard there...).
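
To make the two layouts above concrete, here is a minimal Python sketch of how such read names could be split. This is not the parser `ReadTools` uses; the regular expressions and the function name `parse_read_name` are assumptions made only for illustration.

```python
import re

# Layout 1 (assumed pattern): "@<name> <read>:<filtered>:<control>:<barcode>"
CASAVA = re.compile(r"^@(?P<name>\S+) [12]:[YN]:\d+:(?P<barcode>\S+)$")
# Layout 2 (assumed pattern): "@<name>#<barcode>"
HASH = re.compile(r"^@(?P<name>[^#\s]+)#(?P<barcode>\S+)$")


def parse_read_name(header):
    """Return (name, barcode) if the header fits one of the two layouts, else None."""
    for pattern in (CASAVA, HASH):
        match = pattern.match(header)
        if match:
            return match.group("name"), match.group("barcode")
    return None


if __name__ == "__main__":
    print(parse_read_name("@NS500211:808:HW27KAFXY:1:11101:12228:1057 1:N:0:TCCGGAGA"))
    print(parse_read_name("@NS500211:808:HW27KAFXY:1:11101:12228:1057#TATGTNC+NNGAGCA"))
    # The header from this issue: the UMI stays inside the name.
    print(parse_read_name(
        "@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA"
    ))
```

Under this sketch, the header from this issue fits the first layout: the barcode found is `TCCGGAGA`, and the UMI is simply treated as part of the read name.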
Could you please clarify with this information? Thanks!
Hi, I have been working on UMI collapsing of Illumina DNA-seq data. The FASTQ header looks like the one below. I wonder if there is a way to transfer all the UMIs like "TATGTNC+NNGAGCA" to a separate tag which could be used by duplicate markers?
@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA
Thanks.
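
For reference, the transformation asked about here can be sketched as a small pre-processing step outside of `ReadTools`. The function name `split_umi` is hypothetical, and the sketch assumes the UMI is always the last `:`-separated field of the read name. It writes the UMI as an `RX:Z:` string, since `RX` is the tag the SAM optional-fields specification reserves for UMI bases, with `-` joining multiple UMIs. Whether a given duplicate marker reads that tag depends on the tool.

```python
def split_umi(header, umi_separator=":", sam_umi_joiner="-"):
    """Split '@<name>:<UMI> <comment>' into (clean_name, rx_tag, comment).

    Assumes the UMI is the last field of the read name (hypothetical helper).
    """
    # Drop the leading "@" and separate the read name from the Casava comment.
    text = header[1:] if header.startswith("@") else header
    name, _, comment = text.partition(" ")
    # The UMI is everything after the last separator in the read name.
    clean_name, _, umi = name.rpartition(umi_separator)
    # The header joins the two UMI halves with "+"; replace it with the chosen joiner.
    rx_tag = "RX:Z:" + umi.replace("+", sam_umi_joiner)
    return clean_name, rx_tag, comment


if __name__ == "__main__":
    header = "@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA"
    print(split_umi(header))
```

For the header above, this yields the read name `NS500211:808:HW27KAFXY:1:11101:12228:1057`, the tag string `RX:Z:TATGTNC-NNGAGCA`, and the untouched comment `1:N:0:TCCGGAGA`.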