Closed sneha-nishtala closed 4 years ago
Are these single end or paired end reads?
The -l parameter is symmetric, so l=8 means 8+8. You can alternatively use the -l1 and -l2 parameters for first and second mates.
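To make the symmetric versus per-mate distinction concrete, here is a minimal sketch of how the tag length works out. This is not Calib's code: `barcode_tag` is a hypothetical helper, and it assumes the barcode sits at the start of each mate.

```python
def barcode_tag(r1_seq, r2_seq, l1, l2):
    """Return the tag built from the first l1 bases of mate 1 and the
    first l2 bases of mate 2.

    Hypothetical helper for illustration only; not part of Calib.
    Assumes the barcode is located at the start of each mate.
    """
    if l1 < 0 or l2 < 0:
        raise ValueError("barcode lengths must be non-negative")
    return r1_seq[:l1] + r2_seq[:l2]


# Symmetric case: l=8 behaves like l1=8 and l2=8, giving a 16-base tag.
symmetric = barcode_tag("A" * 20, "C" * 20, 8, 8)

# Barcode on the first mate only: all 17 bases come from mate 1.
first_mate_only = barcode_tag("ACGTACGTAACCGGTTAGG", "TTTT", 17, 0)
```

With l1=17 and l2=0 the second mate contributes nothing to the tag.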
For deduplicating, Calib has a consensus module (calib_cons) that performs a fast multiple sequence alignment and column-by-column majority consensus for each cluster generated by Calib. I would recommend using that for deduplication.
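As a rough illustration of the column-by-column majority step only, here is a small sketch. It is not how calib_cons is implemented: it skips the multiple sequence alignment entirely and assumes the reads of a cluster are already aligned to the same length. `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-by-column majority vote over equal-length, pre-aligned reads.

    Illustration only: the alignment step is assumed to have been done.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Pick the most frequent base in this column; on a tie, the base
        # seen first in the column wins.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


cluster = ["ACGT", "ACGA", "ACGT"]
consensus_read = majority_consensus(cluster)
```

Each cluster then collapses to a single consensus read, which is what makes the output deduplicated.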
On Wed., Jul. 8, 2020, 1:56 p.m. Sneha Nishtala, notifications@github.com wrote:
Hello,
I have 2 questions -
- My UMI barcodes are in this format - 'XXXXXXXXNNNNNNNNN'. So, when I use calib, my -l option for barcode tag length should be 8 or 17?
- After running calib and generating the cluster file, what is the best way to generate deduplicated fastq files?
Thanks!
Thank you for your quick response!
This is paired-end data. I have the barcode in this format - 'XXXXXXXXNNNNNNNNN' followed by the read. Here is a snippet of my R1.fastq file -
@M00206:61:000000000-G5N8B:1:1101:16313:1364 1:N:0:7
AGTTCAGGACTAAGACACTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCC
+
3AABBFFFFBFFGGGGGBABBBFADBBBBGGGGGFGEGFHGGCHHBFGHGGGGEFGHHBFHHHGFFFHHHHHHHHGHGGGG/EEEGHGGHFFE@EG/FFGBFGGFDDFHFFGHHFFHGHGHHFGCGBFGHHG?EGGCGHHHGDC.>.=GHHHFEHGHHEHGHGGG
@M00206:61:000000000-G5N8B:1:1101:15143:1377 1:N:0:7
AGTTCAGGTGTAAACCTCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAG
+
AABBABFFBCB4AGGGGBBB@BFFFDDD@EEFAEFGFGFE??AFHHHGDDGGHHFGHHGGGGG
@M00206:61:000000000-G5N8B:1:1101:16722:1379 1:N:0:7
AGTTCAGGGTAAAAGTTCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGCAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAG
My pleasure!
So does this mean your barcode is only on R1? If so, you should use -l1 17 and -l2 0.
Also, what does X represent in the barcode?
So, I have 4 files - R1,R2, I1 and I2. I1 has barcodes which all start with AGTTCAGG followed by 9 unique bases. I2 has the same set of bases throughout - CCAACAGA. So X basically represents the index and N represents the unique barcode.
So, I added the I1 barcode to R1 and didn't add anything to R2. Therefore, based on your suggestion, I think -l1 17 and -l2 0 makes sense. Right?
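For anyone following along, the prepending step described here could look roughly like the sketch below. It is only an illustration: `prepend_barcode` is a hypothetical helper, records are assumed to be (header, sequence, quality) tuples, and the read ID is assumed to be the part of the header before the first space.

```python
def prepend_barcode(r1_record, i1_record):
    """Prepend the I1 sequence and qualities to the matching R1 record.

    Hypothetical helper; each record is a (header, sequence, quality) tuple.
    """
    r1_header, r1_seq, r1_qual = r1_record
    i1_header, i1_seq, i1_qual = i1_record

    # Assumption: the read ID is everything before the first space, and
    # R1 and I1 list the same reads in the same order.
    if r1_header.split(" ")[0] != i1_header.split(" ")[0]:
        raise ValueError("R1 and I1 records do not belong to the same read")

    return r1_header, i1_seq + r1_seq, i1_qual + r1_qual


r1 = ("@read1 1:N:0:7", "ACGT", "FFFF")
i1 = ("@read1 1:N:0:7", "AGTTCAGGACTAAGACA", "G" * 17)
tagged = prepend_barcode(r1, i1)
```

The qualities are prepended together with the bases so that the sequence and quality lines stay the same length.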
Yes, I think it makes sense to prepend the barcodes to R1 like you did. However, since the first 8 bases are always the same, I would drop them and run with -l1 9 and -l2 0. Longer barcodes take longer to process, and the extra bases add no discriminatory information to the clustering process since the 8-base sequence is constant.
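Dropping the constant bases could be done with something like the sketch below. It is a minimal illustration, not a Calib utility: `read_fastq` and `trim_prefix` are hypothetical helpers, and the input is assumed to be plain four-line FASTQ records with the 17-base barcode already prepended to R1.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly four lines per record; an incomplete trailing record
    is silently skipped.
    """
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def trim_prefix(records, n):
    """Yield records with the first n bases and qualities removed."""
    for header, seq, qual in records:
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header, seq[n:], qual[n:]


# Toy record: 8 constant bases (AGTTCAGG) followed by a 9-base barcode.
lines = ["@r1\n", "AGTTCAGGACTAAGACA\n", "+\n", "ABCDEFGHIJKLMNOPQ\n"]
trimmed = list(trim_prefix(read_fastq(lines), 8))
```

After trimming, only the 9 variable bases remain at the start of R1, which matches running with -l1 9 and -l2 0.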
Got it, Thank you!