vpc-ccg / calib

Calib clusters barcode tagged paired-end reads based on their barcode and sequence similarity.
MIT License

What's the best way to generate de-duplicated fastq files from the cluster file? #31

Closed sneha-nishtala closed 4 years ago

sneha-nishtala commented 4 years ago

Hello,

I have 2 questions -

  1. My UMI barcodes are in this format - 'XXXXXXXXNNNNNNNNN'. So, when I use calib, my -l option for barcode tag length should be 8 or 17?
  2. After running calib and generating the cluster file, what is the best way to generate deduplicated fastq files?

Thanks!

baraaorabi commented 4 years ago

Are these single end or paired end reads?

The -l parameter is symmetric, so -l 8 means 8 bases on each mate (8+8). Alternatively, you can use the -l1 and -l2 parameters to set the lengths for the first and second mates separately.

For deduplicating, Calib has a consensus module (calib_cons) that performs a fast multiple sequence alignment and column by column majority consensus for each cluster generated by Calib. I would recommend using that for deduplication.
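
To illustrate what a column-by-column majority consensus means, here is a minimal Python sketch. It is not the calib_cons implementation: it assumes the reads of a cluster are already aligned and of equal length, whereas calib_cons performs the multiple sequence alignment itself. The function name `majority_consensus` and the toy cluster are made up for this example.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Majority vote per column over equal-length, already-aligned reads.

    Assumption: the alignment step has already been done. Ties are broken
    in favour of the base seen first in that column.
    """
    if not aligned_reads:
        raise ValueError("cluster is empty")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) returns [(base, count)] for the most frequent base
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy cluster: the third read carries a single mismatch at position 3.
cluster = ["ACGTAC", "ACGTAC", "ACCTAC"]
print(majority_consensus(cluster))  # ACGTAC
```

The single mismatch is outvoted by the other two reads, so each cluster collapses to one representative sequence.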

sneha-nishtala commented 4 years ago

Thank you for your quick response!

This is paired-end data. I have the barcode in this format - 'XXXXXXXXNNNNNNNNN' followed by the read. Here is a snippet of my R1.fastq file -

@M00206:61:000000000-G5N8B:1:1101:16313:1364 1:N:0:7
AGTTCAGGACTAAGACACTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCC
+
3AABBFFFFBFFGGGGGBABBBFADBBBBGGGGGFGEGFHGGCHHBFGHGGGGEFGHHBFHHHGFFFHHHHHHHHGHGGGG/EEEGHGGHFFE@EG/FFGBFGGFDDFHFFGHHFFHGHGHHFGCGBFGHHG?EGGCGHHHGDC.>.=GHHHFEHGHHEHGHGGG
@M00206:61:000000000-G5N8B:1:1101:15143:1377 1:N:0:7
AGTTCAGGTGTAAACCTCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAG
+
AABBABFFBCB4AGGGGBBB@BFFFDDD@EEFAEFGFGFE??AFHHHGDDGGHHFGHHGGGGG
@M00206:61:000000000-G5N8B:1:1101:16722:1379 1:N:0:7
AGTTCAGGGTAAAAGTTCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGCAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAG

baraaorabi commented 4 years ago

My pleasure!

So does this mean your barcode is only on R1? If so, you should use -l1 17 and -l2 0.

Also, what does X represent in the barcode?

sneha-nishtala commented 4 years ago

So, I have 4 files - R1, R2, I1 and I2. I1 has barcodes which all start with AGTTCAGG followed by 9 unique bases. I2 has the same set of bases throughout - CCAACAGA. So X basically represents the index and N represents the unique barcode.

So, I added the I1 barcode to R1 and didn't add anything to R2. Therefore, based on your suggestion, I think -l1 17 and -l2 0 makes sense. Right?
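
For readers who want to reproduce the step of adding the I1 barcode to R1, here is a rough Python sketch. It is an assumption about how this could be done, not the script used in this thread: `read_fastq` and `prepend_barcode` are hypothetical helpers, and the records below are toy data. It assumes plain 4-line FASTQ records and that R1 and I1 list the reads in the same order.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def prepend_barcode(read, index):
    """Put the index read's bases and qualities in front of the R1 record."""
    r_head, r_seq, r_qual = read
    i_head, i_seq, i_qual = index
    # The read ID is the part of the header before the first space.
    if r_head.split()[0] != i_head.split()[0]:
        raise ValueError(f"read ID mismatch: {r_head} vs {i_head}")
    return r_head, i_seq + r_seq, i_qual + r_qual


# Toy in-memory files standing in for R1.fastq and I1.fastq.
r1 = io.StringIO("@read1 1:N:0:7\nACGT\n+\nFFFF\n")
i1 = io.StringIO("@read1 1:N:0:7\nAGTTCAGGTGTAAACCT\n+\nBBBBBBBBBBBBBBBBB\n")

for record in map(prepend_barcode, read_fastq(r1), read_fastq(i1)):
    header, seq, qual = record
    print(f"{header}\n{seq}\n+\n{qual}")
```

Prepending the quality string together with the bases keeps the sequence and quality lines the same length, which FASTQ requires.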

baraaorabi commented 4 years ago

Yes, I think it makes sense to prepend the barcodes to R1 like you did. However, since the first 8 bases are always the same, I would drop them and run with -l1 9 and -l2 0. Longer barcodes take longer to process, and the extra bases add no discriminatory information to the clustering process since the 8-base sequence is constant.
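
Dropping the constant 8 bases could look like the following Python sketch. It is only an illustration under stated assumptions: `strip_constant_prefix` is a hypothetical helper, the constant index is taken to be AGTTCAGG as described in this thread, and reads that do not start with it are rejected rather than silently trimmed.

```python
CONSTANT_INDEX = "AGTTCAGG"  # the 8 bases shared by every I1 read in this thread
UMI_LENGTH = 9               # the variable bases that follow the index


def strip_constant_prefix(seq, qual, prefix=CONSTANT_INDEX):
    """Remove the constant index from the front of a read and its qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    if not seq.startswith(prefix):
        raise ValueError(f"read does not start with {prefix}")
    return seq[len(prefix):], qual[len(prefix):]


# Toy record: constant index + 9-base UMI + a few bases of the actual read.
seq = "AGTTCAGG" + "TGTAAACCT" + "CTTGG"
qual = "A" * 8 + "B" * 9 + "C" * 5

trimmed_seq, trimmed_qual = strip_constant_prefix(seq, qual)
print(trimmed_seq)               # TGTAAACCTCTTGG
print(trimmed_seq[:UMI_LENGTH])  # TGTAAACCT
```

After this trimming the read starts with the 9 variable bases, which matches the -l1 9 and -l2 0 setting suggested above.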

sneha-nishtala commented 4 years ago

Got it, Thank you!