marbl / merqury

k-mer based assembly evaluation

10x chromium barcode trimming #66

Closed sstwins21 closed 9 months ago

sstwins21 commented 2 years ago

Hi,

This is a really nice program! Could I clarify something, please? I want to use Chromium 10x reads to count the k-mers. The documentation says to remove the first 23 bases from the first reads. But the Chromium 10x document I looked at says that the first 16 bases are the barcode and the next 10 bases are the UMI. So shouldn't I trim the first 26 bases rather than 23?

Also, after trimming those bases, can I assume it would be safe to map those reads using BWA or Bowtie2 rather than longranger?

Thank you for your help.

Kind regards, Shane

arangrhie commented 2 years ago

Hello Shane,

Thanks for the heads up!

A couple of years back, when I reached out to the 10X team, I was told to remove 6 Illumina library + 1 padding + 16 barcode bases (so 6 + 1 + 16 = 23).

Perhaps the protocol has been updated since then? Would you mind sharing the link to the Chromium 10x document? I'll update the doc and the script for counting k-mers in 10x reads accordingly.
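For illustration, here is a minimal Python sketch of the trimming step described above, assuming a plain 4-line-per-record FASTQ for read 1. The function name `trim_read1` and the constant `TRIM_LEN` are hypothetical and are not part of merqury or its scripts; the 23 comes from the 6 + 1 + 16 breakdown above and would need to change if the library layout differs.

```python
import io

# 6 Illumina library bases + 1 padding base + 16 barcode bases (assumed layout).
TRIM_LEN = 6 + 1 + 16


def trim_read1(fastq_in, fastq_out, trim_len=TRIM_LEN):
    """Copy 4-line FASTQ records, dropping the first trim_len bases and qualities.

    Records with no bases left after trimming are skipped. That breaks R1/R2
    pairing, which does not matter for k-mer counting but does for alignment.
    Returns the number of records written.
    """
    written = 0
    while True:
        header = fastq_in.readline()
        if not header:
            break  # end of file
        seq = fastq_in.readline().rstrip("\n")
        plus = fastq_in.readline().rstrip("\n")
        qual = fastq_in.readline().rstrip("\n")
        if len(seq) <= trim_len:
            continue  # nothing would remain after trimming
        fastq_out.write(header.rstrip("\n") + "\n")
        fastq_out.write(seq[trim_len:] + "\n")
        fastq_out.write(plus + "\n")
        fastq_out.write(qual[trim_len:] + "\n")
        written += 1
    return written


# Tiny in-memory demo: 23 leading bases to drop, 3 bases to keep.
demo_in = io.StringIO("@read1\n" + "A" * 23 + "CGT\n+\n" + "I" * 23 + "FFF\n")
demo_out = io.StringIO()
trim_read1(demo_in, demo_out)
```

For real data, the same function can be given handles from `gzip.open(path, "rt")` and `gzip.open(path, "wt")`. Only read 1 is trimmed; read 2 carries no barcode in this layout and is left unchanged.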

Once the barcodes are trimmed, the reads become regular paired end Illumina sequences, and thus longranger would not be able to run properly. You'd need to use other aligners such as BWA or Bowtie2.
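As a rough sketch of what aligning the trimmed reads as ordinary paired-end data could look like, the snippet below builds and runs a `bwa mem` command from Python. It assumes `bwa` is on the `PATH` and that the reference has already been indexed with `bwa index`; the helper names and all file paths are hypothetical placeholders, and barcode information is not used in any way.

```python
import subprocess


def bwa_mem_command(reference, read1, read2, threads=8):
    """Build the argument list for a standard paired-end `bwa mem` run."""
    return ["bwa", "mem", "-t", str(threads), reference, read1, read2]


def run_bwa_mem(reference, read1, read2, sam_out, threads=8):
    """Run bwa mem and write its SAM output (stdout) to sam_out."""
    cmd = bwa_mem_command(reference, read1, read2, threads)
    with open(sam_out, "w") as out:
        # check=True raises CalledProcessError if bwa exits with an error.
        subprocess.run(cmd, stdout=out, check=True)
```

A call such as `run_bwa_mem("ref.fa", "trimmed_R1.fastq.gz", "R2.fastq.gz", "aln.sam", threads=16)` would then produce an unsorted SAM file. The read pairs must still be in the same order in both files, so any trimming step must not drop reads from one file only.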

Longranger is designed to use the barcode information and runs BWA internally. May I ask the reason for ignoring the barcodes for mapping?

sstwins21 commented 2 years ago

Hi,

Thank you so much for the fast reply. This is the document I saw: chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/viewer.html?pdfurl=https%3A%2F%2Fteichlab.github.io%2Fscg_lib_structs%2Fdata%2FCG000108_AssayConfiguration_SC3v2.pdf&clen=1230360&chunk=true

Maybe the barcode length depends on the library? Sorry, I am not too familiar with 10x reads and would really like to clarify this before I use them. I tried using longranger to align the reads to the reference genome, but it was taking too long. So I was hoping that just using BWA or Bowtie2 might be faster, and I wanted to test them.

The reads are also high coverage, so if Bowtie2 or BWA is faster than longranger, I would like to use the faster one. But would you recommend using longranger instead?

Thank you so much for your help.

Kind regards, Shane

arangrhie commented 2 years ago

Hello @sstwins21 , I'm afraid I can't access your link :) Would you mind sending that pdf to my email? It's rhiea @ nih.gov with no space.

I think it makes more sense to follow the best practices for alignment and variant calling; or at least use a tool that does barcode aware alignment. Otherwise it seems like a loss of information to me...

Check whether longranger is running on a single node / single CPU. It seems like it is? This doc https://support.10xgenomics.com/genome-exome/software/pipelines/latest/advanced/cluster-mode has guidance on running it on clusters, if you have them. Otherwise, check the longranger doc on how to run it on multiple CPUs. Try contacting their support team; they were pretty responsive and fast.