pachterlab / kite

kallisto index tag extractor
BSD 2-Clause "Simplified" License

Support for ASAP-seq #5

Closed by caleblareau 3 years ago

caleblareau commented 3 years ago

Hi kite team,

We recently put out a pre-print on quantifying protein counts using 10x scATAC-seq (https://www.biorxiv.org/content/10.1101/2020.09.08.286914v1). We used an intermediate python script to convert the output of cellranger-atac mkfastq into something that we could fit into kite (specifically, we made the ATAC R1/R2/R3 look like an R1/R2 from a 10x v2 feature barcoding experiment). Funnily enough, my simple python script to cut and paste parts of fastqs was markedly slower than quantifying the tag abundances with kite, so direct support for this assay would probably reduce the workflow time dramatically.

This python script has the key function (https://github.com/caleblareau/asap_to_kite/blob/master/asap_to_kite_v2.py#L104) that converts the reads before I feed them through kite. I don't have a good sense of how hard it would be to modify this tool to accept the R1/R2/R3 format of the scATAC data, which seems to be the barrier to kite supporting this type of data directly.

Any input on the feasibility of supporting this would be great! Thanks! -Caleb
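For readers who want a picture of what "making R1/R2/R3 look like a v2 R1/R2" involves, here is a minimal sketch. It is not the linked script. It assumes the standard 10x v2 layout (R1 = 16 bp cell barcode followed by a 10 bp UMI, R2 = the tag read) and takes the barcode-, UMI- and tag-carrying reads as three separate arguments, since which of the ATAC files holds which piece is not spelled out here. `FastqRecord`, `to_v2_pair` and `format_record` are hypothetical names, and any reverse-complementing of the barcode read is left out.

```python
from typing import NamedTuple, Tuple

CELL_BARCODE_LEN = 16  # 10x v2 R1 starts with a 16 bp cell barcode
UMI_LEN = 10           # followed by a 10 bp UMI


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    title: str
    sequence: str
    quality: str


def to_v2_pair(barcode_read: FastqRecord,
               umi_read: FastqRecord,
               tag_read: FastqRecord) -> Tuple[FastqRecord, FastqRecord]:
    """Stitch three reads into a v2-style (R1, R2) pair.

    Assumption: the barcode sits in the first 16 bases of `barcode_read`
    and the UMI/UBI in the first 10 bases of `umi_read`.
    """
    seq = barcode_read.sequence[:CELL_BARCODE_LEN] + umi_read.sequence[:UMI_LEN]
    qual = barcode_read.quality[:CELL_BARCODE_LEN] + umi_read.quality[:UMI_LEN]
    if len(seq) != CELL_BARCODE_LEN + UMI_LEN:
        raise ValueError("barcode or UMI read is shorter than expected")
    new_r1 = FastqRecord(barcode_read.title, seq, qual)
    # The tag read is passed through unchanged as the new R2.
    new_r2 = FastqRecord(tag_read.title, tag_read.sequence, tag_read.quality)
    return new_r1, new_r2


def format_record(rec: FastqRecord) -> str:
    """Render a record as four FASTQ lines."""
    return f"@{rec.title}\n{rec.sequence}\n+\n{rec.quality}\n"
```

Doing this record by record in pure Python is simple but slow on full-size fastqs, which is the bottleneck described above.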

jasegehring commented 3 years ago

Hi Caleb,

Congrats on the recent manuscript! I was checking it out myself just the other day - very cool.

OK, I'm not super familiar with the 10x ATAC-seq workflow, but here are my thoughts. It looks like the ATAC-seq protocol generates a cell barcode read (R3) and two genomic DNA reads (R1/R2). If you know which read your antibody barcodes will be in (either R1 or R2), then you might be able to just drop the other one and run kite only on the reads that include the cell barcode and the antibody oligo. If you don't know whether the antibody barcodes will show up in R1 or R2, then my suggestion would be to simply run kite on R1/R3 and then again on R2/R3, and deal with any necessary merging at the BUS file stage, where things are small and quick to parse.
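As a rough illustration of the "merge at the BUS file stage" idea, here is a minimal Python sketch. It assumes each run's BUS file has already been dumped to tab-separated text with the columns barcode, UMI, equivalence class and count, and that both runs used the same index so the equivalence class IDs are comparable. `merge_bus_text` is a hypothetical helper, not part of kite or bustools.

```python
from collections import Counter
from typing import Iterable, Tuple

BusKey = Tuple[str, str, int]  # (barcode, UMI, equivalence class)


def merge_bus_text(*dumps: Iterable[str]) -> "Counter[BusKey]":
    """Sum counts across several text dumps of BUS records.

    Assumption: every non-empty line is
    barcode<TAB>UMI<TAB>equivalence class<TAB>count.
    """
    merged: "Counter[BusKey]" = Counter()
    for dump in dumps:
        for line in dump:
            line = line.strip()
            if not line:
                continue
            barcode, umi, ec, count = line.split("\t")[:4]
            merged[(barcode, umi, int(ec))] += int(count)
    return merged
```

Each argument can be an open file handle or a list of lines, e.g. `merge_bus_text(open("r1_r3.txt"), open("r2_r3.txt"))` with hypothetical file names.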

I hope this is helpful and makes sense. If not, or if you just find it sort of gross and sketchy, I'm happy to do a quick call and see if we can sort this out and come up with a better solution.

The kallisto | bustools team may be able to incorporate a third read into the workflow, but my guess is this would take quite a bit of engineering with relatively limited applications since kallisto is a poor choice for genome alignments. I could be wrong, though.

Looking forward to your thoughts and congrats again on the cool workflow.

Jase

caleblareau commented 3 years ago

Hey Jase, Thanks for the quick and detailed response and the kind words!

Unfortunately, the UMI (or in our case UBI), cell barcode, and antibody barcode are each encoded in a different file, so there's no way for me to supply only 2 of the 3 files. It seems like we will necessarily have to wrap the R1/R2/R3 into another format. We have an existing solution (the script linked above) that I can keep using, and I will suggest that others do the same.

"The kallisto | bustools team may be able to incorporate a third read into the workflow, but my guess is this would take quite a bit of engineering with relatively limited applications since kallisto is a poor choice for genome alignments."

^^ this was my sense too but wanted to raise it just in case.

Thanks again for the response and for maintaining kite... I use it almost every day, it seems! -Caleb