pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

Restructure unmapped SMART-seq3 BAM from ENA to proper fastqs for kb #230

Closed biobenkj closed 6 months ago

biobenkj commented 6 months ago

Describe the issue After pulling a SMART-seq3 unmapped bam, which is effectively after the initial demultiplexing step out of zUMIs (https://github.com/sdparekh/zUMIs/wiki/Protocol-specific-setup) containing error corrected cell barcodes and both read types (UMI containing and internal) as noted by the UB tag for UMI containing reads and BC tag is error corrected, I'm wondering if there is a way to reconstruct the expected SS3 fastqs for input into kb python?

Example lines from unmapped BAM

V300090689L1C001R0020000368/1   77  *   0   0   *   *   0   0   CAACCANCCGGAGCAGGCAAGGCTGGAACGGGAACGAGGGGTCCCCCCACCCCCAGGACCGCCCTCTGAGCCACGCGACTGCGACTTTGCCCAGAAGGGA    :FFE9F!FEF-CF>EDFFDE2>EFCF=FEFFC>>8FAFEB?E)BED2FF@6F>FA)/'/::BEFDC;;=FAC;,65E8DD6.)D?*FEB@F4CF:B?8?;    BX:Z:TACGTAATCGCGGATTCCTG   BC:Z:TACGTAATCGCGGATTCCTG   UB:Z:   QB:Z:EFEA@FDFFFFFDEFFFFFE   QU:Z:
V300090689L1C001R0020000368/1   141 *   0   0   *   *   0   0   GCCTCCAGCTCTCTTCACAAGAGCGCTTGTCAGAGTCATCCCCGCTGTCCGAGTCTGCATACAGTCGCTGCTGCTCCAGCAGAAGATCGGCTTCTTCCTT    EFFFFFFEEFFFDFF9CFFED8ED1CF;FDFFF;FFFFFCC@FAFFFDF>DDFAFFF-FDFEE8@FAECFFEFEFFFEBF7BCFC0?FD43FF=@FAEEF    BX:Z:TACGTAATCGCGGATTCCTG   BC:Z:TACGTAATCGCGGATTCCTG   UB:Z:   QB:Z:EFEA@FDFFFFFDEFFFFFE   QU:Z:
V300090689L1C001R0020000489/1   77  *   0   0   *   *   0   0   ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTTATCACCTTTCATGATC    )FF7FGF@DFFFFFFFCEF>FFFFF:CFFFF6DCF9F5AFCGFE:F6FG=ED7@F5GECF:FGEFFDG8DFFF<?F    BX:Z:TTGGTTCCAATTCTGTCTGA   BC:Z:TTGGTTCCAATTCTGTCTGA   UB:Z:CGTTAGATTT QB:Z:BEFEEEFFFFDFEFE@?FE+   QU:Z:FG<:FFBFFF
V300090689L1C001R0020000489/1   141 *   0   0   *   *   0   0   GTACTACTCGATTGTCAACGTCAAGGAGTCGCAGGTCGCCTGGGTCTAGGAATAATGGGGGAAGTATGTAGGAGTTGAAGATTAGGCCGCCGGAGTCGGG    BE@:D;DC8B=D9FC<9=(>;BE7DA<EA@BC7C@*C56?CB:)AB=FA87F=F>>=?(F97BE1F&DEAED<D;AEDB</:-?<)<C;+>D'B;4CC?)    BX:Z:TTGGTTCCAATTCTGTCTGA   BC:Z:TTGGTTCCAATTCTGTCTGA   UB:Z:CGTTAGATTT QB:Z:BEFEEEFFFFDFEFE@?FE+   QU:Z:FG<:FFBFFF
V300090689L1C001R0020000928/1   77  *   0   0   *   *   0   0   CCTCATNTCTCAAGATGCTGGACGAGATGAAGGACAACCTGCTCCTGGACATGTATCTGGCCCCCCATGGCTGGACCCTGTACACCCACATTCGCAACCG    F94FEF!A@EFF8E>DB=FFFDFFEF+FC/)FB7FF7EEE5FCFFEFA5@FDFEDEGDFDEFFF=4EE57F'FDEEEFDF:8FFEF7*+8DFFFA(FAFE    BX:Z:CACCAAGGACCGGATTCCTG   BC:Z:CACCAAGGACCGGATTCCTG   UB:Z:   QB:Z:FFFFFFGFB>FEFBFFFFFD   QU:Z:
V300090689L1C001R0020000928/1   141 *   0   0   *   *   0   0   GGCTGTTGGCTGGAGTCAGCTCCCCCTGGCTCCCTTCTCTGGGCGGGGACTTGCTGTGTGGGTCCACGCGGGCACTGATCAGCCCCTCCAGGATTAGCTG    FEEFFEDFFEEFB6EDF:F<AFFFFA;FFD=EE7?DFEF>A<<F7F=F=FCEFEEFFF<F.F&/BFF'9D,FFF2;FAE7DFEEFFD<DCEE15A<FFE<    BX:Z:CACCAAGGACCGGATTCCTG   BC:Z:CACCAAGGACCGGATTCCTG   UB:Z:   QB:Z:FFFFFFGFB>FEFBFFFFFD   QU:Z:
V300090689L1C001R0020000976/1   77  *   0   0   *   *   0   0   GCTTTTGCAGGTGGCGGCGAACGCGGAGAGCACGCCATGAAGGCCTCGGGCACGCTACGAGAGTACAAGGTAGTGG    FDFGDFFFEF6=FF@DEFF9CEEDFF<F/FFFFFDFFFF8EFFDFF97F?EDAFEF7FDF4C6+@@DCG4DAF1>?    BX:Z:TCTCTAGACTAGTCCTCGTC   BC:Z:TCTCTAGACTAGTCCTCGTC   UB:Z:TCGTGCGTAT QB:Z:.;E@C(C398@A>9DE<;CA   QU:Z:GEAFFFFFFF
V300090689L1C001R0020000976/1   141 *   0   0   *   *   0   0   TGTTGTGGGTGCCGCTCCGGGAGTCATAGCGCAGCCAGATCCCGAAGATCTTCACCCGCAGGGGGGACTTCTCAAACACCTCCCCACAGTAGACAATCTC    ECD?F>C9E:@0CD@CA6:-D-C>7>6<=)<8:F7<D9-A;A4/>.F%=';4*4BDBF>.;E&=?D>8'E3;B)C>?7=<D+2E>@CBC.,B*0,70'7>    BX:Z:TCTCTAGACTAGTCCTCGTC   BC:Z:TCTCTAGACTAGTCCTCGTC   UB:Z:TCGTGCGTAT QB:Z:.;E@C(C398@A>9DE<;CA   QU:Z:GEAFFFFFFF
V300090689L1C001R0020000997/1   77  *   0   0   *   *   0   0   GGATATGATAGTCTGCGCAGCGTATGCACACGAACTGCCAAAATATGGTGTGAAGGTTGGCCTGACAAATTATGCT    C?FFGFFFF@FFDEF7EFFEEF:FFFEFFFFF6F>F4F@FBFFFGD?FGFD6FFF9DFFEEFFF7*FEFFDG?;FF    BX:Z:TTGGTCAGTTCTCTGACGAA   BC:Z:TTGGTCAGTTCTCTGACGAA   UB:Z:CTGGGAGATA QB:Z:FFFFEFFFBFFFFFD@BF-E   QU:Z:BFB>E7EEFB
V300090689L1C001R0020000997/1   141 *   0   0   *   *   0   0   CCCATGATGTGCTTCCGATGTACTTCTGCATTAAATTCCTTGCTTTCAGAATCATAACCAGGGAATCGTTTGGTACTGTGAGGGATAGACAAGCCTCCAT    EFAFDA<FECFDEFDFFBEBEFEFEFFCFD9<ACEFFFEEF>ECFFFE@D7FFFFFF<=<29DE?F?F?EDCFEDFFC4F3BEE3CFDE@EFFE;F3EEB    BX:Z:TTGGTCAGTTCTCTGACGAA   BC:Z:TTGGTCAGTTCTCTGACGAA   UB:Z:CTGGGAGATA QB:Z:FFFFEFFFBFFFFFD@BF-E   QU:Z:BFB>E7EEFB
Yenaled commented 6 months ago

Yes -- you can certainly restructure the input so that it fits kb-python (because nothing from the input FASTQ was actually lost aside from some error correction). It is not a trivial task but it is 100% doable with writing a series of awk scripts (chatgpt can definitely help with that).

biobenkj commented 6 months ago

Sounds like a plan. Now for a really silly question - not every read in SS3 has a UMI read and it looks like the expected input for kb count is 2 sets of fastqs as you've outlined in https://github.com/pachterlab/kallisto/issues/297#issuecomment-1400763372:

Read 1 = UMI (and maybe cell barcode?) + cDNA Read 2 = cDNA

SMARTSEQ3    Smart-seq3         0,11,19    0,11,None 1,None,None

I might need a little guidance on how to reconstruct these - specifically, what goes in each of the expected 4 fastqs (or more specifically the index 1 and index 2 files that kb handles internally). I recognize this is beyond what an issue for kb looks like and no worries if this isn't something you'd like to do. Thanks!

Yenaled commented 6 months ago

For SS3, kb count takes in 4 FASTQs: The first two being the barcodes and the second two being the paired-end reads.

You should be put the barcodes (the BC:Z: from the first and second read pair) into the first two FASTQ files and the sequences (i.e. the actual reads themselves from the first and second read pair) into the last two FASTQ files.

The third FASTQ file should then be modified such that for each UMI read, an 11-bp constant tag (ATTGCGCAATG) is placed at the beginning of the read followed by the 8-bp UMI (I assume the UMI is the sequence in UB:Z:? You should probably check to make sure) followed by a 3-bp constant sequence (GGG).

The tricky part is that your smart-seq3 protocol looks different than the original one (i.e. there is a 10-bp UMI rather than an 8-bp one). So maybe you'll just have to use the first 8-bp's of UB:Z:.

If you want to use all 10 bp's of UB:Z:, you'll have to do custom processing:

  1. You can simply separate the non-UMI reads and the UMI reads into separate FASTQ files and process the non-UMI reads using kb count -x smartseq2 --parity=paired (as if it's smartseq2 data) by, again, putting the BC:Z: into the first two files and the actual read pairs themselves into the last two FASTQ files. For the UMI reads, you can put the BC:Z: into the first two FASTQ files, create a third file that contains the UMI, then use the fourth and fifth files to contain the actual paired-end reads. You then specify kb count -x 0,0,0,1,0,0:2,0,10:3,0,0,4,0,0 --parity=paired --strand=forward and that should work.

  2. An alternative option: You can reformat the reads (using 10-bp UMI rather than 8-bp UMI) into smart-seq3 format [like I described previously] but work outside the framework of kb count (using kallisto and bustools as standalone programs); such as using kallisto bus -x 0,0,0,1,0,0:2,0,21:2,24,0,3,0,0 --tag=ATTGCGCAATG --paired --fr-stranded ... which will then get you your BUS file that you can process just as kb count would (you can view the commands that kb count runs on a typical smart-seq3 run by using the --verbose option when processing a standard smart-seq3 dataset in kb count).

Anyway, it's a bit complicated but it will work (no guarantees that the exact instructions above will work since I didn't double check whether I made typos or errors).

biobenkj commented 6 months ago

Thank you for the multiple options and a very clear explanation of things! Yes, this is Smart-seq3xpress instead of smart-seq3.

So far, I think I'm probably going to go with alternative option 2 and work outside the kb framework. But for anyone who has a similar question or if it's useful to you, I'm drafting some python code to do the conversion and will modify as it crystallizes:

#!/usr/bin/env python3

import pysam
import argparse
import gzip

def convert_ss3_ubam_to_fastq(bam_file, fastq_prefix, ub_tag_length):
    # Set the check_sq to False as this is an unmapped bam
    with pysam.AlignmentFile(bam_file, "rb", check_sq=False) as bam:
        with gzip.open(f"{fastq_prefix}_I1.fastq.gz", "wt") as i1_file, \
             gzip.open(f"{fastq_prefix}_I2.fastq.gz", "wt") as i2_file, \
             gzip.open(f"{fastq_prefix}_R1.fastq.gz", "wt") as r1_file, \
             gzip.open(f"{fastq_prefix}_R2.fastq.gz", "wt") as r2_file:

            for read in bam.fetch(until_eof=True):
                # Extract tags
                bc = read.get_tag("BC")
                qb = read.get_tag("QB")
                ub = read.get_tag("UB")
                qu = read.get_tag("QU")

                # Determine read name
                read_name = f"@{read.query_name}"

                # Process R1 and R2 FASTQ files
                if read.is_read1:
                    seq, qual = read.query_sequence, read.qual
                    # Check UB tag length and prepend sequence if necessary
                    if len(ub) == ub_tag_length:
                        seq = "ATTGCGCAATG" + ub + "GGG" + seq
                        qual = "IIIIIIIIIII" + qu + "III" + qual
                    r1_file.write(f"{read_name}\n{seq}\n+\n{qual}\n")
                    # Write barcode info only for first read in pair
                    i1_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")
                elif read.is_read2:
                    r2_file.write(f"{read_name}\n{read.query_sequence}\n+\n{read.qual}\n")
                    # Only write to I2 for read2 to avoid duplication
                    i2_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")

def main():
    parser = argparse.ArgumentParser(description='Convert BAM to FASTQ')
    parser.add_argument('bam_file', type=str, help='Input BAM file')
    parser.add_argument('fastq_prefix', type=str, help='Prefix for the output FASTQ files')
    parser.add_argument('ub_tag_length', type=int, help='Expected length of the UB BAM tag')

    args = parser.parse_args()
    convert_ss3_ubam_to_fastq(args.bam_file, args.fastq_prefix, args.ub_tag_length)

if __name__ == "__main__":
    main()

Trying to be a bit more flexible with UMI length.

Thank you, thank you for all your help!

biobenkj commented 6 months ago

This works really well and can confirm that it's behaving as expected using the longer UMI sequences via the second optional route you outlined above and regenerating the 4 fastqs using the python script above. Going to close this now. Thank you again!