Closed biobenkj closed 6 months ago
Yes -- you can certainly restructure the input so that it fits kb-python (because nothing from the input FASTQ was actually lost aside from some error correction). It is not a trivial task but it is 100% doable with writing a series of awk scripts (chatgpt can definitely help with that).
Sounds like a plan. Now for a really silly question - not every read in SS3 has a UMI read and it looks like the expected input for kb count is 2 sets of fastqs as you've outlined in https://github.com/pachterlab/kallisto/issues/297#issuecomment-1400763372:
Read 1 = UMI (and maybe cell barcode?) + cDNA Read 2 = cDNA
SMARTSEQ3 Smart-seq3 0,11,19 0,11,None 1,None,None
I might need a little guidance on how to reconstruct these - specifically, what goes in each of the expected 4 fastqs (or more specifically the index 1 and index 2 files that kb handles internally). I recognize this is beyond what an issue for kb looks like and no worries if this isn't something you'd like to do. Thanks!
For SS3, kb count takes in 4 FASTQs: The first two being the barcodes and the second two being the paired-end reads.
You should be put the barcodes (the BC:Z: from the first and second read pair) into the first two FASTQ files and the sequences (i.e. the actual reads themselves from the first and second read pair) into the last two FASTQ files.
The third FASTQ file should then be modified such that for each UMI read, an 11-bp constant tag (ATTGCGCAATG) is placed at the beginning of the read followed by the 8-bp UMI (I assume the UMI is the sequence in UB:Z:? You should probably check to make sure) followed by a 3-bp constant sequence (GGG).
The tricky part is that your smart-seq3 protocol looks different than the original one (i.e. there is a 10-bp UMI rather than an 8-bp one). So maybe you'll just have to use the first 8-bp's of UB:Z:.
If you want to use all 10 bp's of UB:Z:, you'll have to do custom processing:
You can simply separate the non-UMI reads and the UMI reads into separate FASTQ files and process the non-UMI reads using kb count -x smartseq2 --parity=paired (as if it's smartseq2 data) by, again, putting the BC:Z: into the first two files and the actual read pairs themselves into the last two FASTQ files. For the UMI reads, you can put the BC:Z: into the first two FASTQ files, create a third file that contains the UMI, then use the fourth and fifth files to contain the actual paired-end reads. You then specify kb count -x 0,0,0,1,0,0:2,0,10:3,0,0,4,0,0 --parity=paired --strand=forward and that should work.
An alternative option: You can reformat the reads (using 10-bp UMI rather than 8-bp UMI) into smart-seq3 format [like I described previously] but work outside the framework of kb count (using kallisto and bustools as standalone programs); such as using kallisto bus -x 0,0,0,1,0,0:2,0,21:2,24,0,3,0,0 --tag=ATTGCGCAATG --paired --fr-stranded ... which will then get you your BUS file that you can process just as kb count would (you can view the commands that kb count runs on a typical smart-seq3 run by using the --verbose option when processing a standard smart-seq3 dataset in kb count).
Anyway, it's a bit complicated but it will work (no guarantees that the exact instructions above will work since I didn't double check whether I made typos or errors).
Thank you for the multiple options and a very clear explanation of things! Yes, this is Smart-seq3xpress instead of smart-seq3.
So far, I think I'm probably going to go with alternative option 2 and work outside the kb framework. But for anyone who has a similar question or if it's useful to you, I'm drafting some python code to do the conversion and will modify as it crystallizes:
#!/usr/bin/env python3
import pysam
import argparse
import gzip
def convert_ss3_ubam_to_fastq(bam_file, fastq_prefix, ub_tag_length):
# Set the check_sq to False as this is an unmapped bam
with pysam.AlignmentFile(bam_file, "rb", check_sq=False) as bam:
with gzip.open(f"{fastq_prefix}_I1.fastq.gz", "wt") as i1_file, \
gzip.open(f"{fastq_prefix}_I2.fastq.gz", "wt") as i2_file, \
gzip.open(f"{fastq_prefix}_R1.fastq.gz", "wt") as r1_file, \
gzip.open(f"{fastq_prefix}_R2.fastq.gz", "wt") as r2_file:
for read in bam.fetch(until_eof=True):
# Extract tags
bc = read.get_tag("BC")
qb = read.get_tag("QB")
ub = read.get_tag("UB")
qu = read.get_tag("QU")
# Determine read name
read_name = f"@{read.query_name}"
# Process R1 and R2 FASTQ files
if read.is_read1:
seq, qual = read.query_sequence, read.qual
# Check UB tag length and prepend sequence if necessary
if len(ub) == ub_tag_length:
seq = "ATTGCGCAATG" + ub + "GGG" + seq
qual = "IIIIIIIIIII" + qu + "III" + qual
r1_file.write(f"{read_name}\n{seq}\n+\n{qual}\n")
# Write barcode info only for first read in pair
i1_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")
elif read.is_read2:
r2_file.write(f"{read_name}\n{read.query_sequence}\n+\n{read.qual}\n")
# Only write to I2 for read2 to avoid duplication
i2_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")
def main():
parser = argparse.ArgumentParser(description='Convert BAM to FASTQ')
parser.add_argument('bam_file', type=str, help='Input BAM file')
parser.add_argument('fastq_prefix', type=str, help='Prefix for the output FASTQ files')
parser.add_argument('ub_tag_length', type=int, help='Expected length of the UB BAM tag')
args = parser.parse_args()
convert_ss3_ubam_to_fastq(args.bam_file, args.fastq_prefix, args.ub_tag_length)
if __name__ == "__main__":
main()
Trying to be a bit more flexible with UMI length.
Thank you, thank you for all your help!
This works really well and can confirm that it's behaving as expected using the longer UMI sequences via the second optional route you outlined above and regenerating the 4 fastqs using the python script above. Going to close this now. Thank you again!
Describe the issue After pulling a SMART-seq3 unmapped bam, which is effectively after the initial demultiplexing step out of zUMIs (https://github.com/sdparekh/zUMIs/wiki/Protocol-specific-setup) containing error corrected cell barcodes and both read types (UMI containing and internal) as noted by the UB tag for UMI containing reads and BC tag is error corrected, I'm wondering if there is a way to reconstruct the expected SS3 fastqs for input into kb python?
Example lines from unmapped BAM