Hi,
Thanks for the message. I am not sure I fully understand the question: do you mean you have run the fqfilter and correct_BC steps manually? If so, I would recommend running the whole pipeline from the start ("which_stage: Filtering"). Otherwise, restarting from the pipeline's own output with "which_stage: Mapping" should do the trick.
For Smart-seq3xpress, the read layout in your YAML file looks odd: read1 contains the UMI at bases 12-21, and in our data generated on MGI sequencers the cell barcodes ("BC") are appended to read2. Have a look here for an example YAML: https://www.protocols.io/view/smart-seq3xpress-yxmvmk1yng3p/v2?step=16
All the best, Christoph
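To make the read1 layout concrete, here is a minimal Python sketch of how a UMI-containing read1 could be split into UMI and cDNA. It assumes the 11-base tag ATTGCGCAATG and the GGG that appear in the conversion script later in this thread, plus a UMI at bases 12-21 as described above. The function name `split_ss3_read1` and the constants are illustrative only and are not part of zUMIs.

```python
TAG = "ATTGCGCAATG"          # 11-base tag, as used in the script in this thread
UMI_START, UMI_END = 11, 21  # 0-based slice covering 1-based bases 12-21
LINKER_LEN = 3               # assumption: the GGG that follows the UMI


def split_ss3_read1(seq):
    """Return (umi, cdna) for a UMI read, or (None, seq) for an internal read."""
    # Reads that do not start with the tag are treated as internal reads.
    if not seq.startswith(TAG) or len(seq) < UMI_END:
        return None, seq
    umi = seq[UMI_START:UMI_END]
    cdna = seq[UMI_END + LINKER_LEN:]
    return umi, cdna
```

Exact matching on the tag is a simplification; a real pipeline may tolerate mismatches in the tag.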
Hi @cziegenhain,
Thanks for the reply. The unmapped bam is how the SS3xpress data are deposited, though zUMIs expects fastqs. I'm trying to short-circuit the pipeline by starting at the mapping stage and passing the unmapped bam. However, some files generated during the filtering stage appear to be needed later (e.g. *kept_barcodes.txt). So I'm passing some "dummy fastq" files so that zUMIs can complete its checks for the existence of the fastqs, adding in the pattern so the pipeline will expect SS3-type data, and renaming the expected barcodes file to match what zUMIs is expecting.
Can zUMIs restart filtering from an unmapped bam?
Any additional insight you have would be great, thank you!
Aha, I see! Yes, you will also need the barcode stats that are collected during the regular first steps of zUMIs, which may be cumbersome to recreate. For the Smart-seq3xpress paper data submission, we originally also submitted fastq files to the repository, which will be much easier to work with than the ArrayExpress-style unmapped bam, which also has different tags for BCs and UMIs than zUMIs likes! However, they show as "unavailable" right now, so I am not sure what happened there. If you email me, I can see what I can dig up for you in terms of fastq; again, that will be much smoother for you to reprocess.
Email sent :)
Thanks again!
Hi @cziegenhain - I was able to (I think) reverse engineer the uBAM back to the original 4 fastq files with the following code:
#!/usr/bin/env python3
import pysam
import argparse
import gzip


def convert_ss3_ubam_to_fastq(bam_file, fastq_prefix, ub_tag_length):
    # Set the check_sq to False as this is an unmapped bam
    with pysam.AlignmentFile(bam_file, "rb", check_sq=False) as bam:
        with gzip.open(f"{fastq_prefix}_I1.fastq.gz", "wt") as i1_file, \
             gzip.open(f"{fastq_prefix}_I2.fastq.gz", "wt") as i2_file, \
             gzip.open(f"{fastq_prefix}_R1.fastq.gz", "wt") as r1_file, \
             gzip.open(f"{fastq_prefix}_R2.fastq.gz", "wt") as r2_file:
            for read in bam.fetch(until_eof=True):
                # Extract tags
                bc = read.get_tag("BC")
                qb = read.get_tag("QB")
                ub = read.get_tag("UB")
                qu = read.get_tag("QU")

                # Determine read name
                read_name = f"@{read.query_name}"

                # Process R1 and R2 FASTQ files
                if read.is_read1:
                    seq, qual = read.query_sequence, read.qual
                    # Check UB tag length and prepend sequence if necessary
                    if len(ub) == ub_tag_length:
                        seq = "ATTGCGCAATG" + ub + "GGG" + seq
                        qual = "IIIIIIIIIII" + qu + "III" + qual
                    r1_file.write(f"{read_name}\n{seq}\n+\n{qual}\n")
                    # Write barcode info only for first read in pair
                    i1_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")
                elif read.is_read2:
                    r2_file.write(f"{read_name}\n{read.query_sequence}\n+\n{read.qual}\n")
                    # Only write to I2 for read2 to avoid duplication
                    i2_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")


def main():
    parser = argparse.ArgumentParser(description='Convert BAM to FASTQ')
    parser.add_argument('bam_file', type=str, help='Input BAM file')
    parser.add_argument('fastq_prefix', type=str, help='Prefix for the output FASTQ files')
    parser.add_argument('ub_tag_length', type=int, help='Expected length of the UB BAM tag')
    args = parser.parse_args()
    convert_ss3_ubam_to_fastq(args.bam_file, args.fastq_prefix, args.ub_tag_length)


if __name__ == "__main__":
    main()
If it's correct, I might just restart zUMIs with these FASTQ files. Please do correct me if I'm wrong:
I1.fastq.gz and I2.fastq.gz both will contain the cell barcode (in this case it's already error corrected)
R1.fastq.gz contains the first read in pair - if there's a UMI, add the specific handle ATTGCGCAATG, then the UMI and GGG, followed by the cDNA.
R2.fastq.gz contains the second read in pair
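Before restarting the pipeline, it may be worth checking that the four reconstructed files stay in sync, i.e. that they hold the same number of records with the same read names in the same order. The following is a minimal sketch using only the standard library; the helper names `fastq_names` and `first_mismatch` are hypothetical, and it assumes well-formed, gzip-compressed, 4-line FASTQ records.

```python
import gzip
from itertools import zip_longest


def fastq_names(path):
    """Yield the read name (without '@' or any trailing comment) of each record."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of a 4-line FASTQ record
                yield line[1:].split()[0]


def first_mismatch(paths):
    """Return the index of the first record whose names differ, or None if all agree."""
    streams = [fastq_names(p) for p in paths]
    # zip_longest pads shorter files with None, so length differences are caught too.
    for idx, names in enumerate(zip_longest(*streams)):
        if len(set(names)) != 1:
            return idx
    return None
```

For example, `first_mismatch([f"{prefix}_{s}.fastq.gz" for s in ("I1", "I2", "R1", "R2")])` should return None if the files are consistent, where `prefix` is whatever was passed as `fastq_prefix` to the script above.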
Hej Ben, that actually sounds like a great solution!! I don't see any issues with that strategy.
Great! I appreciate all your help and will close this for now. Thanks again!
Hi zUMIs team - thanks for a great tool. I'm working on running some published Smart-seq3xpress data that has been pulled through fqfilter_v2.pl and correct_BCtag.pl. However, I'm not sure how to start from the resulting, merged bam going into zUMIs. If you'd be willing to provide insight into how to get this rolling, I'd be most grateful.
I've tried setting dummy input fastq files and setting the pattern to the hard-coded nucleotide sequence for SS3. I can get most of the way through, but it fails at the DGE step.
head of the bam:
zUMIs.yaml:
Thanks so much!