sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 68 forks source link

Beginning zUMIs from unmapped, tagged, corrected BAM out of fqfilter_v2.pl #384

Closed biobenkj closed 11 months ago

biobenkj commented 11 months ago

Hi zUMIs team - thanks for a great tool. I'm working on running some published Smart-seq3xpress data that has been pulled through fqfilter_v2.pl and correct_BCtag.pl. However, I'm not sure how to start from the resulting, merged bam going into zUMIs. If you'd be willing to provide insight into how to get this rolling, I'd be most grateful.

I've tried setting dummy input fastq files and setting the pattern to the hard coded nucleotide sequence for SS3. I can get most of the way though but will fail at the DGE step.

head of the bam:

V300090689L1C001R0020000368/1   77  *   0   0   *   *   0   0   CAACCANCCGGAGCAGGCAAGGCTGGAACGGGAACGAGGGGTCCCCCCACCCCCAGGACCGCCCTCTGAGCCACGCGACTGCGACTTTGCCCAGAAGGGA    :FFE9F!FEF-CF>EDFFDE2>EFCF=FEFFC>>8FAFEB?E)BED2FF@6F>FA)/'/::BEFDC;;=FAC;,65E8DD6.)D?*FEB@F4CF:B?8?;BX:Z:TACGTAATCGCGGATTCCTG   BC:Z:TACGTAATCGCGGATTCCTG   UB:Z:   QB:Z:EFEA@FDFFFFFDEFFFFFE   QU:Z:
V300090689L1C001R0020000368/1   141 *   0   0   *   *   0   0   GCCTCCAGCTCTCTTCACAAGAGCGCTTGTCAGAGTCATCCCCGCTGTCCGAGTCTGCATACAGTCGCTGCTGCTCCAGCAGAAGATCGGCTTCTTCCTT    EFFFFFFEEFFFDFF9CFFED8ED1CF;FDFFF;FFFFFCC@FAFFFDF>DDFAFFF-FDFEE8@FAECFFEFEFFFEBF7BCFC0?FD43FF=@FAEEFBX:Z:TACGTAATCGCGGATTCCTG   BC:Z:TACGTAATCGCGGATTCCTG   UB:Z:   QB:Z:EFEA@FDFFFFFDEFFFFFE   QU:Z:
V300090689L1C001R0020000489/1   77  *   0   0   *   *   0   0   ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTTATCACCTTTCATGATC    )FF7FGF@DFFFFFFFCEF>FFFFF:CFFFF6DCF9F5AFCGFE:F6FG=ED7@F5GECF:FGEFFDG8DFFF<?F    BX:Z:TTGGTTCCAATTCTGTCTGA   BC:Z:TTGGTTCCAATTCTGTCTGA   UB:Z:CGTTAGATTT QB:Z:BEFEEEFFFFDFEFE@?FE+   QU:Z:FG<:FFBFFF
V300090689L1C001R0020000489/1   141 *   0   0   *   *   0   0   GTACTACTCGATTGTCAACGTCAAGGAGTCGCAGGTCGCCTGGGTCTAGGAATAATGGGGGAAGTATGTAGGAGTTGAAGATTAGGCCGCCGGAGTCGGG    BE@:D;DC8B=D9FC<9=(>;BE7DA<EA@BC7C@*C56?CB:)AB=FA87F=F>>=?(F97BE1F&DEAED<D;AEDB</:-?<)<C;+>D'B;4CC?)BX:Z:TTGGTTCCAATTCTGTCTGA   BC:Z:TTGGTTCCAATTCTGTCTGA   UB:Z:CGTTAGATTT QB:Z:BEFEEEFFFFDFEFE@?FE+   QU:Z:FG<:FFBFFF
V300090689L1C001R0020000928/1   77  *   0   0   *   *   0   0   CCTCATNTCTCAAGATGCTGGACGAGATGAAGGACAACCTGCTCCTGGACATGTATCTGGCCCCCCATGGCTGGACCCTGTACACCCACATTCGCAACCG    F94FEF!A@EFF8E>DB=FFFDFFEF+FC/)FB7FF7EEE5FCFFEFA5@FDFEDEGDFDEFFF=4EE57F'FDEEEFDF:8FFEF7*+8DFFFA(FAFEBX:Z:CACCAAGGACCGGATTCCTG   BC:Z:CACCAAGGACCGGATTCCTG   UB:Z:   QB:Z:FFFFFFGFB>FEFBFFFFFD   QU:Z:
V300090689L1C001R0020000928/1   141 *   0   0   *   *   0   0   GGCTGTTGGCTGGAGTCAGCTCCCCCTGGCTCCCTTCTCTGGGCGGGGACTTGCTGTGTGGGTCCACGCGGGCACTGATCAGCCCCTCCAGGATTAGCTG    FEEFFEDFFEEFB6EDF:F<AFFFFA;FFD=EE7?DFEF>A<<F7F=F=FCEFEEFFF<F.F&/BFF'9D,FFF2;FAE7DFEEFFD<DCEE15A<FFE<BX:Z:CACCAAGGACCGGATTCCTG   BC:Z:CACCAAGGACCGGATTCCTG   UB:Z:   QB:Z:FFFFFFGFB>FEFBFFFFFD   QU:Z:
V300090689L1C001R0020000976/1   77  *   0   0   *   *   0   0   GCTTTTGCAGGTGGCGGCGAACGCGGAGAGCACGCCATGAAGGCCTCGGGCACGCTACGAGAGTACAAGGTAGTGG    FDFGDFFFEF6=FF@DEFF9CEEDFF<F/FFFFFDFFFF8EFFDFF97F?EDAFEF7FDF4C6+@@DCG4DAF1>?    BX:Z:TCTCTAGACTAGTCCTCGTC   BC:Z:TCTCTAGACTAGTCCTCGTC   UB:Z:TCGTGCGTAT QB:Z:.;E@C(C398@A>9DE<;CA   QU:Z:GEAFFFFFFF
V300090689L1C001R0020000976/1   141 *   0   0   *   *   0   0   TGTTGTGGGTGCCGCTCCGGGAGTCATAGCGCAGCCAGATCCCGAAGATCTTCACCCGCAGGGGGGACTTCTCAAACACCTCCCCACAGTAGACAATCTC    ECD?F>C9E:@0CD@CA6:-D-C>7>6<=)<8:F7<D9-A;A4/>.F%=';4*4BDBF>.;E&=?D>8'E3;B)C>?7=<D+2E>@CBC.,B*0,70'7>BX:Z:TCTCTAGACTAGTCCTCGTC   BC:Z:TCTCTAGACTAGTCCTCGTC   UB:Z:TCGTGCGTAT QB:Z:.;E@C(C398@A>9DE<;CA   QU:Z:GEAFFFFFFF
V300090689L1C001R0020000997/1   77  *   0   0   *   *   0   0   GGATATGATAGTCTGCGCAGCGTATGCACACGAACTGCCAAAATATGGTGTGAAGGTTGGCCTGACAAATTATGCT    C?FFGFFFF@FFDEF7EFFEEF:FFFEFFFFF6F>F4F@FBFFFGD?FGFD6FFF9DFFEEFFF7*FEFFDG?;FF    BX:Z:TTGGTCAGTTCTCTGACGAA   BC:Z:TTGGTCAGTTCTCTGACGAA   UB:Z:CTGGGAGATA QB:Z:FFFFEFFFBFFFFFD@BF-E   QU:Z:BFB>E7EEFB
V300090689L1C001R0020000997/1   141 *   0   0   *   *   0   0   CCCATGATGTGCTTCCGATGTACTTCTGCATTAAATTCCTTGCTTTCAGAATCATAACCAGGGAATCGTTTGGTACTGTGAGGGATAGACAAGCCTCCAT    EFAFDA<FECFDEFDFFBEBEFEFEFFCFD9<ACEFFFEEF>ECFFFE@D7FFFFFF<=<29DE?F?F?EDCFEDFFC4F3BEE3CFDE@EFFE;F3EEBBX:Z:TTGGTCAGTTCTCTGACGAA   BC:Z:TTGGTCAGTTCTCTGACGAA   UB:Z:CTGGGAGATA QB:Z:FFFFEFFFBFFFFFD@BF-E   QU:Z:BFB>E7EEFB

zUMIs.yaml

###########################################
#Welcome to zUMIs
#below, please fill the mandatory inputs
#We expect full paths for all files.
###########################################

#define a project name that will be used to name output files
project: HEK.Smartseq3xpress

#Sequencing File Inputs:
#For each input file, make one list object & define path and barcode ranges
#base definition vocabulary: BC(n) UMI(n) cDNA(n).
#Barcode range definition needs to account for all ranges. You can give several comma-separated ranges for BC & UMI sequences, eg. BC(1-6,20-26)
#you can specify between 1 and 4 input files
sequence_files:
  file1:
    name: fake_r1.fastq #path to first file
    base_definition:
      - BC(1-6)
      - UMI(7-16)
    find_pattern: ATTGCGCAATG
  file2:
    name: fake_r2.fastq #path to second file
    base_definition:
      - cDNA(1-50)

#reference genome setup
reference:
  STAR_index: /varidata/research/projects/shen/projects/2023_10_23_SS3/analysis/star_ref #path to STAR genome index
  GTF_file: /varidata/research/projects/shen/projects/2023_10_23_SS3/analysis/star_ref/Homo_sapiens.GRCh38.84.gtf #path to gene annotation file in GTF format
  exon_extension: no #extend exons by a certain width?
  extension_length: 0 #number of bp to extend exons by
  scaffold_length_min: 0 #minimal scaffold/chromosome length to consider (0 = all)
  additional_files: /varidata/research/projects/shen/projects/2023_10_23_SS3/analysis/star_ref/molecularSpikes_complexset_5p.fa #Optional parameter. It is possible to give additional reference sequences here, eg ERCC.fa
  additional_STAR_params: #Optional parameter. you may add custom mapping parameters to STAR here

#output directory
out_dir: /varidata/research/projects/shen/projects/2023_10_23_SS3/analysis/zumis_hek293t #specify the full path to the output directory

###########################################
#below, you may optionally change default parameters
###########################################

#number of processors to use
num_threads: 40
mem_limit: null #Memory limit in Gigabytes, null meaning unlimited RAM usage.

#barcode & UMI filtering options
#number of bases under the base quality cutoff that should be filtered out.
#Phred score base-cutoff for quality control.
filter_cutoffs:
  BC_filter:
    num_bases: 4
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20

#Options for Barcode handling
#You can give either number of top barcodes to use or give an annotation of cell barcodes.
#If you leave both barcode_num and barcode_file empty, zUMIs will perform automatic cell barcode selection for you!
barcodes:
  barcode_num: null
  barcode_file: HEK_expected_barcodes_offcial_Smartseq3xpress.txt
  barcode_sharing: null #Optional for combining several barcode sequences per cell (see github wiki)
  automatic: no #Give yes/no to this option. If the cell barcodes should be detected automatically. If the barcode file is given in combination with automatic barcode detection, the list of given barcodes will be used as whitelist.
  BarcodeBinning: 1 #Hamming distance binning of close cell barcode sequences.
  nReadsperCell: 20000 #Keep only the cell barcodes with atleast n number of reads.
  demultiplex: no #produce per-cell demultiplexed bam files.

#Options related to counting of reads towards expression profiles
counting_opts:
  introns: yes #can be set to no for exon-only counting.
  intronProb: no #perform an estimation of how likely intronic reads are to be derived from mRNA by comparing to intergenic counts.
  downsampling: 100000 #Number of reads to downsample to. This value can be a fixed number of reads (e.g. 10000) or a desired range (e.g. 10000-20000) Barcodes with less than <d> will not be reported. 0 means adaptive downsampling. Default: 0.
  strand: 1 #Is the library stranded? 0 = unstranded, 1 = positively stranded, 2 = negatively stranded
  Ham_Dist: 2 #Hamming distance collapsing of UMI sequences.
  velocyto: no #Would you like velocyto to do counting of intron-exon spanning reads
  primaryHit: yes #Do you want to count the primary Hits of multimapping reads towards gene expression levels?
  multi_overlap: no #Do you want to assign reads overlapping to multiple features?
  fraction_overlap: 0 #minimum required fraction of the read overlapping with the gene for read assignment to genes
  twoPass: no #perform basic STAR twoPass mapping

#produce stats files and plots?
make_stats: yes

#Start zUMIs from stage. Possible TEXT(Filtering, Mapping, Counting, Summarising). Default: Filtering.
which_Stage: Mapping

#define dependencies program paths
samtools_exec: samtools #samtools executable
Rscript_exec: Rscript #Rscript executable
STAR_exec: STAR #STAR executable
pigz_exec: pigz #pigz executable
zUMIs_directory: /varidata/research/projects/shen/tools/benkj/zUMIs
#below, fqfilter will add a read_layout flag defining SE or PE
read_layout: PE

Thanks so much!

cziegenhain commented 11 months ago

Hi,

Thanks for the message. I am not sure if I understand the question fully, do you mean you have run the fqfilter and correct_BC steps manually? If yes, I would recommend to run the whole pipeline from start ("which_stage: Filtering"), otherwise from the pipelines own output "which_stage: Mapping" should do the trick.

For Smartseq3xpress, the read layout looks odd in your YAML file, read1 will contain UMI at bases 12-21 and cell barcodes ("BC") are appended to read2 in our data generated on MGI sequencers have a check here for an example YAML: https://www.protocols.io/view/smart-seq3xpress-yxmvmk1yng3p/v2?step=16

All the best, Christoph

biobenkj commented 11 months ago

Hi @cziegenhain,

Thanks for the reply. The unmapped bam is how the SS3xpress data are deposited, though zUMIs expects fastqs. I'm trying to short circuit the pipeline by beginning at mapping stage by passing the unmapped bam, however there appear to be some files generated during the filtering stage that are needed later (e.g. *kept_barcodes.txt). Thus, I'm passing some "dummy fastq" files so that zUMIs can complete the checks for the existence of the fastqs, adding in the pattern so the pipeline will expect these to be SS3 type data, and renaming the expected barcodes file to match what zUMIs is expecting.

Can zUMIs restart filtering from an unmapped bam?

Any additional insight you have would be great, thank you!

cziegenhain commented 11 months ago

Aha, I see! Yes you will also need the barcode stats that are collected during regular first steps of zUMIs which may be cumbersome to recreate. For the Smart-seq3xpress paper data submission, we originally also submitted fastq files in the repository which will be much easier than the ArrayExpress style unmapped bam which also has different tags for BCs and UMIs than zUMIs likes! However they show as "unavailable" right now so I am not sure what happened there. If you email me I can see what I can dig up for you in terms of fastq, again that will be much smoother for you to reprocess.

biobenkj commented 11 months ago

Email sent :)

Thanks again!

biobenkj commented 11 months ago

Hi @cziegenhain - I was able to (I think) reverse engineer the uBAM back to the original 4 fastq files with the following code:

#!/usr/bin/env python3

import pysam
import argparse
import gzip

def convert_ss3_ubam_to_fastq(bam_file, fastq_prefix, ub_tag_length):
    # Set the check_sq to False as this is an unmapped bam
    with pysam.AlignmentFile(bam_file, "rb", check_sq=False) as bam:
        with gzip.open(f"{fastq_prefix}_I1.fastq.gz", "wt") as i1_file, \
             gzip.open(f"{fastq_prefix}_I2.fastq.gz", "wt") as i2_file, \
             gzip.open(f"{fastq_prefix}_R1.fastq.gz", "wt") as r1_file, \
             gzip.open(f"{fastq_prefix}_R2.fastq.gz", "wt") as r2_file:

            for read in bam.fetch(until_eof=True):
                # Extract tags
                bc = read.get_tag("BC")
                qb = read.get_tag("QB")
                ub = read.get_tag("UB")
                qu = read.get_tag("QU")

                # Determine read name
                read_name = f"@{read.query_name}"

                # Process R1 and R2 FASTQ files
                if read.is_read1:
                    seq, qual = read.query_sequence, read.qual
                    # Check UB tag length and prepend sequence if necessary
                    if len(ub) == ub_tag_length:
                        seq = "ATTGCGCAATG" + ub + "GGG" + seq
                        qual = "IIIIIIIIIII" + qu + "III" + qual
                    r1_file.write(f"{read_name}\n{seq}\n+\n{qual}\n")
                    # Write barcode info only for first read in pair
                    i1_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")
                elif read.is_read2:
                    r2_file.write(f"{read_name}\n{read.query_sequence}\n+\n{read.qual}\n")
                    # Only write to I2 for read2 to avoid duplication
                    i2_file.write(f"{read_name}\n{bc}\n+\n{qb}\n")

def main():
    parser = argparse.ArgumentParser(description='Convert BAM to FASTQ')
    parser.add_argument('bam_file', type=str, help='Input BAM file')
    parser.add_argument('fastq_prefix', type=str, help='Prefix for the output FASTQ files')
    parser.add_argument('ub_tag_length', type=int, help='Expected length of the UB BAM tag')

    args = parser.parse_args()
    convert_ss3_ubam_to_fastq(args.bam_file, args.fastq_prefix, args.ub_tag_length)

if __name__ == "__main__":
    main()

If it's correct, I might just restart zUMIs with these FASTQ files. Please do correct me if I'm wrong:

I1.fastq.gz and I2.fastq.gz both will contain the cell barcode (in this case it's already error corrected) R1.fastq.gz contains the first read in pair - if there's a UMI, add the specific handle ATTGCGCAATG and then the UMI, followed by the cDNA. R2.fastq.gz contains the second read in pair

cziegenhain commented 11 months ago

Hej Ben, That actually sounds like a great solution!! I dont see any issues with that strategy.

biobenkj commented 11 months ago

Great! I appreciate all your help and will close this for now. Thanks again!