Implement zUMI for single fastq MARS-seq data

sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs

GNU General Public License v3.0


Implement zUMI for single fastq MARS-seq data #111

Closed kritikakarri closed 5 years ago

kritikakarri commented 5 years ago

I have a MARS-seq dataset that I am trying to preprocess: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108561. The uploaded samples were processed as follows: "Paired-end, but the second read contained only cell and molecule barcodes. This information was appended to the fastq entry header bcl2fastq/2.15.0.4 Sequences with RMT of low quality (defined as RMT with minimum Phred score of less than 27) were filtered out. Pool-barcode and well-barcode-RMT were extracted from the first and second end of the read (respectively) and concatenated to the fastq header, delimited by a underscore i.e. POOL_BARCODE_WELL_BARCODE_RMT while "NNNNNN" was used as a place holders if plate barcode was not used. Reads were separated by POOL_BARCODE_WELL_BARCODE header data, allowing 1 sequencing error. This process created a single fastq file for each source well."

How can I modify the zUMIs parameters in order to process a dataset in this format?

cziegenhain commented 5 years ago

Hi,

I'm assuming both of the issues are the same question, so I closed the other one. Sorry to say that zUMIs does not support cell barcodes in the header. We actively decided against this because there are a number of drawbacks:

a) There is no standard way of delimiting barcodes & UMIs, which makes them hard to extract across the variety of users and methods.
b) The base quality information is usually lost.

I understand that these reads are prefiltered, but you cannot run zUMIs from the file you have. The best solutions would be either to make a "fake" barcode fastq file by parsing the header back into fastq format, or to ask the authors of the paper if they can share their paired-end fastq files.

Good luck, Christoph

kritikakarri commented 5 years ago

Yes, that's what I thought, because the YAML file in zUMIs requires the BC and UMI positions, but in this dataset they are already merged into the read 1 header. Thanks a lot, Christoph.

kritikakarri commented 5 years ago

Christoph, given that I don't have access to the original files from the authors, I have chosen to extract the barcodes from the R1 header and recreate the R2 fastq files. Thankfully, they have also retained the quality scores.

@SRR3928573.1 NB501277:61:HTNKHBGXX:1:11101:11520:1071_0_barcode=NA-E/A-//A//6#-/##/####-AAAC-AACACCN-CNNANNNN/1
CATCCCCGCCGCGCGTCGCGGCGTGGGAAATGTGGCGTACGGAAGACCCACTCCCCGGCGCCGCTCGT
+
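For readers attempting the same workaround, here is a minimal Python sketch of rebuilding a barcode read from such a header. It assumes the layout seen in the single example above, namely that the header ends in three `-`-separated sequence fields (pool barcode, well barcode, UMI) followed by `/1`. That layout is inferred, not documented here. Because `-` is also a valid quality character, the quality fields cannot be split unambiguously from one example, so the sketch writes a constant placeholder quality instead of the retained scores. The function names and `PLACEHOLDER_QUAL` are hypothetical.

```python
import re

# Assumed header layout, inferred from the one example in this thread:
#   ..._barcode=NA-<qualities>-<POOL>-<WELL>-<UMI>/1
# Only the three trailing sequence fields are parsed; the quality fields may
# themselves contain '-', so they are not split here.
HEADER_RE = re.compile(r"barcode=\S*-([ACGTN]+)-([ACGTN]+)-([ACGTN]+)(?:/[12])?$")

PLACEHOLDER_QUAL = "I"  # hypothetical constant quality (Phred 40 in Phred+33)


def header_to_barcode_record(header: str) -> str:
    """Build one 4-line FASTQ record holding POOL+WELL+UMI from a read header."""
    match = HEADER_RE.search(header.strip())
    if match is None:
        raise ValueError(f"no barcode fields found in header: {header!r}")
    barcode_seq = "".join(match.groups())
    read_id = header.split()[0]  # keep the read name so both files stay paired
    return f"{read_id}\n{barcode_seq}\n+\n{PLACEHOLDER_QUAL * len(barcode_seq)}\n"


def write_barcode_fastq(cdna_fastq: str, barcode_fastq: str) -> None:
    """Stream an uncompressed FASTQ and write a matching barcode FASTQ."""
    with open(cdna_fastq) as src, open(barcode_fastq, "w") as dst:
        for line_number, line in enumerate(src):
            if line_number % 4 == 0:  # every fourth line is a record header
                dst.write(header_to_barcode_record(line))
```

With a placeholder quality, any quality-based barcode or UMI filtering downstream has no effect. If the exact quality layout is known, the real scores could be carried over instead.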

Now, if I use zUMIs on a dataset with a pool barcode, a well barcode and a UMI, how do I prepare my YAML file?

sdparekh commented 5 years ago

Hi, this works if your R2 contains the pool barcode, well barcode and UMI (hopefully without "-" in between). Assuming the pool barcode, well barcode and UMI are in consecutive order and are 4+7+8 bases long, you can use the BC tag to cover both the pool and well barcodes, as shown below.

sequence_files:
  file1:
    name: R1cdna.fastq.gz
    base_definition:
      - cDNA(1-50)
  file2:
    name: R2barcodes.fastq.gz
    base_definition:
      - BC(1-11)
      - UMI(12-19)

You can use our shiny app to create the yaml file or use the one given in example data for easier access.
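As a quick check on the base ranges, the 1-based inclusive positions can be derived from the segment lengths. The short Python sketch below uses the 4+7+8 lengths quoted in this thread. The helper name `base_ranges` is hypothetical and is not part of zUMIs.

```python
from itertools import accumulate


def base_ranges(lengths):
    """Return 1-based inclusive (start, end) pairs for consecutive segments."""
    ends = list(accumulate(lengths))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return list(zip(starts, ends))


# Segment lengths quoted in this thread: pool barcode, well barcode, UMI.
pool, well, umi = base_ranges([4, 7, 8])

bc_definition = f"BC({pool[0]}-{well[1]})"  # pool + well treated as one barcode
umi_definition = f"UMI({umi[0]}-{umi[1]})"
print(bc_definition, umi_definition)  # BC(1-11) UMI(12-19)
```

For a 19-base barcode read this gives an 11-base combined barcode followed by an 8-base UMI.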

kritikakarri commented 5 years ago

Thanks a lot for the tip Swati. I think this will work then.