sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

sci-RNA-seq with no separate FASTQs for index reads #265

juicejulia closed this issue 3 years ago

juicejulia commented 3 years ago

Hi there,

Thank you for creating the wonderful zUMIs tool. I want to try this tool on the sci-RNA-seq data. We only have R1 and R2 fastq files and no separate index read files. Instead, the i5, i7 barcodes are embedded in the R1 and R2 header line. Below is an example:

@K00124:663:HG7T2BBXX:7:1104:25733:17307 1:N:0:TAACTTGG+GTCGTGAA
ATTCGCCTTGGATCTGAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCAGCTTTTAGGAAATTTATTTTCCTTCCATTTTTTTTTCCTTTGCTCAGGCACCTGCCCAGCAGCCCAGGACCCCTCAGGGGTGGGTCCCACCCCCCTAGG
+
AAFFFJJJJJJJFJJJFJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJ<-<7AJJJ-7--<--JJ-7FFJ---<-A----FJJJJ---7FJ-<-7<-AA-7<-77-77<--7---7-A--7-7-7FF-7A--A7--7--AAJ7----7-
@K00124:663:HG7T2BBXX:7:1104:8613:17324 1:N:0:TAACTTGG+GTCGTGAA
TGTTGTTTTTAACCGCGCTTTTTTTTTTTTTTTTTTTTTTTTTTAGCGAAGATTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAACTTGGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAACAATAAGAACGTACAACTTA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ--<<F7F<7FJFJ-<-7<FJ-A7<A<-A-7-J-7AAJJ-AA7F<FJFJ<JJ-77<AF77A-A-A<<JA7FJJ-A-<7JF<FJJJJA--7------------------
@K00124:663:HG7T2BBXX:7:1104:17848:17324 1:N:0:TAACTTGG+GTCGTGAA
GGAGCTTCGCACTAGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGTTGGCTTTCTCTTTTTCACATCTCCCGGCCCCCGAGACTAAATTTTGTATCTCGTTTTGCGCCTTTTTCCTGAAAAAACAGGACAAATGAGTGAA
+
AAFFFJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFA-F---7-A-F<--77--<-<------7<-----7AJ<A-7--7-----A------7<----<--7----7-7---7F-7FJFJ------------<7---

In this case, could I still use the zUMIs tools for processing? Thank you!

cziegenhain commented 3 years ago

Hi,

Not directly, no. zUMIs does not support barcodes embedded in the read header line, because that format loses valuable barcode quality information. However, it should be possible to write a short awk command that generates a fastq file from the header with arbitrarily set quality scores.

Best, C
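The suggestion above can be sketched as follows, in Python rather than awk. This is only an illustration under stated assumptions, not part of zUMIs: the function names `header_barcodes` and `write_barcode_fastq` are hypothetical, the input is assumed to be a plain four-line-per-record FASTQ, and the header is assumed to end in an Illumina-style field such as `1:N:0:TAACTTGG+GTCGTGAA`.

```python
import io


def header_barcodes(header):
    """Return the index barcodes found at the end of a FASTQ header.

    Assumes an Illumina-style comment such as '1:N:0:TAACTTGG+GTCGTGAA',
    where the last colon-separated field holds the barcodes joined by '+'.
    """
    comment = header.rstrip().split(" ")[-1]
    index_field = comment.split(":")[-1]
    return index_field.split("+")


def write_barcode_fastq(reads_in, barcodes_out, qual_char="I"):
    """Write one barcode-only FASTQ record per input record.

    Every base gets the same placeholder quality character, because the
    real barcode qualities are not available in the header.
    Returns the number of records written.
    """
    n_records = 0
    while True:
        header = reads_in.readline()
        if not header:
            break
        # Skip the sequence, '+' and quality lines of the input record.
        reads_in.readline()
        reads_in.readline()
        reads_in.readline()
        barcode = "".join(header_barcodes(header))
        barcodes_out.write(header.rstrip() + "\n")
        barcodes_out.write(barcode + "\n")
        barcodes_out.write("+\n")
        barcodes_out.write(qual_char * len(barcode) + "\n")
        n_records += 1
    return n_records


if __name__ == "__main__":
    # Shortened, made-up read body; the header format follows the thread.
    demo = io.StringIO(
        "@K00124:663:HG7T2BBXX:7:1104:25733:17307 1:N:0:TAACTTGG+GTCGTGAA\n"
        "ATTCGCC\n+\nAAFFFJJ\n"
    )
    out = io.StringIO()
    write_barcode_fastq(demo, out)
    print(out.getvalue(), end="")
```

The output records keep the original header, so the barcode file stays in the same read order as R1 and R2.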

juicejulia commented 3 years ago

Thank you for your fast response! So in this case, I should give the barcodes high quality scores so that they won't be filtered out? I am also wondering, from your experience, how much does barcode quality vary between different datasets? And how much impact does barcode quality have on downstream processing?

cziegenhain commented 3 years ago

Yes, generate them with high quality so that the filtering simply doesn't apply (e.g. just Phred 40 = I)! If the barcodes were sufficiently diverse and cover all color signals well, it shouldn't matter all that much. In my experience, I have seen everything from horrible to super high quality, depending on the dataset and sequencing run ;-)
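To make the "Phred 40 = I" shorthand concrete, here is a small sketch of the Phred+33 encoding used in current Illumina FASTQ files, where a quality score Q is stored as the ASCII character with code Q + 33. The helper names are made up for this illustration.

```python
def phred_to_char(q, offset=33):
    """Encode a Phred quality score as a FASTQ quality character (Phred+33)."""
    return chr(q + offset)


def char_to_phred(c, offset=33):
    """Decode a FASTQ quality character back to its Phred score."""
    return ord(c) - offset


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    print(phred_to_char(40))      # the placeholder character suggested above
    print(char_to_phred("#"))     # '#' marks the N base calls in the example reads
    print(error_probability(40))
```

Under this encoding, Phred 40 corresponds to a nominal error rate of 1 in 10,000, well above typical quality filter thresholds.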

juicejulia commented 3 years ago

Great! I will give it a try. Thank you!

juicejulia commented 3 years ago

Hello @cziegenhain, I tried to follow your suggestion by appending the barcodes from the header to the tail end of each R1 read and padding the quality line with matching pseudo-quality scores. Below is an example of my edited R1 file:

@K00124:663:HG7T2BBXX:7:1101:4097:1415 1:N:0:NTCTACGG+NGATCTCG
NTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCNTACAGGTNGCGTTNACCCTAANTTTGNTCNTNGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCNTCTACGGNGATCTCG
+
AAF-F-FJJJJJJFFJFJJAJJJJJJJJFF<JJ7JJJJJJJJJJJ-J7JJJJJJJJJ<FJJJJFJJJJJJ#FJJJJFF#JFJJJ#FJJJJJJ#JJJJ#JF#J#JJJJJJFJFAJJF7JJF-7AJFJJFJJFFJFF--FJJJJFFFJJJ-IIIIIIIIIIIIIIII
@K00124:663:HG7T2BBXX:7:1101:4746:1415 1:N:0:NAACTTGG+NAGAAGCC
NGCGTGTTTGGATCTGAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTGGCATTGGACTTTTCTTNTTANACATTTCNGAGCCNCCGGGGCNTACTNGGNTNTTCCGTTTGCCGTTTTTTTCTTTAAAAAAAAAATTTTTTTTTTTAANAACTTGGNAGAAGCC
+
AAF<AAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-<AF--FJ-F<<7-7---7#---#------7#---7J#-77-7--#-<-7#-7#-#7----77<-<A---7<-<-<7F--<JJJJJFJ--<AAJJA------IIIIIIIIIIIIIIII
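For reference, the kind of edit described above could look roughly like this in Python. It is a sketch under assumptions, not the script actually used for the file shown: `append_header_barcodes` is a hypothetical helper, and the header is assumed to end in a field of the form `...:BARCODE1+BARCODE2`.

```python
def append_header_barcodes(header, seq, qual, qual_char="I"):
    """Append the header barcodes to a read and pad its quality string.

    The barcodes are taken from the last colon-separated field of the
    header comment, the '+' separator is dropped, and each appended base
    receives the placeholder quality character.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    index_field = header.rstrip().split(" ")[-1].split(":")[-1]
    barcode = index_field.replace("+", "")
    return header, seq + barcode, qual + qual_char * len(barcode)


if __name__ == "__main__":
    # Shortened read body for illustration; the header is from the thread.
    header, seq, qual = append_header_barcodes(
        "@K00124:663:HG7T2BBXX:7:1101:4097:1415 1:N:0:NTCTACGG+NGATCTCG",
        "NTGTGTGCC",
        "AAF-F-FJJ",
    )
    print(header, seq, "+", qual, sep="\n")
```

With this layout the two 8-base barcodes occupy the last 16 positions of every read, which is what the barcode base ranges in the YAML file would need to point at.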

However, I received the following error message when executing zUMIs.

Wed Jun 16 11:07:54 EDT 2021
Filtering...
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
Wed Jun 16 11:07:55 EDT 2021
Warning message:
package ‘ggplot2’ was built under R version 4.0.3
Error in eval(bysub, parent.frame(), parent.frame()) :
  object 'XC' not found
Calls: cellBC -> [ -> [.data.table -> eval -> eval

Attached is the yaml file I used (test_zUMIs.yaml.txt). Could you help me troubleshoot? Thank you so much!

cziegenhain commented 3 years ago

Try gzipping your fastq files. I think there are issues with using plain, uncompressed .fq files!
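Compressing the files can be done with the `gzip` command-line tool or, as in this minimal sketch, with Python's standard library. The function name `gzip_fastq` and the file name in the usage note are placeholders, not names from the thread.

```python
import gzip
import shutil


def gzip_fastq(src_path, dst_path=None):
    """Write a gzip-compressed copy of a plain-text FASTQ file.

    The original file is left in place; the compressed copy is written
    next to it with a '.gz' suffix unless dst_path is given.
    Returns the path of the compressed file.
    """
    if dst_path is None:
        dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream the data in chunks so large FASTQ files are not loaded into memory.
        shutil.copyfileobj(src, dst)
    return dst_path
```

For example, `gzip_fastq("reads_R1.fq")` would write `reads_R1.fq.gz`, and the YAML file would then need to reference the `.gz` paths.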

juicejulia commented 3 years ago

Thank you! This indeed seems to have been the problem! At least I have moved on to the Mapping step now. Are gz files always required as input?