minoda-lab / universc

UniverSC: a flexible cross-platform single-cell data processing pipeline
https://genomec.gsc.riken.jp/gerg/UniverSC/UniverSC_app_release/
GNU General Public License v3.0

STRT-Seq technology and errors #12

Closed Davidwei7 closed 1 year ago

Davidwei7 commented 1 year ago

Dear Sir/Madam,

I hope you are well. After resolving the problem with the Docker image on my cluster, I tried my first run of launch_universc.sh with the STRT-Seq technology. I ran everything in a Singularity image converted from the Docker image. My command was:

launch_universc.sh -R1 SRR6026844_sra_S1_L001_R1_001.fastq -R2 SRR6026844_sra_S1_L001_R2_001.fastq -t strt-seq -r /lustre/project/m2_jgu-canshank3/Comparison/Human/HomSap_GRCh38 -i SRR6026844

Please see the snapshots below of the processes and errors:

image image image image

Please see below the first few rows of the fastq files:

head -n 24 SRR6026844.sra_S1_L001_R1_001.fastq
@SRR6026844.sra.fastq.1 1 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAAAAGAGAAAAGTGGAGGGATGTGTGGGCCTAGACAGGGGAAAAAGGAGAACAGGAGGCTCCAGACTGGTGAGGAAGGGGAGTGGGCTGGGCGTGCGGCTCATGCCTGTCATCCCAGC
+SRR6026844.sra.fastq.1 1 length=150
AA<<FFJJJFJJJJJJAJJJJJFA<JFJF<7FFFJJ--FJJJJA-F-AJJF<7FAA-FFJJ<AJJJFJJJ--7AJAJJFFJ<J7AJFA<FJ-7-AAJ7JF<<F7AJAAFJ7--777FJJFAA<JA-AJFAJJ-<7<7<FFAJF-FFAF7F
@SRR6026844.sra.fastq.2 2 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTCGTATCAAAGCAGAGTACATGGG
+SRR6026844.sra.fastq.2 2 length=150
AA7FAF7-FFJJJJJJFJJJJJAAJJ7FFJJJJJJJJJJJJJJJJJJJJJJJFJJJFJJJAJJJJJJJJJAJJJJJJJJJJJ-<JJ<J<<-FJA<-<--A7AFJ--7AJJ<<-FF-FJAJA-A<-7F-7AA<--7---FF-)--<AJFJJ
@SRR6026844.sra.fastq.3 3 length=150
AAGCAGTGGTATCAACGCAGAGTACATGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR6026844.sra.fastq.3 3 length=150
AA<FFF<FFFJJJJJJJFJ<JF<JFJ<A<7-FJJJJJFJFJFJJJJJJJJJJFJJJJJJJJJJJJJJJJAJFJJJJ<FJ-FJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJAJJJAJFFJJJJJFFJJJJJFJJJFFAA7FFJ
@SRR6026844.sra.fastq.4 4 length=150
TGACCTGTCCCCTCTGGCTGCCTCTGAGTCTGAATCTCCCAAAGAGAGAAACCAATTTCTAAGAGGACTGGATTGCAGAAGACTCGGGGACAACATTTGATCCAAGATCTTAAATGTTATATTGATAACCATGCTCAGCAATGAGCTATT
+SRR6026844.sra.fastq.4 4 length=150
AA<-<7AFFJAFJJJF-7<FJFFAJJA7FFFJF7FJ7JJJF7FJAJFJFF7FFFFJ-FJJJ-<A-F<JFJ-<AA7-AJJJ<FFFJFFJF<JJAJF-FJA-FJJJ7FJJJJF<7A<FJFFJ-<AAJJJ7F7<F<FF7-<7<-<FAF<FAFJ
@SRR6026844.sra.fastq.5 5 length=150
GGAAGGAAGGAAGAAAGAAAGAAAGATAGAGAGAGAGAGAGAGAGAGAAAGATAGAGAGAAATAAAGAAACAAAGAAAGAAAGAAAGAAAGAAAGAAAAAAAAAGAAAAATACAAAAAAAAAAATTCACTTAACTCAGGGGTTCGGAGAT
+SRR6026844.sra.fastq.5 5 length=150
-A-FFF77F-F<<F<JJAF<AJFFJ----<<FJAFFF<FF<FJF<<AFJJJ<<FJJFJFJJ7-F<JAJJA---7A<7AF-<7-777AJAFJJ<-AJJ-F-<-A---A---77F7--7F<FAF<A-------7--7--A--))-)7)---7
@SRR6026844.sra.fastq.6 6 length=150
CCTCCAGATACCACTGAGCCTCTTGCCCATGATTCAGAGCTTTCAAGGATAGGCTTTATTCTGCAAGCAATCAAATAATAAATCTATTCTGCTGAGAGATCACAAAAAAAAAAAAAAAAAAAAAAAAAAACCTATTTGCTGATGAGATCA
+SRR6026844.sra.fastq.6 6 length=150
AA<7-7<<FFJJFFFJJJJJJJ<FJF<J<JFJJJJJJJJJJJJJFJJJFA7F<FJJJFFFJJAJFJ<FF-FA77JA-AJFFJFFA7-FA-FJJJ-AFJ----<F-<AJJA<<--AF-7AFA--AF-A<--7-7-AA<---7---7---7-

head -n 24 SRR6026844_sra_S1_L001_R2_001.fastq
@SRR6026844.sra.fastq.1 1 length=150
NAGGTGCATTCGCCCTCCGTAGAAATCCATGCCAAGTACGCTCCTTCCATTGATTTTCTTGGATCGGGTGTGCACCGCGTAGCTCAGCATGGCAAGTCTGTGTAGTCCGTGGACCCGCCAGGACCCCCCGCCGCACGAGACGCAATACGT
+SRR6026844.sra.fastq.1 1 length=150
AAA--A--777-A---AA--7------7-7))--)---7)-7----7--------7--77-----7))-))7-7)<--))77-7--)---))--)----7-----7-))7))-)))))-))))-))))-)-)-)))))-))))7---7-
@SRR6026844.sra.fastq.2 2 length=150
NTATGACTCCACCCCTCAGAGAGGAGGAGGCGACGGGGACAACAACTCACAGAGAGCAAAGTCCGTGGCAACCACCCCGTCTGCGGAGAGCAGGTCCGACCCTACTAGACGAGAGACAACGAACGCCGGACCGCACAATGGCGAGAGCTA
+SRR6026844.sra.fastq.2 2 length=150
<A-----A---7A--A--<-7---))7)7--7)-)---77<F-7-7--7---7-77------77--)))-7)--))))))))-))7)7))-)))))))))))-)----7--)-)----------)))))))))7-)<----))-)))-7
@SRR6026844.sra.fastq.3 3 length=150
NCCACATATAGGGAAACATTTTAATTCTTAGTTATTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTCATTTTTTTTATTT
+SRR6026844.sra.fastq.3 3 length=150
AAAA-F-A<-A--7----------7--------7------77--F<----7<-<7A<A--AA7-A-----F-<<-<-7--7--7-<77A<7A7-A<-A-F--<A<<-------------A<7-F7-------7----7----7---7--
@SRR6026844.sra.fastq.4 4 length=150
NACTTCGATATAAGATTTTTTTTTTTATTTATTACTCAAAGTTTAGAACATTTTATTAAAGTACAAAAATGTTAGAATTTAGCTAATAGAAAAACATAGTAAATATTTAAAAAAACGCTTATAAAATTACTCAAGGCACCCACAGAAAAC
+SRR6026844.sra.fastq.4 4 length=150
AAFFAJ-FAJFJJJJ----------------------A---7-77-FFF----7--<-A-7A<--7<----A-A7<---77<-7<-77<FJ77-<-7--AAF-A-------AFJF<---7-<7--7-<----7--)-))-))))7-7--
@SRR6026844.sra.fastq.5 5 length=150
TCGAAGTATGGTGATATCGGAAGAGCTTCGAGTACGTAAATAGTGTAGATCTCGGTTGTCGTCTTATCATTAAAAAAACATTTCTTACTTTTCTCTCTTCGCACACCTCACTTCCTCGCTATATTGCTTCCTCCCTTCCGGGGACAGACC
+SRR6026844.sra.fastq.5 5 length=150
--AF-FF<--7F7-A---<-7--------7-7---7---7-------7----<-)--)---------7---<------------------------7----)-))))-)------7-<))----7----7)--7<)-7<))----))--)
@SRR6026844.sra.fastq.6 6 length=150
TATCAGCAAATAGGGTTTTTTTTTATTATTTTATTTTTTTTTTGATCCCTGAGGAGAATAGCGTTCATATGTGAGTTCTGGCAGAACAAAGGCTAACCTTGAAAGGCCTGTTATCTGGGACAGAAGCCCAGGAGTGCTCGTGTCTGTACC
+SRR6026844.sra.fastq.6 6 length=150
AAAAAFF<FF-FJ-F----------------------------77---)-))7A-7<7----------7------------777)----7A--------7-<AA7<-)7-)-----77)))))----7)))-)7-))-)7)7-)------

Do you have any idea why the process was not completed?

I also have some information about the FASTQ files, which I am sharing here in case it helps resolve the problem I am facing.

Firstly, the sequence structure of the fastq file is this:

image

Secondly, more detail on how the authors analysed their data is in their GitHub repository (link: [https://github.com/zorrodong/HECA/tree/master/scRNA-seq_pipeline_hg38]).

I am not sure whether I am correct in thinking the following:

  1. The sequence structure of these FASTQ files is the opposite of what UniverSC assumes by default, so I need to swap read 1 and read 2 before running the command?
  2. The FASTQ files use their own custom barcodes, so I will need to provide a list of barcodes with -b barcode_96_8bp.txt (barcode_96_8bp.txt is found on their GitHub page: [https://github.com/zorrodong/HECA/blob/master/scRNA-seq_pipeline_hg38/barcode_96_8bp.txt]).
  3. The sequence structure of these FASTQ files is in line with strt-seq, but the barcode length is 8 and the UMI length is 8, so I need to use custom_8_8 in my command? Does the current UniverSC support this setting?
  4. The authors of this dataset also used umi_tools in their pipeline to first extract UMIs and barcodes from the raw FASTQ files. Do I need to do this before using UniverSC? (the authors' pipeline is this: image )
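The read layout asked about in points 1 to 3 can be checked directly on the data before choosing a setting. The following is a minimal sketch, not part of UniverSC. It assumes that the cell barcode occupies the first 8 bases of R2 (the layout that custom_8_8 would expect after swapping reads) and that the barcode list has one barcode per line in its first column; both are assumptions, and `tally_barcodes` and `whitelist_hits` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: inspect the first 8 bases of each read in a FASTQ file.
# Assumption: the cell barcode sits at positions 1-8 of the read.

# Print "count<TAB>barcode" for each observed 8-base prefix, most frequent first.
tally_barcodes() {
    awk 'NR % 4 == 2 { count[substr($0, 1, 8)]++ }
         END { for (bc in count) print count[bc] "\t" bc }' "$1" |
        sort -k1,1nr -k2,2
}

# Print "hits/total": how many reads start with a barcode from the list.
# Arguments: barcode list (assumed one barcode per line), FASTQ file.
whitelist_hits() {
    awk 'NR == FNR { ok[$1] = 1; next }
         FNR % 4 == 2 { total++; if (substr($0, 1, 8) in ok) hit++ }
         END { printf "%d/%d\n", hit, total }' "$1" "$2"
}

# Example usage (commented out; file names are those from this thread):
# tally_barcodes SRR6026844_sra_S1_L001_R2_001.fastq | head
# whitelist_hits barcode_96_8bp.txt SRR6026844_sra_S1_L001_R2_001.fastq
```

If most reads start with a listed barcode, that would support the barcode being at the start of R2; a low fraction would suggest the barcode sits elsewhere or in the other read.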

I am terribly sorry for giving so much information on my issue. I am quite new to complex bioinformatics problems and want to use your software for integrative analysis. I have three datasets, generated with BD Rhapsody, 10x Genomics and STRT-Seq (described above), and the STRT-Seq FASTQ files described in this post are the main reference data we are comparing against, so I want to do this correctly.

Thanks for developing this tool, and looking forward to your response.

David

Davidwei7 commented 1 year ago

Hi all, I was wondering whether you have had a chance to look into the issue I am experiencing, which is described in my two posts? Thank you in advance. Looking forward to your response. Best Wishes, David

TomKellyGenetics commented 1 year ago

Hi, sorry for the delayed response. I don't have much time at the moment but I have some ideas on what may be causing this. I think it is unrelated to the technology. The "proc" command is used to detect the number of cores available to set the default number of threads.

Please try running it again with the number of threads set manually with --threads and let us know if the problem persists. I'll note that your system configuration differs from ours, as discussed in previous issues, so some dependencies may be missing.
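For illustration, a rerun with an explicit thread count might look like the sketch below. It assumes that `--threads` accepts an integer value and reuses the file names and reference path from the original command; `detect_threads` is a hypothetical helper, not part of UniverSC. The command is printed rather than executed so it can be checked first.

```shell
#!/usr/bin/env bash
# Sketch only: choose a thread count explicitly instead of relying on
# automatic detection inside the pipeline.

# Hypothetical helper: report the number of online cores, or 1 if unknown.
detect_threads() {
    local n
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    # Guard against empty or non-numeric output.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

threads=$(detect_threads)

# Print the command for review; remove "echo" to actually run it.
echo launch_universc.sh \
    -R1 SRR6026844_sra_S1_L001_R1_001.fastq \
    -R2 SRR6026844_sra_S1_L001_R2_001.fastq \
    -t strt-seq \
    -r /lustre/project/m2_jgu-canshank3/Comparison/Human/HomSap_GRCh38 \
    -i SRR6026844 \
    --threads "$threads"
```

On a shared cluster it may be safer to replace the detected value with the number of cores actually allocated to the job.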

TomKellyGenetics commented 1 year ago

This dataset appears to use SmartSeq2.

Construction protocol: A modified Smart-seq2 protocol was applied for single-cell RNA-seq. Briefly, a single cell was picked into the lysis buffer by mouth pipette. The reverse transcription reaction was performed with 24 oligo (dT) primer anchored with the 8 bp cell specific barcode, and also with 8 bp unique molecular identifiers (UMIs).

https://www.ncbi.nlm.nih.gov/sra/?term=SRR6026844 https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR6026844

Note that our code supports the following configurations:

  launch_universc.sh: STRT-Seq (6 bp barcode, no UMI): strt-seq
  launch_universc.sh: STRT-Seq-C1 (8 bp barcode, 5 bp UMI): strt-seq-c1
  launch_universc.sh: STRT-Seq-2i (13 bp barcode, 6 bp UMI): strt-seq-2i

It is possible to support an 8bp UMI but it will require a dedicated configuration. If it is a popular protocol we can support this but it appears to be a custom workflow used in this paper. Another possible workaround is to rename R1 and R2 (manually switch them) and run custom_8_8 which assumes R1 contains [BC][UMI].... and R2 contains transcript reads (as for 10x settings).
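The read-swapping workaround could be sketched as follows. This is an untested illustration, not a confirmed recipe: it assumes the original R2 carries the barcode and UMI and the original R1 carries the transcript, and that custom_8_8 is accepted by the installed version. `abspath` and `swap_reads` are hypothetical helpers, and the output directory name is arbitrary. Symlinks are used so the original files are left untouched.

```shell
#!/usr/bin/env bash
# Sketch only: expose the original R2 (barcode + UMI) as R1 and the original
# R1 (transcript) as R2, using symlinks in a separate directory.

# Resolve a path to an absolute one without relying on realpath.
abspath() {
    printf '%s/%s\n' "$(cd "$(dirname "$1")" && pwd)" "$(basename "$1")"
}

# Arguments: original R1, original R2, output directory.
swap_reads() {
    local r1="$1" r2="$2" outdir="$3"
    mkdir -p "$outdir"
    # Refuse to write into the input directory: that would replace the inputs.
    if [ "$(cd "$outdir" && pwd)" = "$(cd "$(dirname "$r1")" && pwd)" ]; then
        echo "swap_reads: output directory must differ from input directory" >&2
        return 1
    fi
    ln -sf "$(abspath "$r2")" "$outdir/$(basename "$r1")"   # old R2 -> new R1
    ln -sf "$(abspath "$r1")" "$outdir/$(basename "$r2")"   # old R1 -> new R2
}

# Example usage (commented out; file names and paths are those from this thread):
# swap_reads SRR6026844_sra_S1_L001_R1_001.fastq \
#            SRR6026844_sra_S1_L001_R2_001.fastq swapped
# launch_universc.sh \
#     -R1 swapped/SRR6026844_sra_S1_L001_R1_001.fastq \
#     -R2 swapped/SRR6026844_sra_S1_L001_R2_001.fastq \
#     -t custom_8_8 \
#     -r /lustre/project/m2_jgu-canshank3/Comparison/Human/HomSap_GRCh38 \
#     -i SRR6026844
```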

I'll note the paper here to investigate later in more detail:

Fan X et al., "Spatial transcriptomic survey of human embryonic cerebral cortex by single-cell RNA-seq analysis.", Cell Res, 2018 Jul;28(7):730-745

TomKellyGenetics commented 1 year ago

@Davidwei7 sorry for the delayed response. I've investigated issues with these protocols and updated the source code to support it.

Please note that this protocol by Fan et al. (2018) is significantly modified from the protocol originally published by Islam et al. (2011).

We modified the STRT-seq method for amplification of single-cell transcriptomes by changing the reverse transcription primer, the induced cell barcode, and the unique molecular identifier (UMI).

This requires a different bioinformatics approach.

Raw reads were first segregated based on the cell-specific barcode information in read 2 of the pair-ended reads. Then, sequences in read 1 were trimmed with customized scripts to remove the TSO sequence, the polyA tail sequence and sequences with low-quality bases (N > 10%) or contaminated with adapters. Subsequently, the stripped read 1 sequences were aligned to the hg19 human reference genome.

Therefore I have created separate technology settings "strt-seq" for the original protocol and "strt-seq-2018" for the custom version. I've pushed this new configuration to the "dev" branch so it is possible to update to the development version to try it. There are minor changes to the source code so I expect it will run without errors. I've tested it on raw SRA data in FASTQ format from both publications and confirmed it created Cell Ranger compatible files.

TomKellyGenetics commented 1 year ago

Closing this issue as this technology is now supported. Raw reads from SRR6026844 were tested without errors. Please re-open this issue or file another one if there are still problems with your environment preventing you from replicating this.

adc0032 commented 9 months ago

Hi! Has this been added to the main branch and incorporated into what is installed in the Docker image? I don't have an option for strt-seq-2018 in my Docker installation.

TomKellyGenetics commented 9 months ago

v1.2.7 has been merged and released on GitHub. Docker builds are in progress and will be available soon.

TomKellyGenetics commented 9 months ago

@adc0032 @Davidwei7 @kbattenb The latest version (1.2.7) passed docker builds and is now available on dockerhub: https://hub.docker.com/r/tomkellygenetics/universc/tags

This version supports the STRT-Seq, PIP-Seq, and VASA-Seq protocols. I have some minor changes to versioning and issues #17 and #20 under consideration, but the above technologies should work, resolving issues #12 and #16.

adc0032 commented 9 months ago

@TomKellyGenetics

Thank you for getting this updated!

Should strt-seq-2018 still be undergoing file format conversion in line 3138?

https://github.com/minoda-lab/universc/blob/7cbd039613b45c64f4b6d8219906aafda28dd5f9/launch_universc.sh#L3662

I don't see it included here in the STRT-Seq section.

TomKellyGenetics commented 8 months ago

I think not, since UMIs are already included in the 2018 custom protocol. It may be necessary to remove the TSO sequence as described in the paper (by hard trimming R1) if you are using (paired-end) 10x 5' scRNA chemistry settings. I think it is not necessary to perform TSO conversion on R2 after the barcode and UMI, as you can just use the 10x 3' scRNA chemistry settings, which ignore the rest of this read.
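If trimming R1 does turn out to be needed, one possible sketch is shown below. It is not the paper's customized script and not part of UniverSC. It assumes the TSO is the 30-base prefix shared by the first three R1 reads posted earlier in this thread, and it trims only reads that actually start with that sequence, which is slightly more conservative than an unconditional hard trim. `trim_tso` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: remove a leading TSO from each FASTQ read and trim the
# quality string by the same length.

# Assumption: this prefix, seen at the start of R1 reads in this thread, is the TSO.
TSO=AAGCAGTGGTATCAACGCAGAGTACATGGG

# Argument: FASTQ file (uncompressed, four lines per record). Writes to stdout.
trim_tso() {
    awk -v tso="$TSO" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = $0
            # index() == 1 means the read begins with the TSO.
            if (index(seq, tso) == 1) {
                seq  = substr(seq,  length(tso) + 1)
                qual = substr(qual, length(tso) + 1)
            }
            print header "\n" seq "\n" plus "\n" qual
        }' "$1"
}

# Example usage (commented out; file names are those from this thread):
# trim_tso SRR6026844_sra_S1_L001_R1_001.fastq > SRR6026844_trimmed_R1.fastq
```

Only the first copy of the TSO is removed, so reads made of repeated TSO copies (like the second R1 read above) would still contain adapter sequence, and a read consisting only of the TSO would be left empty; both cases would need separate handling.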