About repeating the analysis of Smartseq3.Fibroblasts.NovaSeq

loverlyday commented 3 years ago

Hi, I want to repeat the Fiborblasts.novaSeq data ,i use the zUMIs to get the bam file .then i want to use the ss3_isofrom.py to get the isoform. but i get a problem.

Traceback (most recent call last): File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 479, in _get_reads report_gene = gobj.get_aligned_reads(n_read_limit, passed_cells) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in get_aligned_reads nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) KeyError: 1

my gtf file like this:

description: evidence-based annotation of the mouse genome (GRCm38), version M18 (Ensembl 93)

provider: GENCODE

contact: gencode-help@ebi.ac.uk

format: gtf

date: 2018-07-02

chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1"; chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; ranscript_name "RP23-271O17.1-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

the temp and temp_merged.bed in .tempDir like this : _temp_xxx： chr1 133321874 133321976 . . +

_temp_xxx_merged: chr1 133619870 133621801 chr1 133622116 133622184 chr1 133622983 133623150

and the exon file like this: chr1 133619870 133621801 1 chr1 133622116 133622184 2 chr1 133622983 133623150 3 chr1 133624482 133624810 4 chr1 133625847 133625930 5 chr1 133626766 133627270 6

in you code you select exon.iloc[0,3] as strand,but when you define you strand_flags dict like this: self.strand_flags = {'+': [99, 147], '-': [83, 163]}

so ,what should i do to fix the problem ,or there is something wrong in my gtf file ?

Logancreator commented 3 years ago

@loverlyday Hi ，can you tell me the process of zUMIs? I run it many times，but the software always give me a error information as follow: Error: Execution halted YAML file has an error. Look at the zUMIs_YAMLerror.log or contact developers.

can u give me your YAML file to refer？ and I want to know whether you use the sequencing file（Smartseq3.Fibroblasts.NovaSeq.R1.fastq.gz,Smartseq3.Fibroblasts.NovaSeq.R2.fastq.gz,Smartseq3.Fibroblasts.NovaSeq.I1.fastq.gz,Smartseq3.Fibroblasts.NovaSeq.I2.fastq.gz）,gtf file, STAR index?

Hope your early reply.

loverlyday commented 3 years ago

the yaml file is like this: project: Smartseq3_Fibroblasts sequence_files: file1: name: /home/xxx/temp/smart-seq3_mouse/Smartseq3.Fibroblasts.NovaSeq.R1.fastq.gz base_definition:

cDNA(23-150)
UMI(12-19) find_pattern: ATTGCGCAATG file2: name: /home/xxx/temp/smart-seq3_mouse/Smartseq3.Fibroblasts.NovaSeq.R2.fastq.gz base_definition:
cDNA(1-150) file3: name: /home/xxx/temp/smart-seq3_mouse/Smartseq3.Fibroblasts.NovaSeq.I1.fastq.gz base_definition:
BC(1-8) file4: name: /home/xxx/temp/smart-seq3_mouse/Smartseq3.Fibroblasts.NovaSeq.I2.fastq.gz base_definition:
BC(1-8) reference: STAR_index: /home/xxx/ref/mm10_star2.7.3a_index GTF_file: /home/xxx/index/mm10/gencode.vM18.annotation.sorted.gtf additional_STAR_params: '--limitSjdbInsertNsj 2000000 --clip3pAdapterSeq CTGTCTCTTATACACATCT ' additional_files: out_dir: /home/xxx/temp/smart-seq3_mouse/new_en/zUMIs_out num_threads: 40 mem_limit: 100 filter_cutoffs: BC_filter: num_bases: 3 phred: 20 UMI_filter: num_bases: 3 phred: 20 barcodes: barcode_num: ~ barcode_file: /home/xxx/temp/smart-seq3_mouse/mouse_barcode.txt automatic: no BarcodeBinning: 1 nReadsperCell: 100 demultiplex: yes counting_opts: introns: yes downsampling: '0' strand: 0 Ham_Dist: 1 write_ham: yes velocyto: no primaryHit: yes twoPass: no make_stats: yes which_Stage: Filtering

the squence file were download from the paper link E-MTAB-8735 .the gtf file is gencode.vM18.annotation.sorted.gtf downlaod from gencode the STAR index using STAR_2.7.3a. By the way are you the author of the ss3_isofrom.py?

cziegenhain commented 3 years ago

@PingChen-Angela is the author of the ss3_isoform scripts and will look at your issue shortly.

@Jianye-Chang I have answered your question on the zUMIs github. The YAML posted above lacks the correct indentation, Please check carefully the documentation and examples: https://github.com/sdparekh/zUMIs/wiki/Usage#setup-using-the-yaml-config-file https://github.com/sandberg-lab/Smart-seq3/blob/master/allele_level_expression/mouse_cross.yaml https://www.protocols.io/view/smart-seq3-protocol-bcq4ivyw?step=18

Logancreator commented 3 years ago

@cziegenhain @loverlyday Thanks！！！

Logancreator commented 3 years ago

@loverlyday sorry,by the way, Do you use this method "git" to download zUMIs? I download it's tar.gz file.Emmmm,too much error information

loverlyday commented 3 years ago

@Jianye-Chang yes ,i use git download zUMIs .you should check the zUMIs dependence carefully. then you will install successful.

Logancreator commented 3 years ago

@Jianye-Chang yes ,i use git download zUMIs .you should check the zUMIs dependence carefully. then you will install successful.

Oh my god，my server download it so slowly，ok ，if i download it by git，no configuration dependencies are required？Just run it？Thx

loverlyday commented 3 years ago

@Jianye-Chang No,if you download by git,you should prepare the dependence before you install it. if not ,you can't run zUMIs successfully.

cziegenhain commented 3 years ago

Hi,

It is possible to run zUMIs with or without manual installation of the dependencies. The usage and potential installation is described in detail: https://github.com/sdparekh/zUMIs#quickstart https://github.com/sdparekh/zUMIs/wiki/Installation#dependencies

Logancreator commented 3 years ago

@loverlyday Hi,excuse me.Another problem arises...I run smart-seq3 data,but when the software run,it give me a information as follow:

[1] "2021-01-25 16:53:17 CST" [1] "Coordinate sorting intermediate bam file..." [bam_sort_core] merging from 0 files and 8 in-memory blocks... [1] "2021-01-25 17:00:58 CST" [1] "Hamming distance collapse in barcode chunk 1 out of 1" [1] "Splitting data for multicore hamming distance collapse..." [1] "Setting up multicore cluster & generating molecule mapping tables ..." [1] "Finished multi-threaded hamming distances" [1] "Correcting UMI barcode tags..." Traceback (most recent call last): File "/md01/changjy/software/zUMIs/correct_UBtag.py", line 123, in main() File "/md01/changjy/software/zUMIs/correct_UBtag.py", line 99, in main bcs = load_bcs(args.bcs) File "/md01/changjy/software/zUMIs/correct_UBtag.py", line 16, in load_bcs with open(bcpath) as f: FileNotFoundError: [Errno 2] No such file or directory: '/md01/changjy/data/result/zUMIs_output/Smart-seq3kept_barcodes_binned.txt' [1] "4.5e+08 Reads per chunk" [E::hts_open_format] Failed to open file "/md01/changjy/data/result/Smart-seq3.filtered.Aligned.GeneTagged.UBcorrected.sorted.bam" : No such file or directory samtools index: failed to open "/md01/changjy/data/result/Smart-seq3.filtered.Aligned.GeneTagged.UBcorrected.sorted.bam": No such file or directory [1] "2021-01-25 17:08:07 CST" [1] "Here are the detected subsampling options:" [1] "Automatic downsampling" [1] "Working on barcode chunk 1 out of 1" [1] "Processing 184 barcodes in this chunk..." Error in value[3L] : failed to open BamFile: file(s) do not exist: ‘/md01/changjy/data/result/Smart-seq3.filtered.Aligned.GeneTagged.UBcorrected.sorted.bam’ Calls: reads2genes_new ... tryCatch -> tryCatchList -> tryCatchOne -> Execution halted Mon Jan 25 17:08:08 CST 2021 Loading required package: yaml Loading required package: Matrix [1] "loomR found" Error in gzfile(file, "rb") : cannot open the connection Calls: rds_to_loom -> readRDS -> gzfile In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file '/md01/changjy/data/result/zUMIs_output/expression/Smart-seq3.dgecounts.rds', probable reason 'No such file or directory' Execution halted Mon Jan 25 17:08:13 CST 2021 Descriptive statistics... [1] "I am loading useful packages for plotting..." [1] "2021-01-25 17:08:13 CST" Error in gzfile(file, "rb") : cannot open the connection Calls: readRDS -> gzfile In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file '/md01/changjy/data/result/zUMIs_output/expression/Smart-seq3.dgecounts.rds', probable reason 'No such file or directory' Execution halted Mon Jan 25 17:08:18 CST 2021

This is very odd.Because “Smart-seq3kept_barcodes_binned.txt” Shouldn't this software be generating this file by himself？ But the infomation tell me that it doesn't have this document.....What？？？？

Hope your reply.Thx

cziegenhain commented 3 years ago

Please do not post the same comments in the zUMIs GitHub and here. I will reply to you when time permits.

kwglam commented 3 years ago

Hi, I want to repeat the Fiborblasts.novaSeq data ,i use the zUMIs to get the bam file .then i want to use the ss3_isofrom.py to get the isoform. but i get a problem.

Traceback (most recent call last): File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 479, in _get_reads report_gene = gobj.get_aligned_reads(n_read_limit, passed_cells) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in get_aligned_reads nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) KeyError: 1

my gtf file like this:

description: evidence-based annotation of the mouse genome (GRCm38), version M18 (Ensembl 93)

provider: GENCODE

contact: gencode-help@ebi.ac.uk

format: gtf

date: 2018-07-02

chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1"; chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; ranscript_name "RP23-271O17.1-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

the temp and temp_merged.bed in .tempDir like this : _temp_xxx： chr1 133321874 133321976 . . +

_temp_xxx_merged: chr1 133619870 133621801 chr1 133622116 133622184 chr1 133622983 133623150

and the exon file like this: chr1 133619870 133621801 1 chr1 133622116 133622184 2 chr1 133622983 133623150 3 chr1 133624482 133624810 4 chr1 133625847 133625930 5 chr1 133626766 133627270 6

in you code you select exon.iloc[0,3] as strand,but when you define you strand_flags dict like this: self.strand_flags = {'+': [99, 147], '-': [83, 163]}

so ,what should i do to fix the problem ,or there is something wrong in my gtf file ?

Hi @loverlyday,

I have got the same error flags as what you posted in the first thread. Any chance you can kindly share your experience of fixing so? Thanks!

PingChen-Angela commented 3 years ago

Hi, I want to repeat the Fiborblasts.novaSeq data ,i use the zUMIs to get the bam file .then i want to use the ss3_isofrom.py to get the isoform. but i get a problem.

Traceback (most recent call last): File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/home/XXX/anaconda3/envs/trim/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 479, in _get_reads report_gene = gobj.get_aligned_reads(n_read_limit, passed_cells) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in get_aligned_reads nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) File "/home/XXX/temp/ERR3835349/pyModule/informative_reads.py", line 90, in nreads = len([r_idx for r_idx, x in enumerate(r_iterator) if x.flag in self.strand_flags[self.strand]]) KeyError: 1

my gtf file like this:

description: evidence-based annotation of the mouse genome (GRCm38), version M18 (Ensembl 93)

provider: GENCODE

contact: gencode-help@ebi.ac.uk

format: gtf

date: 2018-07-02

chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1"; chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; ranscript_name "RP23-271O17.1-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

the temp and temp_merged.bed in .tempDir like this : _temp_xxx： chr1 133321874 133321976 . . +

_temp_xxx_merged: chr1 133619870 133621801 chr1 133622116 133622184 chr1 133622983 133623150

and the exon file like this: chr1 133619870 133621801 1 chr1 133622116 133622184 2 chr1 133622983 133623150 3 chr1 133624482 133624810 4 chr1 133625847 133625930 5 chr1 133626766 133627270 6

in you code you select exon.iloc[0,3] as strand,but when you define you strand_flags dict like this: self.strand_flags = {'+': [99, 147], '-': [83, 163]}

so ,what should i do to fix the problem ,or there is something wrong in my gtf file ?

Hi @loverlyday and @kwglam, I found the problem might come from bedtools versions. I used bedtools v2.26.0 and it gives me strand information if I run the command as below.

$ sort -k1,1 -k2,2n _temp_ENSMUSG00000092329.bed | bedtools merge -s -d -1 -i - chr8 119910840 119910939 + chr8 124295357 124295451 + chr8 124305514 124305668 + chr8 124323671 124323770 + chr8 124323991 124324059 + chr8 124324270 124324336 + chr8 124328091 124328213 + chr8 124329712 124329800 + chr8 124332008 124332096 + chr8 124332567 124332671 + chr8 124334274 124334401 + chr8 124336573 124336666 + chr8 124337261 124337345 + chr8 124338442 124338569 + chr8 124340818 124340938 + chr8 124343302 124345722 +

I tested another bedtools version v2.23.0. It does not output strand info… See below: $ sort -k1,1 -k2,2n _temp_ENSMUSG00000092329.bed | bedtools merge -s -d -1 -i - chr8 119910840 119910939 chr8 124295357 124295451 chr8 124305514 124305668 chr8 124323671 124323770 chr8 124323991 124324059 chr8 124324270 124324336 chr8 124328091 124328213 chr8 124329712 124329800 chr8 124332008 124332096 chr8 124332567 124332671 chr8 124334274 124334401 chr8 124336573 124336666 chr8 124337261 124337345 chr8 124338442 124338569 chr8 124340818 124340938 chr8 124343302 124345722

Hope it solves your issue. :=)

sandberg-lab / Smart-seq3