SPLiT-Seq data analysis – Error in assigning reads to UB & GENE

sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs

GNU General Public License v3.0

274 stars 67 forks source link

SPLiT-Seq data analysis – Error in assigning reads to UB & GENE #59

Closed Bioinfo1 closed 6 years ago

Bioinfo1 commented 6 years ago

Hi, Thanks for the zUMIs pipeline. I am using the current version (version 2.0.4) to analyze a SPLiT-Seq dataset, and it is giving the following error during the counting stage:

Error in sample.int(length(x), size, replace, prob) : vector size cannot be NA Calls: collectCounts ... [.data.table -> sample -> sample -> -> sample.int In addition: There were 31 warnings (use warnings() to see them) Execution halted.

Here is the full terminal output:

You provided these parameters:
 YAML file:     SPLiTseq_zUMI_20180824_Sample_TSBC07_BC1_to_BC4.yaml
 zUMIs directory:               /Apps/zUMIs
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 RAM limit:   96
 zUMIs version 2.0.4

Filtering...
Mapping...
Aug 24 11:00:27 ..... started STAR run
Aug 24 11:00:27 ..... loading genome
Aug 24 11:00:44 ..... processing annotations GTF
Aug 24 11:00:51 ..... inserting junctions into the genome indices
Aug 24 11:04:06 ..... started 1st pass mapping
Aug 24 11:05:38 ..... finished 1st pass mapping
Aug 24 11:05:39 ..... inserting junctions into the genome indices
Aug 24 11:07:15 ..... started mapping
Aug 24 11:08:50 ..... finished successfully
Counting...
[1] "2018-08-24 11:09:00 BST"
Read 2368777 rows and 2 (of 2) columns from 0.073 GB file in 00:00:04
[1] "414 barcodes detected."
[1] "357978 reads were assigned to barcodes that do not correspond to intact cells."
[1] "4.32e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "gencode_v27_basic_annotation.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.28.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P /SPLiTseq_ ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid167733 ...      ||
||    Features : 683409                                                       ||
||    Meta-features : 50276                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file /SPLiTseq_zUMI_201808 ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 4146933                                               ||
||    Successfully assigned fragments : 184398 (4.4%)                         ||
||    Running time : 0.59 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[1] "Assigning reads to features (in)"

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.28.1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P /SPLiTseq_ ... ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : primary only                                     ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file ./.Rsubread_UserProvidedAnnotation_pid167733 ...      ||
||    Features : 253366                                                       ||
||    Meta-features : 28966                                                   ||
||    Chromosomes/contigs : 24                                                ||
||                                                                            ||
|| Process BAM file /SPLiTseq_zUMI_201808 ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 4146933                                               ||
||    Successfully assigned fragments : 546559 (13.2%)                        ||
||    Running time : 0.58 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 414 barcodes in this chunk..."
Error in sample.int(length(x), size, replace, prob) :
  vector size cannot be NA
Calls: collectCounts ... [.data.table -> sample -> sample -> <Anonymous> -> sample.int
In addition: There were 31 warnings (use warnings() to see them)
Execution halted
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2018-08-24 11:12:40 BST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/SPLiTseq_zUMI_20180824/zUMIs_output/expression/TSBC07_BC1_to_BC4.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

It looks there are issues in executing the collectCounts function or in its input (reads) from the preceding function reads2genes. Any ideas?

Many thanks, Mani

cziegenhain commented 6 years ago

Hey Mani,

Sorry that you are having an issue.

On the first look, there seems something odd with Rsubread mentioning paired end data.

Could you post your yaml file? Maybe that will give us a better idea.

Best, Christoph

Bioinfo1 commented 6 years ago

Hi Christoph, Thanks very much for examining the issue. Here is the YAML file:

project: TSBC07_BC1_to_BC4
sequence_files:
  file1:
    name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_R1_001.fastq.gz
    base_definition: cDNA(1-151)
  file2:
    name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_R2_001.fastq.gz
    base_definition:
    - cDNA(95-151)
    - BC(11-18,49-56,87-94)
    - UMI(1-10)
  file3:
    name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_I1_001.fastq.gz
    base_definition: BC(1-6)
reference:
  STAR_index: /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/
  GTF_file: /Genomes/GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /SPLiTseq_zUMI_20180824
num_threads: 24
mem_limit: 96
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 10
  UMI_filter:
    num_bases: 1
    phred: 10
barcodes:
  barcode_num: ~
  barcode_file: ~
  BarcodeBinning: 0
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
samtools_exec: samtools
pigz_exec: pigz
STAR_exec: STAR
zUMIs_directory: /Apps/zUMIs
read_layout: PE

Many thanks, Mani

cziegenhain commented 6 years ago

Hey,

Your YAML looks pretty good. Would you mind rerunning the whole pipeline without the cDNA definition in the second read? If that works fine we know more where to look closer.

Bioinfo1 commented 6 years ago

Also, Please find below few lines from the bam file generated by RSubread:

samtools view -h TSBC07_BC1_to_BC4.filtered.tagged.Aligned.out.bam.in.featureCounts.bam | less

@HD     VN:1.4  SO:queryname
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
...............................................................
...............................................................
@SQ     SN:KI270756.1   LN:79590
@SQ     SN:KI270757.1   LN:71251
@PG     ID:STAR PN:STAR VN:STAR_2.5.4b  CL:STAR   --runThreadN 24   --genomeDir /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/   --readFilesType SAM   PE      --readFilesIn /SPLiTseq_zUMI_20180824/TSBC07
_BC1_to_BC4.filtered.tagged.bam      --readFilesCommand samtools   view      --outFileNamePrefix /SPLiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged.   --outSAMtype BAM   Unsorted      --outSAMunmapped Within      --outSAMmult
Nmax 1   --outFilterMultimapNmax 50   --sjdbGTFfile /Genomes/GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf   --sjdbOverhang 150   --twopassMode Basic
@CO     user command line: STAR --readFilesCommand samtools view --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --genomeDir /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/ --sjdbGTFfile /Genomes/
GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf --runThreadN 24 --readFilesIn /SPLiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged.bam --outFileNamePrefix /SP
LiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged. --sjdbOverhang 150 --readFilesType SAM PE --twopassMode Basic
NS500781:295:H3JC3AFXY:1:11101:1059:9206        77      *       0       0       *       *       0       0       NAGCTGAGATCACGCCACTGCACTCCAGCCTGGGTGACAGAGTGAGACTCTGCCACAAAAACATAATAATAGTAATAGTAAAAAAAAAAAAAAAAAACCATGCGAATGCTCTGCCCCCTGAGGAACGGGGAAACAGGGGTAGTCGTCCCCC #AAAAEEEAEEEEEEAEEEEEEE
EEEEEAEEEEE/EE<AEEEEAAEEEEEEEEA/EEEEAAEEAE/A//EEE</AEEEAAE/EE/AEEEEAEA//A/A////A////6/6//E///////////////E///<//</////////</</A< NH:i:0  HI:i:0  AS:i:78 nM:i:8  uT:A:1  BC:Z:ATCATTCCACCACTGTCAGCGTTACAGATC     UB:Z:GAGCATTGGA XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1059:9206        141     *       0       0       *       *       0       0       TTTTTTTTTTTTTTTAATATTAATATTATATTGTTTTTTTGGAAGATGATAAATAAT       <EA<6/6<A<EEEEE<///6/6/</////////</A<6/6<</66</////6/////       NH:i:0  HI:i:0  AS:i:78 nM:i:8  uT:A:1  BC:Z:AT
CATTCCACCACTGTCAGCGTTACAGATC     UB:Z:GAGCATTGGA XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1060:17524       77      *       0       0       *       *       0       0       NATTAACCTGAGTCGTACGCCGATGCGAAACATCGGCCACTCTCACGGCAGCTGGACCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTAAGCCGTCTTCTGCATGAAAAAAAAAAAGGGGGGGGGGGGGGGGGG #AAAAEEEAE/AEEEAEAEEEEA
EEAEEEEEAEEEEEEAEEE<EE<EAEEEEEEEEEEEE//<A/EEE/AAAEEEAEEEEEAEEEEAE/E<EAEEE/A//6/A<<A//6EAAE/<///<///<AAEEE///</A</A</<//A/<////A< NH:i:0  HI:i:0  AS:i:25 nM:i:2  uT:A:1  BC:Z:CCGTGAGACAGGTTAAGACGAGCGCAGATC     UB:Z:GGTCCAGCTG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1060:17524       141     *       0       0       *       *       0       0       ATCTAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAGGGGGGGGGGGGG       /EA<<EA<<AA6<66<//6/<AAAEA/A<AA<EA/<A/<AEEA</6/E/<6//AAAA       NH:i:0  HI:i:0  AS:i:25 nM:i:2  uT:A:1  BC:Z:CC
GTGAGACAGGTTAAGACGAGCGCAGATC     UB:Z:GGTCCAGCTG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:13892       77      *       0       0       *       *       0       0       NGCACGCTCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGTGTGAACCTGGGACACGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGCGCGACAGACTGAGACTCCATCACAAAAAAAAAAAAAAAGA #/AAAEAEAEA/EEEE/EEA/E/
E/EEEE///AE//AEAEAAEE6EE////A//EE//6/////<EAE//EA6<AE/EA</<<A/A/EEE6/AAEE//EEEEEAEAEEAE/EEEEEA<E</</<//AA/6AA/<AAEE/EEEEEEE/AA// NH:i:0  HI:i:0  AS:i:122        nM:i:10 uT:A:1  BC:Z:AGCACCTCCTGATCTCCGCTGATCCAGAAC     UB:Z:CACAAATGTC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:13892       141     *       0       0       *       *       0       0       TTTTTTTTTTTTTTTGATATGGAGTCACAATATTTAGCCAAGGAAGGAGTGAAGAGA       //6666<A<AEA/<</A//AEE//////<////A////////6//6/<AA</<////       NH:i:0  HI:i:0  AS:i:122        nM:i:10 uT:A:1
  BC:Z:AGCACCTCCTGATCTCCGCTGATCCAGAAC     UB:Z:CACAAATGTC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:19342       83      chr20   34872662        255     149M2S  =       34863316        -9495   CCAGAACGGCGTCCTCGTCCTTGCGGCCCCTCCGCGGGGCCGCGGGCGCCGGCGCGTCCTCGGGCAGCCGCGGGAAGCTGGGGATGCTCATGTAGTCCACTGGCGAGTAGGCGCCCAGGGCGCTCTCCTGGCTGGCCTCGTGCTCCGCCCN /EE/E//
/EAE//E//A6///6/6E/6//<///6AA//A<A//<A<</A/AAA/AE/A//AEEEAEEA/EAAEE<E/</E////<EAAA//EEE6EEEEAE<EAAEEEEAEEEEEAAEAEAE//EEEE6EEEEEEEE/EEA/E6EEAAAA# NH:i:1  HI:i:1  AS:i:187        nM:i:8  BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GCCCGCCCCA XS:Z:Unassigned_NoFeatures
NS500781:295:H3JC3AFXY:1:11101:1064:19342       163     chr20   34863316        255     57M     =       34872662        9495    GGGTGCCCCGAAGTAGATCTGCATGACCAGCGCCACGGTGCCACCGGTAGCGAAGGT       AAE/<E6<6<666///<<AA<EA<</A//</A6A<//A6//<6<A/A/A/<A///</       NH:i:1  HI:i:1  AS:i:187
        nM:i:8  BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GCCCGCCCCA XS:Z:Unassigned_NoFeatures
NS500781:295:H3JC3AFXY:1:11101:1065:9143        77      *       0       0       *       *       0       0       NGTCTGGGACCAGCAGGCAAAGATGTATAAATATAAGTGCCTGTCTGGAGACAGGAACCTAAAAGCCAGTGGTTCTGAAACTTTAGTGCACATCTTAATCACCTGAAGGCTTGTTAAAATCCTAATTCTGGCCCCCGACAAACATGCTTTG #/AAAEEE6EEEAEE6E/EEEEE
//EAA/E//EEAE/EEAEA<<E<//EEAEE//EEEAAEE/EEA/AEEE/E6EEEEEE<EAE/EEEEEEEEEE/EE</<<<A/EEAA/AA<6<<EEEEEEE////A6///E<</6AAEE/<</AA//// NH:i:0  HI:i:0  AS:i:133        nM:i:4  uT:A:1  BC:Z:ACATTGGCATTGGCTCAACGTGATCAGATC     UB:Z:CCGCTATTCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:9143        141     *       0       0       *       *       0       0       TTTTTTTTTTTTATAAAAAAATAAAAACAAACAAAACAAAAAAAATAATGTATGAGC       6A/6//6<EE///<//////6//A/<A/A6///<A/</A/6///66//////////6       NH:i:0  HI:i:0  AS:i:133        nM:i:4  uT:A:1
  BC:Z:ACATTGGCATTGGCTCAACGTGATCAGATC     UB:Z:CCGCTATTCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:15685       77      *       0       0       *       *       0       0       NCCTTGATGAGTCGTACGCCGATGCGAAACATCGGCCACTAGAACACGCGGCCGACAAGATCGGAAGAGCACACTTCTGAACTCCAGTCACCAGAACAGCTCGGATGCCGTCTTCTGCTGTCAAAAAAAAACGGGGGGGGGGGGGGGGGGG #/A/A6/AA///EAEEAAEEE//
EEEEEEE6EE/EE/<EEAEEAA//AAE//E//AAAEEEEEEEEEA/A<EAE/EE<EAEEEEEE/AEAE</EE/E////<////A/A////A//<///</<6A<<EE////E//E/EEEEEEEEEAA/E NH:i:0  HI:i:0  AS:i:27 nM:i:5  uT:A:1  BC:Z:GTGTTCTACATCAAGTACGAGCGACAGATC     UB:Z:TGACTGCCGC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:15685       141     *       0       0       *       *       0       0       TCTAGTGTAGATCTCGGTGGACGCCGTATCATTAAAAAAAAAAAGGGGGGGGGGGGG       /<////6//</6//66/A/</AA/AA<<<E/<A6/////6/<A//EAE<///<///A       NH:i:0  HI:i:0  AS:i:27 nM:i:5  uT:A:1  BC:Z:GT
GTTCTACATCAAGTACGAGCGACAGATC     UB:Z:TGACTGCCGC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1066:12877       99      chr11   65182840        255     1S129M159N21M   =       65186228        3445    NGTGCGATGCCTGCAGAGTGGGACCCTCTTCCGTGATGAGGCCTTCCCCCCGGTACCCCAGAGCCTGGGTTACACGGACCTGGGTCCCAATTCCTCCAAGACCTATGGCATCATGTGGAAGCGTCCCACGGAACTGCTGTCAAACCCCCAG

Many thanks, Mani

Bioinfo1 commented 6 years ago

Thank you.

I will run the pipeline without the cDNA definition in the second read and update you.

cziegenhain commented 6 years ago

Cool thank you! The reads actually look as they should as far as I can tell. Could you also check the STAR final log of this PE run and the one without second cDNA read?

Bioinfo1 commented 6 years ago

Thank you.

I am re-running the pipeline in the SE mode (without cDNA definition on one read). I will update you the outcome as it finishes.

Here is the STAR final log for the PE run:

                             Started job on |   Aug 24 11:00:27
                         Started mapping on |   Aug 24 11:07:15
                                Finished on |   Aug 24 11:08:50
   Mapping speed, Million of reads per hour |   140.17

                      Number of input reads |   3698967
                  Average input read length |   208
                                UNIQUE READS:
               Uniquely mapped reads number |   694529
                    Uniquely mapped reads % |   18.78%
                      Average mapped length |   168.66
                   Number of splices: Total |   50072
        Number of splices: Annotated (sjdb) |   44925
                   Number of splices: GT/AG |   40642
                   Number of splices: GC/AG |   2603
                   Number of splices: AT/AC |   65
           Number of splices: Non-canonical |   6762
                  Mismatch rate per base, % |   1.96%
                     Deletion rate per base |   0.05%
                    Deletion average length |   1.42
                    Insertion rate per base |   0.02%
                   Insertion average length |   1.16
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   341923
         % of reads mapped to multiple loci |   9.24%
    Number of reads mapped to too many loci |   756
         % of reads mapped to too many loci |   0.02%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   71.90%
                 % of reads unmapped: other |   0.06%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00

Many thanks, Mani

Bioinfo1 commented 6 years ago

Hi Christoph,

In the SE mode, the pipeline ran without errors. The last few lines of the terminal output are:

[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 414 barcodes in this chunk..."
[1] "2018-08-24 23:35:09 BST"
[1] "I am done!! Look what I produced.../SPLiTseq_zUMI_20180824/zUMIs_output/"
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  4958931 264.9    9968622 532.4  9968622 532.4
Vcells 12865137  98.2   29706687 226.7 57985435 442.4
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2018-08-24 23:35:11 BST"
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
^MRead 53.3% of 3698967 rows^MRead 79.5% of 3698967 rows^MRead 3698967 rows and 3 (of 3) columns from 0.179 GB file in 00:00:04
Warning message:
In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1414295 75.6    4703850 251.3  4703850 251.3
Vcells 2797433 21.4   47254341 360.6 49550965 378.1

Please see below the STAR final log in the SE mode:

                                 Started job on |   Aug 24 23:24:39
                             Started mapping on |   Aug 24 23:30:58
                                    Finished on |   Aug 24 23:32:16
       Mapping speed, Million of reads per hour |   170.72

                          Number of input reads |   3698967
                      Average input read length |   151
                                    UNIQUE READS:
                   Uniquely mapped reads number |   1059890
                        Uniquely mapped reads % |   28.65%
                          Average mapped length |   142.63
                       Number of splices: Total |   64795
            Number of splices: Annotated (sjdb) |   55839
                       Number of splices: GT/AG |   50728
                       Number of splices: GC/AG |   3307
                       Number of splices: AT/AC |   99
               Number of splices: Non-canonical |   10661
                      Mismatch rate per base, % |   2.33%
                         Deletion rate per base |   0.05%
                        Deletion average length |   1.53
                        Insertion rate per base |   0.03%
                       Insertion average length |   1.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   363704
             % of reads mapped to multiple loci |   9.83%
        Number of reads mapped to too many loci |   119
             % of reads mapped to too many loci |   0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   61.46%
                     % of reads unmapped: other |   0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Many thanks, Mani

cziegenhain commented 6 years ago

Hey Mani,

sorry for the slow answer! That looks like there is some bug in the PE read counting. I will run some test data today on my side to find out more.

Apart from that it looks like in your case the addition of the mate makes things worse anyway, so maybe for this dataset I would recommend you SE analysis anyway! Since you have very long reads, I would also recommend to clip poly-A tails with star: additional_STAR_params: --clip3pAdapterSeq AAAAAAAAAAA

I will update here once I have found the bug!

Best, Christoph

Bioinfo1 commented 6 years ago

Hi Christoph,

Thank you. Indeed, trimming of reads might benefit.

One more thing I previously noted was that the pipeline fell through when I used _HamDist: 1. I don't know the error was related to the PE mode, but I will test it again in the SE mode. It would be good if you could please check that as well.

I appreciate your help very much.

Many thanks, Mani

cziegenhain commented 6 years ago

Hey Mani,

Both the PE & Hamming distance tests ran through fine for me.

Here is the log of the PE data:

Filtering...
Mapping...
Aug 25 19:10:23 ..... started STAR run
Aug 25 19:10:23 ..... loading genome
Aug 25 19:14:05 ..... processing annotations GTF
Aug 25 19:14:32 ..... inserting junctions into the genome indices
Aug 25 19:18:44 ..... started mapping
Aug 25 20:12:18 ..... finished successfully
Counting...
[1] "2018-08-25 20:12:36 CEST"
[1] "31 barcodes detected."
[1] "5467393 reads were assigned to barcodes that do not correspond to intact cells."
[1] "2.25e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/mnt/kauffman/chrisz/resources/genomes/Human/Homo_sapiens.GRCh38.91.chr.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.30.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P PE.filtered.tagged.Aligned.out.bam             ||
||                                                                            ||
||             Output file : .Rsubread_featureCounts_pid18423                 ||
||                 Summary : .Rsubread_featureCounts_pid18423.summary         ||
||              Annotation : .Rsubread_UserProvidedAnnotation_pid18423 (SAF)  ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid18423 ...         ||
||    Features : 1199596                                                      ||
||    Meta-features : 49558                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file PE.filtered.tagged.Aligned.out.bam...                     ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 258524806                                             ||
||    Successfully assigned fragments : 89699733 (34.7%)                      ||
||    Running time : 3.70 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[bam_sort_core] merging from 72 files and 24 in-memory blocks...
[1] "Assigning reads to features (in)"

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.30.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P PE.filtered.tagged.Aligned.out.bam             ||
||                                                                            ||
||             Output file : .Rsubread_featureCounts_pid18423                 ||
||                 Summary : .Rsubread_featureCounts_pid18423.summary         ||
||              Annotation : .Rsubread_UserProvidedAnnotation_pid18423 (SAF)  ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid18423 ...         ||
||    Features : 275839                                                       ||
||    Meta-features : 28235                                                   ||
||    Chromosomes/contigs : 24                                                ||
||                                                                            ||
|| Process BAM file PE.filtered.tagged.Aligned.out.bam...                     ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 258524806                                             ||
||    Successfully assigned fragments : 30945865 (12.0%)                      ||
||    Running time : 4.70 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[bam_sort_core] merging from 72 files and 24 in-memory blocks...
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 2"
[1] "Processing 24 barcodes in this chunk..."
[1] "Working on barcode chunk 2 out of 2"
[1] "Processing 7 barcodes in this chunk..."
[1] "2018-08-25 23:02:31 CEST"
[1] "I am done!! Look what I produced.../mnt/kauffman/chrisz/test_zUMIs/PE//zUMIs_output/"
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   5169763  276.1   12002346   641.0   12002346   641.0
Vcells 852538194 6504.4 3937035931 30037.3 4921272378 37546.4

Would you like to upload a million reads of your test data for me to test on?

Best,

Christoph

Bioinfo1 commented 6 years ago

Hi Christoph,

Thanks very much for testing the PE mode and the hamming distance options. Good to know that the tests went fine. I don't know if the version differences in Rsubread could cause the issue. Between versions 1.28 and 1.30, the following new features have been added in Rsubread:

o New parameters in featureCounts() to give more control on the size of overlap between read and feature - 'nonOverlap' and 'nonOverlapFeature'.

o New parameter in featureCounts() to specify the path where files containing counting results for each read are saved - 'reportReadsPath'.

o New parameter in align() and subjunc() to allow reads in the mapping output to have the same order as in the FASTQ file - 'keepReadOrder'.

Unless you use these options, it shouldn't make any difference. I am currently using R 3.4.3 and Bioconductor 3.6. To update Rsubread, I should update R and Bioconductor. I will test if this will make any difference.

I will send you a million reads after obtaining necessary approval.

Many thanks, Mani

cziegenhain commented 6 years ago

Good! We don't use any of these new options but sometimes there can be small updates that don't end up in the changelog. To our knowledge 1.28.1 should be fine though.

Best, Christoph

Bioinfo1 commented 6 years ago

Hi Christoph,

I am still waiting for approval to share a million reads with you. I will update you as soon as I can.

Meanwhile, there seems to be an issue with the reads2genes function. The count matrix generated by the pipeline for introns has the concatenated barcodes as the gene ids instead of the actual gene id. The exons count matrix has the correct gene ids.

I tried to debug the issue and found that splitting of the annotation into exons and introns “Get non-overlapping Introns/Exons" - (*.annotationsSAF.rds) seems OK:

> annot_lists <- readRDS(file = "zUMIs_output/expression/TSBC07_test_gtf.annotationsSAF.rds");
> head(annot_lists$introns$GeneID)
[1] "ENSG00000223972.5" "ENSG00000223972.5" "ENSG00000223972.5"
[4] "ENSG00000227232.5" "ENSG00000227232.5" "ENSG00000227232.5"

Similarly, it seems OK in the filtered.tagged.Aligned.out.bam.in.featureCounts.bam file:

NS500781:295:H3JC3AFXY:1:11101:1111:11430       16      chr22   35828896        255     1S150M  *       0       0       AAAAAAATTAGCCAGGCGTGGTGGCACGCACCTGTAATCCCCGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCC /EEEEA/A//<EA6EEEE<<A/AA/6/AA/EEAE/<A//<//EEA///EAEE/E<A/A<AEE//EAE<E/A////EEEEEAE/<EEA<<AAE//EEAA</AA/AE/EAEEEEEEEEEAEEAEE/EEEAEE<EEEEEEEEEEEEEAEAAAAA NH:i:1  HI:i:1  AS:i:144        nM:i:2  BC:Z:CGCTGATCATGAGATCCATCTAGTCAGATC     UB:Z:CTCCCGGTTC XS:Z:Assigned   XN:i:1  XT:Z:ENSG00000100320.22
NS500781:295:H3JC3AFXY:1:11101:1111:15628       4       *       0       0       *       *       0       0       GCATAGTAGTCGTACGCCGATGCGAAACATCGGCCACTCTGCTGTCCTTTTTTATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCGCGTATGCCGGCGTCGGCGTGAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGG /AAAAEEEEEAEEAEEEEEEEAEEEEEEEEEEEEE<EEE/AEE<EEEEEEEEEEEEEAEAEEEEEEEE/EEEA<E/EE/EEEE/E/E/EA/EAEE/<////<////A////<//EA<E</<<EEA/////EA/AEEEEEEEAEEEEEEAE/ NH:i:0  HI:i:0  AS:i:20 nM:i:2  uT:A:1  BC:Z:ACAGCAGATCTAAGCCGAGCGATCCAGATC     UB:Z:ATAAAAAAGG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1111:17616       4       *       0       0       *       *       0       0       GATTTGGTATGAGTCGTACGCCGATGCGAAACATCGGCCACTCGTTAGCCGGCTGGGATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCGTGAAAAAAAAAAGGGGGGGGGGGGGGGGGG AAAAAEAEEEEEEEEAEAEEEEE/EEEEEAEEEEEEEEE/AAEE6AE/E/AAEEEEE/EEEAEEEEE<EEE<EEEEEEE<EEAEEEAEEEAAEE/AEEEA////E//A/6/A//A<<<AE//AA<AE////6//<6AAEAEEEEEEEEE<< NH:i:0  HI:i:0  AS:i:17 nM:i:0  uT:A:1  BC:Z:GCTAACGACATACCAACGACGAGCCAGATC     UB:Z:ATCCCAGCCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1111:17698       16      chr17   688000  255     2S149M  *       0       0       TTCCTTCACTGTACATAGAGGAGAGAAGATAGCTATTTGGATTCCTTATAGATTTTCTAAAAATGATGGATTTTAGAGGATACGTGGCAAAGTGTTCCAAATATGCCGCAGACATGTCTCCCTGCCAAAAAAGCGCTATCAAGGTGATATG /A/<</<//<6EE6<//AEE/<A<AAEEE<A6<E<EAA/<6<E/EEEA/A<//////EEA/EEEEE/EEEEEEAE<E/EE/A/E/<6/AEEEE<6A/EAEEEEEEE/AAEAEEEAEEEEEEEEEEAEEE/EE/EEEEE/EEEEAEEAAAAA NH:i:1  HI:i:1  AS:i:143        nM:i:2  BC:Z:CGACACACAAGACGGAAACCGAGACAGATC     UB:Z:CTGTCTGCGC XS:Z:Assigned   XN:i:1  XT:Z:ENSG00000141252.19

However, the ‘reads’ object from the reads2genes function has the concatenated barcodes instead of gene ids for introns:

"RG" "UB" "GE" "ftype"
"1" "AAACATCGACACAGAAAAGACGGACAGATC" "CCCACCCACC" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"2" "AAACATCGACACAGAAAAGACGGACAGATC" "CCGCTCATCC" "ENSG00000243468.5" "exon"
"3" "AAACATCGACACAGAAAAGACGGACAGATC" "CCCGGCGCCC" "ENSG00000100376.11" "exon"
"4" "AAACATCGACACAGAAAAGACGGACAGATC" "CTTACCTCTT" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"5" "AAACATCGACACAGAAAAGACGGACAGATC" "CGCACCAGCC" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"6" "AAACATCGACACAGAAAAGACGGACAGATC" "TCGTATTTCG" "ENSG00000281181.1" "exon"
"7" "AAACATCGACACAGAAAAGACGGACAGATC" "GTGGTTGCTT" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"8" "AAACATCGACACAGAAAAGACGGACAGATC" "CCTCTCCCCC" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"9" "AAACATCGACACAGAAAAGACGGACAGATC" "CGCACCAGCC" "AAACATCGACACAGAAAAGACGGACAGATC   UB" "intron"
"10" "AAACATCGACACAGAAAAGACGGACAGATC" "CTCTCGATCC" "AAACATCGACACAGAAAAGACGGACAGATC  UB" "intron"

Here is the output from “.dgecounts.rds”:

> countMatrix_lists <- readRDS(file = "zUMIs_output/expression/TSBC07_test_gtf.dgecounts.rds");
> dge_umi_exon_all <- as.data.frame(as.matrix(countMatrix_lists$umicount$exon$all));
Loading required package: Matrix
> dge_umi_intron_all <- as.data.frame(as.matrix(countMatrix_lists$umicount$intron$all));
> dge_umi_exon_all[1:6, 1:6];
                   AAACATCGACACAGAAAAGACGGACAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
                   AAACATCGATAGCGACAAGACGGACAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
                   AAACATCGTAGGATGAAGTACAAGCAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
                   AACAACCAAACGCTTAAAACATCGCAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
                   AACAACCAAAGAGATCAACGTGATCAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
                   AACAACCATAGGATGAAACGCTTACAGATC
ENSG00000000419.12                              0
ENSG00000000460.16                              0
ENSG00000001084.10                              0
ENSG00000001461.16                              0
ENSG00000001497.16                              0
ENSG00000001629.9                               0
> dge_umi_intron_all[1:6, 1:6];
                                   AAACATCGACACAGAAAAGACGGACAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                            126
AAACATCGATAGCGACAAGACGGACAGATC\tUB                              0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                              0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                              0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                              0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                              0
                                   AAACATCGATAGCGACAAGACGGACAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                              0
AAACATCGATAGCGACAAGACGGACAGATC\tUB                            160
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                              0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                              0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                              0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                              0
                                   AAACATCGTAGGATGAAGTACAAGCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                              0
AAACATCGATAGCGACAAGACGGACAGATC\tUB                              0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                            355
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                              0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                              0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                              0
                                   AACAACCAAACGCTTAAAACATCGCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                              0
AAACATCGATAGCGACAAGACGGACAGATC\tUB                              0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                              0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                            176
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                              0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                              0
                                   AACAACCAAAGAGATCAACGTGATCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                              0
AAACATCGATAGCGACAAGACGGACAGATC\tUB                              0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                              0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                              0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                           1036
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                              0
                                   AACAACCAAAGAGATCCGACGAGCCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB                              0
AAACATCGATAGCGACAAGACGGACAGATC\tUB                              0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB                              0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB                              0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB                              0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB                            468
>

Could you please check this? Many thanks, Mani

cziegenhain commented 6 years ago

Hey Mani,

yes I am looking into this. So it seems to me that while reading in from the bam file something goes awry. In my Split-seq test the intron table and reads object are as expected.

dge_umi_intron_all[1:6, 1:6];
                AAACATCGACACAGAAACAGCAGA AAACATCGACACAGAATGGTGGTA
ENSG00000000003                        0                        0
ENSG00000000419                        0                        0
ENSG00000000457                        0                        0
ENSG00000000460                        0                        0
ENSG00000000938                        0                        0
ENSG00000000971                        0                        0
                AAACATCGCCTCTATCCACCTTAC AAACATCGCCTCTATCCCAGTTCA
ENSG00000000003                        0                        0
ENSG00000000419                        0                        0
ENSG00000000457                        0                        0
ENSG00000000460                        0                        0
ENSG00000000938                        0                        0
ENSG00000000971                        0                        0
                AACAACCAATCCTGTAAGGCTAAC AACAACCAATCCTGTAGACTAGTA
ENSG00000000003                        0                        0
ENSG00000000419                        0                        2
ENSG00000000457                        0                        2
ENSG00000000460                        0                        1
ENSG00000000938                        0                        0
ENSG00000000971                        0                        0

Now that I look at your verbose (thanks for posting all the information!!), it seems like the field delimiting is not correct. I could imagine that data.table::fread gets thrown off by the . in the read names. To check this further:

Whats your data.table version?
Can you rerun with a GTF file without these dots?

I will try to reconstruct this problem (hope i get around to it this week). maybe its easily fixed by adding , sep="\t" in line 63 of the reading function Since you seem pretty proficient in R, you are also welcome to try that!

Bioinfo1 commented 6 years ago

Hi Christoph,

Thank you very much for the quick reply and the suggestions.

I am using data.table 1.11.5. Indeed, I checked the line 63 as it is where the issue arises. I think this line is fine:

samtools view -x NH -x AS -x nM -x HI -x IH -x NM -x uT -x MD -x jM -x jI -x XN -x XS -@ 1 TSBC07_test_gtf.filtered.tagged.Aligned.out.bam.in.featureCounts.bam | cut -f12,13,14 | grep -F CTGTAGCCGACAGTGCCAGATCTGCAGATC | sed 's/XT:Z://' | less

BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GCCCGCCCCA
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GTTTGCTCTT ENSG00000171988.18
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:TCCTTTCCCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CTGGGATGAT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCTCACCTCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCATGGCACG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCGCCCCGGC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GTTTTATACT ENSG00000123106.10
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCCCCCCCTC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CGGGGTAGGT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GGGCTGTTGT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCGGCCGCTG ENSG00000002822.15
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GGCTTCGGGG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CTGGGCTTGG ENSG00000086200.16
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CCCGCTCCGG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GTCAGGCTTA ENSG00000144229.11
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:TGTTTGTTGC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:GTCCCTCTCG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CTCTTATCAC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CGCCCTGTCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:CTGGGCTTGG ENSG00000086200.16
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC     UB:Z:TCGCTTTCTT ENSG00000021574.11

Unfortunately, adding sep = “\t” didn’t help. I also thought that the Gencode gtf with suffix .xx (Ensembl gene versions) could be the issue. However, the count table for exons looks OK. The issue is only with the intron table. Anyway, I have tested with different gtf files, and they produce the same issue.

Many thanks, Mani

niehu2018 commented 6 years ago

What is your command and config to analysis split-seq data using zUMI? It seems that I have gone through this problem.

Bioinfo1 commented 6 years ago

Hi Christoph,

I think I have traced the source of the issue. Indeed the issue is due to an old version of data.table that was called by the ‘Rscript’ function used to invoke R.

I have multiple R installations on the server and I usually use the installation in my home directory (which I used for debuging - and it looked fine obviously!). I have included the R-3.4.3/bin/R in my $PATH which is fine for invoking R by entering R in the terminal or in the CMD mode. However, since zUMIs uses #!/usr/bin/env Rscript it was calling R from the installation in the system root directories. I have now included R-3.4.3/bin/Rscript in my $PATH, and it seems running OK in the SE mode. I still have errors while trying to use the PE mode (it went fine for a subset of reads I used for testing though!). I will post an update as soon as possible.

Many thanks, Mani

cziegenhain commented 6 years ago

Hey Mani,

okay good it sounds that you are coming closer to the source of the problem. Maybe we can give a short verbose to users which R is starting up in each zUMIs call from the next update. Keep me posted & let me know if we can help more.

sdparekh commented 6 years ago

Hey Mani,

Since you did not get back with the issue, I am assuming you have found a solution. I am closing this issue for now but do not hesitate to re-open it if need be.

Thank you Swati