Closed Bioinfo1 closed 6 years ago
Hey Mani,
Sorry that you are having an issue.
On the first look, there seems something odd with Rsubread mentioning paired end data.
Could you post your yaml file? Maybe that will give us a better idea.
Best, Christoph
Hi Christoph, Thanks very much for examining the issue. Here is the YAML file:
project: TSBC07_BC1_to_BC4
sequence_files:
file1:
name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_R1_001.fastq.gz
base_definition: cDNA(1-151)
file2:
name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_R2_001.fastq.gz
base_definition:
- cDNA(95-151)
- BC(11-18,49-56,87-94)
- UMI(1-10)
file3:
name: /outs/fastq_path/BCL2FastQ/TSBC07_S1_L001_I1_001.fastq.gz
base_definition: BC(1-6)
reference:
STAR_index: /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/
GTF_file: /Genomes/GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf
additional_STAR_params: ''
additional_files: ~
out_dir: /SPLiTseq_zUMI_20180824
num_threads: 24
mem_limit: 96
filter_cutoffs:
BC_filter:
num_bases: 1
phred: 10
UMI_filter:
num_bases: 1
phred: 10
barcodes:
barcode_num: ~
barcode_file: ~
BarcodeBinning: 0
nReadsperCell: 100
counting_opts:
introns: yes
downsampling: '0'
strand: 0
Ham_Dist: 0
velocyto: no
primaryHit: yes
twoPass: yes
make_stats: yes
which_Stage: Filtering
samtools_exec: samtools
pigz_exec: pigz
STAR_exec: STAR
zUMIs_directory: /Apps/zUMIs
read_layout: PE
Many thanks, Mani
Hey,
Your YAML looks pretty good. Would you mind rerunning the whole pipeline without the cDNA definition in the second read? If that works fine we know more where to look closer.
Also, Please find below few lines from the bam file generated by RSubread:
samtools view -h TSBC07_BC1_to_BC4.filtered.tagged.Aligned.out.bam.in.featureCounts.bam | less
@HD VN:1.4 SO:queryname
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
...............................................................
...............................................................
@SQ SN:KI270756.1 LN:79590
@SQ SN:KI270757.1 LN:71251
@PG ID:STAR PN:STAR VN:STAR_2.5.4b CL:STAR --runThreadN 24 --genomeDir /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/ --readFilesType SAM PE --readFilesIn /SPLiTseq_zUMI_20180824/TSBC07
_BC1_to_BC4.filtered.tagged.bam --readFilesCommand samtools view --outFileNamePrefix /SPLiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged. --outSAMtype BAM Unsorted --outSAMunmapped Within --outSAMmult
Nmax 1 --outFilterMultimapNmax 50 --sjdbGTFfile /Genomes/GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf --sjdbOverhang 150 --twopassMode Basic
@CO user command line: STAR --readFilesCommand samtools view --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --genomeDir /Genomes/GRCh38/GencodeR27/Sequence/Indices/zUMI_star_index/ --sjdbGTFfile /Genomes/
GRCh38/GencodeR27/Annotation/gencode_v27_basic_annotation.gtf --runThreadN 24 --readFilesIn /SPLiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged.bam --outFileNamePrefix /SP
LiTseq_zUMI_20180824/TSBC07_BC1_to_BC4.filtered.tagged. --sjdbOverhang 150 --readFilesType SAM PE --twopassMode Basic
NS500781:295:H3JC3AFXY:1:11101:1059:9206 77 * 0 0 * * 0 0 NAGCTGAGATCACGCCACTGCACTCCAGCCTGGGTGACAGAGTGAGACTCTGCCACAAAAACATAATAATAGTAATAGTAAAAAAAAAAAAAAAAAACCATGCGAATGCTCTGCCCCCTGAGGAACGGGGAAACAGGGGTAGTCGTCCCCC #AAAAEEEAEEEEEEAEEEEEEE
EEEEEAEEEEE/EE<AEEEEAAEEEEEEEEA/EEEEAAEEAE/A//EEE</AEEEAAE/EE/AEEEEAEA//A/A////A////6/6//E///////////////E///<//</////////</</A< NH:i:0 HI:i:0 AS:i:78 nM:i:8 uT:A:1 BC:Z:ATCATTCCACCACTGTCAGCGTTACAGATC UB:Z:GAGCATTGGA XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1059:9206 141 * 0 0 * * 0 0 TTTTTTTTTTTTTTTAATATTAATATTATATTGTTTTTTTGGAAGATGATAAATAAT <EA<6/6<A<EEEEE<///6/6/</////////</A<6/6<</66</////6///// NH:i:0 HI:i:0 AS:i:78 nM:i:8 uT:A:1 BC:Z:AT
CATTCCACCACTGTCAGCGTTACAGATC UB:Z:GAGCATTGGA XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1060:17524 77 * 0 0 * * 0 0 NATTAACCTGAGTCGTACGCCGATGCGAAACATCGGCCACTCTCACGGCAGCTGGACCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTAAGCCGTCTTCTGCATGAAAAAAAAAAAGGGGGGGGGGGGGGGGGG #AAAAEEEAE/AEEEAEAEEEEA
EEAEEEEEAEEEEEEAEEE<EE<EAEEEEEEEEEEEE//<A/EEE/AAAEEEAEEEEEAEEEEAE/E<EAEEE/A//6/A<<A//6EAAE/<///<///<AAEEE///</A</A</<//A/<////A< NH:i:0 HI:i:0 AS:i:25 nM:i:2 uT:A:1 BC:Z:CCGTGAGACAGGTTAAGACGAGCGCAGATC UB:Z:GGTCCAGCTG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1060:17524 141 * 0 0 * * 0 0 ATCTAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAGGGGGGGGGGGGG /EA<<EA<<AA6<66<//6/<AAAEA/A<AA<EA/<A/<AEEA</6/E/<6//AAAA NH:i:0 HI:i:0 AS:i:25 nM:i:2 uT:A:1 BC:Z:CC
GTGAGACAGGTTAAGACGAGCGCAGATC UB:Z:GGTCCAGCTG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:13892 77 * 0 0 * * 0 0 NGCACGCTCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGTGTGAACCTGGGACACGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGCGCGACAGACTGAGACTCCATCACAAAAAAAAAAAAAAAGA #/AAAEAEAEA/EEEE/EEA/E/
E/EEEE///AE//AEAEAAEE6EE////A//EE//6/////<EAE//EA6<AE/EA</<<A/A/EEE6/AAEE//EEEEEAEAEEAE/EEEEEA<E</</<//AA/6AA/<AAEE/EEEEEEE/AA// NH:i:0 HI:i:0 AS:i:122 nM:i:10 uT:A:1 BC:Z:AGCACCTCCTGATCTCCGCTGATCCAGAAC UB:Z:CACAAATGTC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:13892 141 * 0 0 * * 0 0 TTTTTTTTTTTTTTTGATATGGAGTCACAATATTTAGCCAAGGAAGGAGTGAAGAGA //6666<A<AEA/<</A//AEE//////<////A////////6//6/<AA</<//// NH:i:0 HI:i:0 AS:i:122 nM:i:10 uT:A:1
BC:Z:AGCACCTCCTGATCTCCGCTGATCCAGAAC UB:Z:CACAAATGTC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1064:19342 83 chr20 34872662 255 149M2S = 34863316 -9495 CCAGAACGGCGTCCTCGTCCTTGCGGCCCCTCCGCGGGGCCGCGGGCGCCGGCGCGTCCTCGGGCAGCCGCGGGAAGCTGGGGATGCTCATGTAGTCCACTGGCGAGTAGGCGCCCAGGGCGCTCTCCTGGCTGGCCTCGTGCTCCGCCCN /EE/E//
/EAE//E//A6///6/6E/6//<///6AA//A<A//<A<</A/AAA/AE/A//AEEEAEEA/EAAEE<E/</E////<EAAA//EEE6EEEEAE<EAAEEEEAEEEEEAAEAEAE//EEEE6EEEEEEEE/EEA/E6EEAAAA# NH:i:1 HI:i:1 AS:i:187 nM:i:8 BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GCCCGCCCCA XS:Z:Unassigned_NoFeatures
NS500781:295:H3JC3AFXY:1:11101:1064:19342 163 chr20 34863316 255 57M = 34872662 9495 GGGTGCCCCGAAGTAGATCTGCATGACCAGCGCCACGGTGCCACCGGTAGCGAAGGT AAE/<E6<6<666///<<AA<EA<</A//</A6A<//A6//<6<A/A/A/<A///</ NH:i:1 HI:i:1 AS:i:187
nM:i:8 BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GCCCGCCCCA XS:Z:Unassigned_NoFeatures
NS500781:295:H3JC3AFXY:1:11101:1065:9143 77 * 0 0 * * 0 0 NGTCTGGGACCAGCAGGCAAAGATGTATAAATATAAGTGCCTGTCTGGAGACAGGAACCTAAAAGCCAGTGGTTCTGAAACTTTAGTGCACATCTTAATCACCTGAAGGCTTGTTAAAATCCTAATTCTGGCCCCCGACAAACATGCTTTG #/AAAEEE6EEEAEE6E/EEEEE
//EAA/E//EEAE/EEAEA<<E<//EEAEE//EEEAAEE/EEA/AEEE/E6EEEEEE<EAE/EEEEEEEEEE/EE</<<<A/EEAA/AA<6<<EEEEEEE////A6///E<</6AAEE/<</AA//// NH:i:0 HI:i:0 AS:i:133 nM:i:4 uT:A:1 BC:Z:ACATTGGCATTGGCTCAACGTGATCAGATC UB:Z:CCGCTATTCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:9143 141 * 0 0 * * 0 0 TTTTTTTTTTTTATAAAAAAATAAAAACAAACAAAACAAAAAAAATAATGTATGAGC 6A/6//6<EE///<//////6//A/<A/A6///<A/</A/6///66//////////6 NH:i:0 HI:i:0 AS:i:133 nM:i:4 uT:A:1
BC:Z:ACATTGGCATTGGCTCAACGTGATCAGATC UB:Z:CCGCTATTCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:15685 77 * 0 0 * * 0 0 NCCTTGATGAGTCGTACGCCGATGCGAAACATCGGCCACTAGAACACGCGGCCGACAAGATCGGAAGAGCACACTTCTGAACTCCAGTCACCAGAACAGCTCGGATGCCGTCTTCTGCTGTCAAAAAAAAACGGGGGGGGGGGGGGGGGGG #/A/A6/AA///EAEEAAEEE//
EEEEEEE6EE/EE/<EEAEEAA//AAE//E//AAAEEEEEEEEEA/A<EAE/EE<EAEEEEEE/AEAE</EE/E////<////A/A////A//<///</<6A<<EE////E//E/EEEEEEEEEAA/E NH:i:0 HI:i:0 AS:i:27 nM:i:5 uT:A:1 BC:Z:GTGTTCTACATCAAGTACGAGCGACAGATC UB:Z:TGACTGCCGC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1065:15685 141 * 0 0 * * 0 0 TCTAGTGTAGATCTCGGTGGACGCCGTATCATTAAAAAAAAAAAGGGGGGGGGGGGG /<////6//</6//66/A/</AA/AA<<<E/<A6/////6/<A//EAE<///<///A NH:i:0 HI:i:0 AS:i:27 nM:i:5 uT:A:1 BC:Z:GT
GTTCTACATCAAGTACGAGCGACAGATC UB:Z:TGACTGCCGC XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1066:12877 99 chr11 65182840 255 1S129M159N21M = 65186228 3445 NGTGCGATGCCTGCAGAGTGGGACCCTCTTCCGTGATGAGGCCTTCCCCCCGGTACCCCAGAGCCTGGGTTACACGGACCTGGGTCCCAATTCCTCCAAGACCTATGGCATCATGTGGAAGCGTCCCACGGAACTGCTGTCAAACCCCCAG
Many thanks, Mani
Thank you.
I will run the pipeline without the cDNA definition in the second read and update you.
Cool thank you! The reads actually look as they should as far as I can tell. Could you also check the STAR final log of this PE run and the one without second cDNA read?
Thank you.
I am re-running the pipeline in the SE mode (without cDNA definition on one read). I will update you the outcome as it finishes.
Here is the STAR final log for the PE run:
Started job on | Aug 24 11:00:27
Started mapping on | Aug 24 11:07:15
Finished on | Aug 24 11:08:50
Mapping speed, Million of reads per hour | 140.17
Number of input reads | 3698967
Average input read length | 208
UNIQUE READS:
Uniquely mapped reads number | 694529
Uniquely mapped reads % | 18.78%
Average mapped length | 168.66
Number of splices: Total | 50072
Number of splices: Annotated (sjdb) | 44925
Number of splices: GT/AG | 40642
Number of splices: GC/AG | 2603
Number of splices: AT/AC | 65
Number of splices: Non-canonical | 6762
Mismatch rate per base, % | 1.96%
Deletion rate per base | 0.05%
Deletion average length | 1.42
Insertion rate per base | 0.02%
Insertion average length | 1.16
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 341923
% of reads mapped to multiple loci | 9.24%
Number of reads mapped to too many loci | 756
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 71.90%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00
Many thanks, Mani
Hi Christoph,
In the SE mode, the pipeline ran without errors. The last few lines of the terminal output are:
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 414 barcodes in this chunk..."
[1] "2018-08-24 23:35:09 BST"
[1] "I am done!! Look what I produced.../SPLiTseq_zUMI_20180824/zUMIs_output/"
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4958931 264.9 9968622 532.4 9968622 532.4
Vcells 12865137 98.2 29706687 226.7 57985435 442.4
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2018-08-24 23:35:11 BST"
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
^MRead 53.3% of 3698967 rows^MRead 79.5% of 3698967 rows^MRead 3698967 rows and 3 (of 3) columns from 0.179 GB file in 00:00:04
Warning message:
In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1414295 75.6 4703850 251.3 4703850 251.3
Vcells 2797433 21.4 47254341 360.6 49550965 378.1
Please see below the STAR final log in the SE mode:
Started job on | Aug 24 23:24:39
Started mapping on | Aug 24 23:30:58
Finished on | Aug 24 23:32:16
Mapping speed, Million of reads per hour | 170.72
Number of input reads | 3698967
Average input read length | 151
UNIQUE READS:
Uniquely mapped reads number | 1059890
Uniquely mapped reads % | 28.65%
Average mapped length | 142.63
Number of splices: Total | 64795
Number of splices: Annotated (sjdb) | 55839
Number of splices: GT/AG | 50728
Number of splices: GC/AG | 3307
Number of splices: AT/AC | 99
Number of splices: Non-canonical | 10661
Mismatch rate per base, % | 2.33%
Deletion rate per base | 0.05%
Deletion average length | 1.53
Insertion rate per base | 0.03%
Insertion average length | 1.24
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 363704
% of reads mapped to multiple loci | 9.83%
Number of reads mapped to too many loci | 119
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 61.46%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Many thanks, Mani
Hey Mani,
sorry for the slow answer! That looks like there is some bug in the PE read counting. I will run some test data today on my side to find out more.
Apart from that it looks like in your case the addition of the mate makes things worse anyway, so maybe for this dataset I would recommend you SE analysis anyway! Since you have very long reads, I would also recommend to clip poly-A tails with star:
additional_STAR_params: --clip3pAdapterSeq AAAAAAAAAAA
I will update here once I have found the bug!
Best, Christoph
Hi Christoph,
Thank you. Indeed, trimming of reads might benefit.
One more thing I previously noted was that the pipeline fell through when I used _HamDist: 1. I don't know the error was related to the PE mode, but I will test it again in the SE mode. It would be good if you could please check that as well.
I appreciate your help very much.
Many thanks, Mani
Hey Mani,
Both the PE & Hamming distance tests ran through fine for me.
Here is the log of the PE data:
Filtering...
Mapping...
Aug 25 19:10:23 ..... started STAR run
Aug 25 19:10:23 ..... loading genome
Aug 25 19:14:05 ..... processing annotations GTF
Aug 25 19:14:32 ..... inserting junctions into the genome indices
Aug 25 19:18:44 ..... started mapping
Aug 25 20:12:18 ..... finished successfully
Counting...
[1] "2018-08-25 20:12:36 CEST"
[1] "31 barcodes detected."
[1] "5467393 reads were assigned to barcodes that do not correspond to intact cells."
[1] "2.25e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/mnt/kauffman/chrisz/resources/genomes/Human/Homo_sapiens.GRCh38.91.chr.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 1.30.3
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| P PE.filtered.tagged.Aligned.out.bam ||
|| ||
|| Output file : .Rsubread_featureCounts_pid18423 ||
|| Summary : .Rsubread_featureCounts_pid18423.summary ||
|| Annotation : .Rsubread_UserProvidedAnnotation_pid18423 (SAF) ||
|| Assignment details : <input_file>.featureCounts.bam ||
|| (Note that files are saved to the output directory) ||
|| ||
|| Dir for temp files : . ||
|| Threads : 24 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multiple alignments : primary alignment only ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : not counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid18423 ... ||
|| Features : 1199596 ||
|| Meta-features : 49558 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file PE.filtered.tagged.Aligned.out.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| Total fragments : 258524806 ||
|| Successfully assigned fragments : 89699733 (34.7%) ||
|| Running time : 3.70 minutes ||
|| ||
|| Read assignment finished. ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
[bam_sort_core] merging from 72 files and 24 in-memory blocks...
[1] "Assigning reads to features (in)"
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 1.30.3
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| P PE.filtered.tagged.Aligned.out.bam ||
|| ||
|| Output file : .Rsubread_featureCounts_pid18423 ||
|| Summary : .Rsubread_featureCounts_pid18423.summary ||
|| Annotation : .Rsubread_UserProvidedAnnotation_pid18423 (SAF) ||
|| Assignment details : <input_file>.featureCounts.bam ||
|| (Note that files are saved to the output directory) ||
|| ||
|| Dir for temp files : . ||
|| Threads : 24 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multiple alignments : primary alignment only ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : not counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid18423 ... ||
|| Features : 275839 ||
|| Meta-features : 28235 ||
|| Chromosomes/contigs : 24 ||
|| ||
|| Process BAM file PE.filtered.tagged.Aligned.out.bam... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| Total fragments : 258524806 ||
|| Successfully assigned fragments : 30945865 (12.0%) ||
|| Running time : 4.70 minutes ||
|| ||
|| Read assignment finished. ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
[bam_sort_core] merging from 72 files and 24 in-memory blocks...
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 2"
[1] "Processing 24 barcodes in this chunk..."
[1] "Working on barcode chunk 2 out of 2"
[1] "Processing 7 barcodes in this chunk..."
[1] "2018-08-25 23:02:31 CEST"
[1] "I am done!! Look what I produced.../mnt/kauffman/chrisz/test_zUMIs/PE//zUMIs_output/"
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 5169763 276.1 12002346 641.0 12002346 641.0
Vcells 852538194 6504.4 3937035931 30037.3 4921272378 37546.4
Would you like to upload a million reads of your test data for me to test on?
Best,
Christoph
Hi Christoph,
Thanks very much for testing the PE mode and the hamming distance options. Good to know that the tests went fine. I don't know if the version differences in Rsubread could cause the issue. Between versions 1.28 and 1.30, the following new features have been added in Rsubread:
o New parameters in featureCounts() to give more control on the size of overlap between read and feature - 'nonOverlap' and 'nonOverlapFeature'.
o New parameter in featureCounts() to specify the path where files containing counting results for each read are saved - 'reportReadsPath'.
o New parameter in align() and subjunc() to allow reads in the mapping output to have the same order as in the FASTQ file - 'keepReadOrder'.
Unless you use these options, it shouldn't make any difference. I am currently using R 3.4.3 and Bioconductor 3.6. To update Rsubread, I should update R and Bioconductor. I will test if this will make any difference.
I will send you a million reads after obtaining necessary approval.
Many thanks, Mani
Good! We don't use any of these new options but sometimes there can be small updates that don't end up in the changelog. To our knowledge 1.28.1 should be fine though.
Best, Christoph
Hi Christoph,
I am still waiting for approval to share a million reads with you. I will update you as soon as I can.
Meanwhile, there seems to be an issue with the reads2genes function. The count matrix generated by the pipeline for introns has the concatenated barcodes as the gene ids instead of the actual gene id. The exons count matrix has the correct gene ids.
I tried to debug the issue and found that splitting of the annotation into exons and introns “Get non-overlapping Introns/Exons" - (*.annotationsSAF.rds) seems OK:
> annot_lists <- readRDS(file = "zUMIs_output/expression/TSBC07_test_gtf.annotationsSAF.rds");
> head(annot_lists$introns$GeneID)
[1] "ENSG00000223972.5" "ENSG00000223972.5" "ENSG00000223972.5"
[4] "ENSG00000227232.5" "ENSG00000227232.5" "ENSG00000227232.5"
Similarly, it seems OK in the filtered.tagged.Aligned.out.bam.in.featureCounts.bam file:
NS500781:295:H3JC3AFXY:1:11101:1111:11430 16 chr22 35828896 255 1S150M * 0 0 AAAAAAATTAGCCAGGCGTGGTGGCACGCACCTGTAATCCCCGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCC /EEEEA/A//<EA6EEEE<<A/AA/6/AA/EEAE/<A//<//EEA///EAEE/E<A/A<AEE//EAE<E/A////EEEEEAE/<EEA<<AAE//EEAA</AA/AE/EAEEEEEEEEEAEEAEE/EEEAEE<EEEEEEEEEEEEEAEAAAAA NH:i:1 HI:i:1 AS:i:144 nM:i:2 BC:Z:CGCTGATCATGAGATCCATCTAGTCAGATC UB:Z:CTCCCGGTTC XS:Z:Assigned XN:i:1 XT:Z:ENSG00000100320.22
NS500781:295:H3JC3AFXY:1:11101:1111:15628 4 * 0 0 * * 0 0 GCATAGTAGTCGTACGCCGATGCGAAACATCGGCCACTCTGCTGTCCTTTTTTATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCGCGTATGCCGGCGTCGGCGTGAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGG /AAAAEEEEEAEEAEEEEEEEAEEEEEEEEEEEEE<EEE/AEE<EEEEEEEEEEEEEAEAEEEEEEEE/EEEA<E/EE/EEEE/E/E/EA/EAEE/<////<////A////<//EA<E</<<EEA/////EA/AEEEEEEEAEEEEEEAE/ NH:i:0 HI:i:0 AS:i:20 nM:i:2 uT:A:1 BC:Z:ACAGCAGATCTAAGCCGAGCGATCCAGATC UB:Z:ATAAAAAAGG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1111:17616 4 * 0 0 * * 0 0 GATTTGGTATGAGTCGTACGCCGATGCGAAACATCGGCCACTCGTTAGCCGGCTGGGATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCGTGAAAAAAAAAAGGGGGGGGGGGGGGGGGG AAAAAEAEEEEEEEEAEAEEEEE/EEEEEAEEEEEEEEE/AAEE6AE/E/AAEEEEE/EEEAEEEEE<EEE<EEEEEEE<EEAEEEAEEEAAEE/AEEEA////E//A/6/A//A<<<AE//AA<AE////6//<6AAEAEEEEEEEEE<< NH:i:0 HI:i:0 AS:i:17 nM:i:0 uT:A:1 BC:Z:GCTAACGACATACCAACGACGAGCCAGATC UB:Z:ATCCCAGCCG XS:Z:Unassigned_Unmapped
NS500781:295:H3JC3AFXY:1:11101:1111:17698 16 chr17 688000 255 2S149M * 0 0 TTCCTTCACTGTACATAGAGGAGAGAAGATAGCTATTTGGATTCCTTATAGATTTTCTAAAAATGATGGATTTTAGAGGATACGTGGCAAAGTGTTCCAAATATGCCGCAGACATGTCTCCCTGCCAAAAAAGCGCTATCAAGGTGATATG /A/<</<//<6EE6<//AEE/<A<AAEEE<A6<E<EAA/<6<E/EEEA/A<//////EEA/EEEEE/EEEEEEAE<E/EE/A/E/<6/AEEEE<6A/EAEEEEEEE/AAEAEEEAEEEEEEEEEEAEEE/EE/EEEEE/EEEEAEEAAAAA NH:i:1 HI:i:1 AS:i:143 nM:i:2 BC:Z:CGACACACAAGACGGAAACCGAGACAGATC UB:Z:CTGTCTGCGC XS:Z:Assigned XN:i:1 XT:Z:ENSG00000141252.19
However, the ‘reads’ object from the reads2genes function has the concatenated barcodes instead of gene ids for introns:
"RG" "UB" "GE" "ftype"
"1" "AAACATCGACACAGAAAAGACGGACAGATC" "CCCACCCACC" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"2" "AAACATCGACACAGAAAAGACGGACAGATC" "CCGCTCATCC" "ENSG00000243468.5" "exon"
"3" "AAACATCGACACAGAAAAGACGGACAGATC" "CCCGGCGCCC" "ENSG00000100376.11" "exon"
"4" "AAACATCGACACAGAAAAGACGGACAGATC" "CTTACCTCTT" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"5" "AAACATCGACACAGAAAAGACGGACAGATC" "CGCACCAGCC" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"6" "AAACATCGACACAGAAAAGACGGACAGATC" "TCGTATTTCG" "ENSG00000281181.1" "exon"
"7" "AAACATCGACACAGAAAAGACGGACAGATC" "GTGGTTGCTT" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"8" "AAACATCGACACAGAAAAGACGGACAGATC" "CCTCTCCCCC" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"9" "AAACATCGACACAGAAAAGACGGACAGATC" "CGCACCAGCC" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
"10" "AAACATCGACACAGAAAAGACGGACAGATC" "CTCTCGATCC" "AAACATCGACACAGAAAAGACGGACAGATC UB" "intron"
Here is the output from “.dgecounts.rds”:
> countMatrix_lists <- readRDS(file = "zUMIs_output/expression/TSBC07_test_gtf.dgecounts.rds");
> dge_umi_exon_all <- as.data.frame(as.matrix(countMatrix_lists$umicount$exon$all));
Loading required package: Matrix
> dge_umi_intron_all <- as.data.frame(as.matrix(countMatrix_lists$umicount$intron$all));
> dge_umi_exon_all[1:6, 1:6];
AAACATCGACACAGAAAAGACGGACAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
AAACATCGATAGCGACAAGACGGACAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
AAACATCGTAGGATGAAGTACAAGCAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
AACAACCAAACGCTTAAAACATCGCAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
AACAACCAAAGAGATCAACGTGATCAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
AACAACCATAGGATGAAACGCTTACAGATC
ENSG00000000419.12 0
ENSG00000000460.16 0
ENSG00000001084.10 0
ENSG00000001461.16 0
ENSG00000001497.16 0
ENSG00000001629.9 0
> dge_umi_intron_all[1:6, 1:6];
AAACATCGACACAGAAAAGACGGACAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 126
AAACATCGATAGCGACAAGACGGACAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC\tUB 160
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 355
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 176
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 1036
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC
AAACATCGACACAGAAAAGACGGACAGATC\tUB 0
AAACATCGATAGCGACAAGACGGACAGATC\tUB 0
AAACATCGTAGGATGAAGTACAAGCAGATC\tUB 0
AACAACCAAACGCTTAAAACATCGCAGATC\tUB 0
AACAACCAAAGAGATCAACGTGATCAGATC\tUB 0
AACAACCAAAGAGATCCGACGAGCCAGATC\tUB 468
>
Could you please check this? Many thanks, Mani
Hey Mani,
yes I am looking into this. So it seems to me that while reading in from the bam file something goes awry. In my Split-seq test the intron table and reads object are as expected.
dge_umi_intron_all[1:6, 1:6];
AAACATCGACACAGAAACAGCAGA AAACATCGACACAGAATGGTGGTA
ENSG00000000003 0 0
ENSG00000000419 0 0
ENSG00000000457 0 0
ENSG00000000460 0 0
ENSG00000000938 0 0
ENSG00000000971 0 0
AAACATCGCCTCTATCCACCTTAC AAACATCGCCTCTATCCCAGTTCA
ENSG00000000003 0 0
ENSG00000000419 0 0
ENSG00000000457 0 0
ENSG00000000460 0 0
ENSG00000000938 0 0
ENSG00000000971 0 0
AACAACCAATCCTGTAAGGCTAAC AACAACCAATCCTGTAGACTAGTA
ENSG00000000003 0 0
ENSG00000000419 0 2
ENSG00000000457 0 2
ENSG00000000460 0 1
ENSG00000000938 0 0
ENSG00000000971 0 0
Now that I look at your verbose (thanks for posting all the information!!), it seems like the field delimiting is not correct. I could imagine that data.table::fread gets thrown off by the . in the read names. To check this further:
I will try to reconstruct this problem (hope i get around to it this week).
maybe its easily fixed by adding , sep="\t"
in line 63 of the reading function
Since you seem pretty proficient in R, you are also welcome to try that!
Hi Christoph,
Thank you very much for the quick reply and the suggestions.
I am using data.table 1.11.5. Indeed, I checked the line 63 as it is where the issue arises. I think this line is fine:
samtools view -x NH -x AS -x nM -x HI -x IH -x NM -x uT -x MD -x jM -x jI -x XN -x XS -@ 1 TSBC07_test_gtf.filtered.tagged.Aligned.out.bam.in.featureCounts.bam | cut -f12,13,14 | grep -F CTGTAGCCGACAGTGCCAGATCTGCAGATC | sed 's/XT:Z://' | less
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GCCCGCCCCA
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GTTTGCTCTT ENSG00000171988.18
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:TCCTTTCCCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CTGGGATGAT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCTCACCTCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCATGGCACG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCGCCCCGGC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GTTTTATACT ENSG00000123106.10
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCCCCCCCTC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CGGGGTAGGT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GGGCTGTTGT
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCGGCCGCTG ENSG00000002822.15
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GGCTTCGGGG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CTGGGCTTGG ENSG00000086200.16
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CCCGCTCCGG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GTCAGGCTTA ENSG00000144229.11
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:TGTTTGTTGC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:GTCCCTCTCG
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CTCTTATCAC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CGCCCTGTCC
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:CTGGGCTTGG ENSG00000086200.16
BC:Z:CTGTAGCCGACAGTGCCAGATCTGCAGATC UB:Z:TCGCTTTCTT ENSG00000021574.11
Unfortunately, adding sep = “\t” didn’t help. I also thought that the Gencode gtf with suffix .xx (Ensembl gene versions) could be the issue. However, the count table for exons looks OK. The issue is only with the intron table. Anyway, I have tested with different gtf files, and they produce the same issue.
Many thanks, Mani
What is your command and config to analysis split-seq data using zUMI? It seems that I have gone through this problem.
Hi Christoph,
I think I have traced the source of the issue. Indeed the issue is due to an old version of data.table that was called by the ‘Rscript’ function used to invoke R.
I have multiple R installations on the server and I usually use the installation in my home directory (which I used for debuging - and it looked fine obviously!). I have included the R-3.4.3/bin/R
in my $PATH
which is fine for invoking R by entering R
in the terminal or in the CMD mode. However, since zUMIs uses #!/usr/bin/env Rscript
it was calling R from the installation in the system root directories. I have now included R-3.4.3/bin/Rscript
in my $PATH
, and it seems running OK in the SE mode. I still have errors while trying to use the PE mode (it went fine for a subset of reads I used for testing though!). I will post an update as soon as possible.
Many thanks, Mani
Hey Mani,
okay good it sounds that you are coming closer to the source of the problem. Maybe we can give a short verbose to users which R is starting up in each zUMIs call from the next update. Keep me posted & let me know if we can help more.
Hey Mani,
Since you did not get back with the issue, I am assuming you have found a solution. I am closing this issue for now but do not hesitate to re-open it if need be.
Thank you Swati
Hi, Thanks for the zUMIs pipeline. I am using the current version (version 2.0.4) to analyze a SPLiT-Seq dataset, and it is giving the following error during the counting stage:
Error in sample.int(length(x), size, replace, prob) : vector size cannot be NA Calls: collectCounts ... [.data.table -> sample -> sample -> -> sample.int
In addition: There were 31 warnings (use warnings() to see them)
Execution halted.
Here is the full terminal output:
It looks there are issues in executing the collectCounts function or in its input (reads) from the preceding function reads2genes. Any ideas?
Many thanks, Mani