sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Empty aligned bam #129

Closed cpchaney closed 5 years ago

cpchaney commented 5 years ago

Describe the bug
Alignment is unsuccessful: `humphreys_human_snuc.filtered.tagged.Aligned.out.bam` contains no sequences (`samtools view humphreys_human_snuc.filtered.tagged.Aligned.out.bam | head -n 10` produces no output).

To Reproduce
Provide the YAML file and full standard output.

YAML:

```yaml
project: humphreys_human_snuc
sequence_files:
  file1:
    name: /obfuscated/Documents/ncbi/public/fastq/SRR7648415_1.fastq.paired.fq.gz
    base_definition:
```

stdout:

```
[W::bam_merge_core2] No @HD tag found.
[E::sam_parse1] unrecognized type 7
[W::sam_read1] Parse error at line 198
[main_samview] truncated file.
Error in (function (cl, name, valueClass)  :
  assignment of an object of class “numeric” is not valid for @‘Dim’ in an object of class “dgTMatrix”; is(value, "integer") is not TRUE
Calls: convert2countM -> .makewide -> ->
Execution halted
Warning message:
In fread(gtf, select = 1:2, header = F) :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/obfuscated/Research/RBK/output/humphreys_human_snuc/zUMIs_output/expression/humphreys_human_snuc.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
```



Additional context The dataset was generated by sequencing human single nuclei. The data are available from GEO under the accession GSE118184. The libraries were prepared for sequencing using the 10X Chromium system. The FASTQ files were downloaded from GEO with fastq-dump and preprocessed with fastq_pair to make sure that all reads have a mate.
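As context for the preprocessing step, the pairing check that fastq_pair performs can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of zUMIs or fastq_pair): it compares the first whitespace-delimited token of each `@` header line in R1 and R2.

```python
import gzip
from itertools import islice

def read_ids(fastq_gz):
    """Yield the read ID (first token of the '@' header, without the '@')
    for every 4-line record in a gzipped FASTQ file."""
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:
                yield line[1:].split()[0]

def pairing_ok(r1_gz, r2_gz, n=1000):
    """Spot-check that the first n read IDs of R1 and R2 match in order."""
    pairs = islice(zip(read_ids(r1_gz), read_ids(r2_gz)), n)
    return all(a == b for a, b in pairs)
```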

The following additional observations might be informative.

```
Aug 05 07:51:42 ..... started STAR run
Aug 05 07:51:45 ..... loading genome
Aug 05 07:56:09 ..... processing annotations GTF
Aug 05 07:56:39 ..... inserting junctions into the genome indices
Aug 05 08:01:01 ..... started 1st pass mapping
Aug 05 08:01:01 ..... finished 1st pass mapping
Aug 05 08:01:02 ..... inserting junctions into the genome indices
Aug 05 08:02:45 ..... started mapping
Aug 05 08:02:46 ..... finished successfully
```

cziegenhain commented 5 years ago

Hey,

Sorry that you are encountering issues!

It looks like none of the reads come out of the filtering step, hence the failures and the instant mapping of 0 reads downstream. Could you check the content of the following output files:

- `*.BCstats.txt`
- `*.filtered.tagged.unmapped.bam`

This should tell us more about how the errors come about.

It could be that something is odd with the fastq files; others and I have used data from GEO before, but you never know with public data. (If you could also send me a head of the fastq files, it would help to see whether the read ID line contains something unexpected!)

Regarding the read_layout: SE is correct as there is only one cDNA read so zUMIs detects that ;-)

Best, Christoph

PS: I changed the title of the issue so it's a bit more traceable.

cpchaney commented 5 years ago

Christoph,

Here is the requested information. Thanks for everything.

Cheers,

Chris

```
$ wc -l *.BCstats.txt
3800921 *.BCstats.txt
```

```
$ head -n 10 *.BCstats.txt
GCGCGATGTATCGCAT 1
GAAATGACATTCTTAC 29
TATCAGACATCAGTAC 1
TTAGGCATCAACGGGA 28
GCTGCTTGTCCCGACA 2
ATCTGCCTCTCAGATT 2
GTAGGGTGTAGCTTGT 2
CCATTCAAGCGTGAGT 1
CGGTTAACAGACAGGT 1
GACCAATGTAACGTTC 25
```

```
$ ls -atl *.filtered.tagged.unmapped.bam
-rw-r--r-- 1 6982881681 Aug 5 07:58 *.filtered.tagged.unmapped.bam
```

```
$ samtools view *.filtered.tagged.unmapped.bam | head -n 2
SRR7648415.6 SN997:991:H3HKKBCX2:2:1101:1670:2140 4 0 0 * 0 0 ATCATGGCCACAAGGGTGCTTGTGTCACCCTATCCCCAGCTTCAGGTGGCTCAGTACAGAGAGAGAAACTTTTGTTTGGGGAACAAGAGTCTGCCCGG <DB@@<1<11<C0DGD@HHHIIICCHCG?FHGHHHHG1CFH1FHHHHIHHCF1CG@GHG1C1C<0C1CHHEG?HGE@1DEH0<D1<FG1DGEE1DGHC BC:Z:TGGCCAGCATCTCATA UB:Z:GCGTCATTTT
SRR7648415.11 SN997:991:H3HKKBCX2:2:1101:1981:2197 4 0 0 * 0 0 GAAATCCATGACGCAGGGAGAATTGCGTCATTTAAAGCCTAGTTAACGCATTTACTAAAAGCAGACGAAAAAGGAAAGATTAATTGGGAGAGAAAGGA 0<<<?CF@1<GH00<./<CE=G@1110<<E<<FEH@1C1<CE<1CF10/<0<1<<1D<F111<1<1<C/<?1<0<<F1<<1DH@1111D<0<11<1<< BC:Z:AATCGGTCAAGCCGCT UB:Z:GGCGTTTTTT
```

```
$ zcat ~/Documents/ncbi/public/fastq/SRR7648415_1.fastq.paired.fq.gz | head -10
@SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=26
NTTCGGGGTTCGGCACCAACCGGCGT
+SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=26
<DDDHIIIHIIIHHIIIHHHHIIIH
@SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=26
NTTCGGGGTTCGGCACCAACCGGCGT
+SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=26
<DDDHIIIHIIIHHIIIHHHHIIIH
@SRR7648415.3 SN997:991:H3HKKBCX2:2:1101:1446:2119 length=26
NGTGAGGCAGGACGTAGCGATAAGGG
```

```
$ zcat ~/Documents/ncbi/public/fastq/SRR7648415_2.fastq.paired.fq.gz | head -10
@SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=98
AGGAAGGGCTGAGGGTTCGNNCNNTNNNNNNNNCCCNNNNNNNNNNCGNNNNNNNNNNNNNNNNNNNNNNNCCNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR7648415.1 SN997:991:H3HKKBCX2:2:1101:1142:2047 length=98
D@B@<F0<C1E?GHHECGH##<##1########<<<##########<<#######################<<#########################
@SRR7648415.2 SN997:991:H3HKKBCX2:2:1101:1176:2049 length=98
GTCTCTGTTAACTCTATAGGAAGCTTTTCTTTGGCGAATCAGTGTCACAAAATAACTCCAGAAAGAAGCACTTAGCGTGCTGTTCCTGTGGAAAAACT
+SRR7648415.2 SN997:991:H3HKKBCX2:2:1101:1176:2049 length=98
<B0B<F1<<FH?0<<CDF1<CHHE11111<<<D11<C//<D1<1<<F@F1DHHIG@1C@HFEHH@HIH?<DGHE1@C0<@C1C?111<<1<FEHIIIC
@SRR7648415.3 SN997:991:H3HKKBCX2:2:1101:1446:2119 length=98
AACTGCCCTAGCCCACTTCTTACCACAAGGCACACCTACACCCATTATCCCCATACCAGTTCTTATAGACACCCTCAGCCTACTCATTCAACCAATAA
```

cziegenhain commented 5 years ago

Okay, good, that's encouraging! It seems like the barcode extraction works and reads are being written out prior to mapping.

I have experienced problems with these long read IDs from fastq-dump before. Can you rerun the dumping and extract only the original read IDs? These should be the correct flags to use: `--origfmt --defline-qual '+'`
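If re-dumping the SRA files is slow, the long headers can also be trimmed from the existing FASTQ files. A minimal Python sketch (a hypothetical helper for illustration, not a zUMIs or SRA Toolkit feature) that keeps only the first whitespace-delimited token on the `@` and `+` lines:

```python
import gzip

def strip_fastq_headers(in_gz, out_gz):
    """Rewrite a gzipped FASTQ so that header ('@') and separator ('+')
    lines keep only their first whitespace-delimited token, dropping the
    long fastq-dump suffix that can end up in BAM read names and break
    SAM parsing."""
    with gzip.open(in_gz, "rt") as src, gzip.open(out_gz, "wt") as dst:
        for i, line in enumerate(src):
            if i % 4 in (0, 2):          # '@' header and '+' separator
                dst.write(line.split()[0] + "\n")
            else:                        # sequence and quality lines
                dst.write(line)
```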

Also, could you count the number of lines in `*.filtered.tagged.unmapped.bam` and check against the number of reads in the fastq? Since reads are filtered on BC & UMI sequence quality, it should be fewer reads than you start with, but not dramatically so!

cpchaney commented 5 years ago

I'm dumping the fastq files with the proposed flags.

```
$ zcat ~/Documents/ncbi/public/fastq/SRR7648415_2.fastq.paired.fq.gz | wc -l
362367408

$ samtools view *.filtered.tagged.unmapped.bam | wc -l
70159330
```

Thus, *.filtered.tagged.unmapped.bam contains 70159330/(362367408/4) = 77% as many sequences as there are reads in the fastq.
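The arithmetic behind that percentage (each FASTQ record is four lines):

```python
fastq_lines = 362367408
fastq_reads = fastq_lines // 4           # 4 lines per FASTQ record -> 90591852 reads
bam_reads = 70159330                     # records in *.filtered.tagged.unmapped.bam
print(f"{bam_reads / fastq_reads:.0%}")  # -> 77%
```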

cziegenhain commented 5 years ago

Excellent! That sounds reasonable. I am fairly confident the newly dumped files should get rid of your issue, so fingers crossed!

Note: if you want to retain more reads, you could be less stringent in the filtering by changing the parameters under `filter_cutoffs` in the YAML.
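For reference, the `filter_cutoffs` block of the zUMIs YAML looks roughly like this (values here are illustrative, not a recommendation; `num_bases` is the number of bases allowed below the `phred` threshold, so raising `num_bases` or lowering `phred` relaxes the filter):

```yaml
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
```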

cziegenhain commented 5 years ago

Since you didn't get back, I am assuming your issue was solved. Feel free to reopen if you need further assistance.