Duplicated sequence name error

ohlerlab / RiboseQC

Pipeline for Quality Control of Ribo-Seq data, selection of P-site offsets, and codon usage statistics.

15 stars 14 forks source link

Duplicated sequence name error #14

Open alexamlie opened 3 years ago

alexamlie commented 3 years ago

I am using Ribo-seQC (great package!) on human ribosome sequencing data against Gencode v35 annotations, generated following the instructions. When I run the main RiboseQC_analysis function, the annotation file loading and read length extraction runs successfully, but in the output following 'analyzing BAM file', I get this error:

Error in asMethod(object) : names of RleList object to coerce to GRanges cannot contain duplicated sequence names Calls: RiboseQC_analysis ... .reduceByYield_iterate -> REDUCE -> GRanges -> as -> asMethod In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

I cannot figure out where this error is coming from, besides that it is raised by the GenomicRanges package. Has anybody run into this before, or any idea where it came from? The bam file itself does not have any duplicated read names in it, so the 'sequence names' must refer to something else..

zslastman commented 3 years ago

Hi Alex, thanks for using the package. I'm not sure if this is due to novel annotation/bam files, or if a Bioconductor update has broken something. It's hard for me to debug without knowing where exactly in the code it's coming from - it would be helpful if you could show me the output of traceback(), as well as your sessionInfo().

I'd also be curious to know if this is specific to the annotation you're using - you could try with gencode 18 (e.g.) and see if the error still occurs. I'd also be curious to know if it's dependent on the content of the bam file. I.e. does it occur when you subset the bam file to a single chromosome name.

alexamlie commented 3 years ago

Hi, thanks for the prompt response. This was happening on both GENCODE v29 and v35 annotation files, but when I tried your suggestion of subsetting the bam files to a single chromosome it ran successfully. Then I realized I had done this mapping to hg38 including all the scaffold annotations (the GL and KI and KV and all those unassembled contigs), and when I reran it on the canonical chromosomes only, it worked!

So I guess even though these reads were unique mappers and deduplicated, there might be some duplication in the annotated chromosomes themselves. Weirdly enough I'm still using the same GENCODE annotations which don't have anything on those unassembled contigs, but I guess having them in the read files was enough to throw that GRanges error.

zslastman commented 3 years ago

Aha, thanks for letting me know how it turned out! I wonder if we trimmed chromosome names somewhere in such a way as to make scaffolds redundantly named, our tests all used the canonicals for speeds sake. Are there actual sequence duplications in your bam file? samtools view -H $samfile | grep SQ | sort | uniq