ohlerlab / RiboseQC

Pipeline for Quality Control of Ribo-Seq data, selection of P-site offsets, and codon usage statistics.
15 stars 14 forks source link

Calculating P-sites fails #15

Open imilenkovic opened 3 years ago

imilenkovic commented 3 years ago

Hello,

I tried running RiboseQC on bam files obtained with another RiboSeq analysis pipeline called riboflow.

The analysis fails at the P-site step:

Calculating P-sites offsets ... Mon Jun 28 20:39:29 2021 Calculating P-sites offsets --- Done! Mon Jun 28 20:39:29 2021 Calculating P-sites positions and junctions ... Mon Jun 28 20:39:29 2021 Calculating P-sites positions and junctions --- Done! Mon Jun 28 20:40:34 2021 Building aggregate P-sites profiles ... Mon Jun 28 20:40:34 2021 [1] "Not enough signal or low frame preference, skipped P-sites & codon occupancy calculation. \n Set 'all' in the 'choose_readlengths' choice to ignore this warning and get P-sites positions in a new run" Building aggregate P-sites profiles --- Done! Mon Jun 28 20:40:34 2021 Exporting files ... Mon Jun 28 20:40:34 2021 Error in .local(object, con, format, ...) : Scores must be non-NA numeric values Calls: RiboseQC_analysis ... export -> export -> export -> export -> export -> .local In addition: There were 12 warnings (use warnings() to see them) Execution halted

I tried adding the "readlength_choice_method = "all"" parameter in the RiboseQC_analysis, but it gave the same error and it wouldn't export any files.

I am wondering if it has to do with the annotation and reference files I was using for building the bams (https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/raw)? Or is there something else I might be doing wrong?

Thanks a lot in advance.

zslastman commented 3 years ago

Hey! Thanks for reporting this. It does look like an annotation issue yes. Could you make a sample of your bam available as well so I can debug this? The reference and annotation should be exactly the same as the one used to make the bam files.

imilenkovic commented 3 years ago

Thanks a lot for your reply!

I attached a bam sample (copied the first ~20 rows). bam_sample.txt

Regarding the annotation and the reference files: The annotation is actually a .bed file: https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/riboflow_annot_and_ref - the one called appris_mouse_v1_actual_regions.bed. The reads were mapped to the transcriptome (but not a single .fa file - bowtie2 was used for mapping so there are 4 .bt2 files).

Let me know if this can still be used or the fastqs will need to be remapped to be compatible with this pipeline.

Thanks a lot!

zslastman commented 3 years ago

Ah ok, so, RiboseQC is actually designed to run with genomic alignments, and a gtf file - you'll need to align your reads to the genome rather than the transcriptome. Gencode gtfs should work fine for this.

imilenkovic commented 3 years ago

Hello,

Thanks for clarifying!

I remapped my reads using STAR, and these are the reference and annotation I used:

--genomeFastaFiles ~/references/GRCm38.p6.genome.fa \ --sjdbGTFfile ~/RiboseQC/gencode.vM27.annotation.gtf \

And then I tried running RiboseQC by using the same reference and same annotation (only fasta file converted to 2bit with faToTwoBit):

prepare_annotation_files(annotation_directory = ".", twobit_file = "~/references/GRCm38.p6.genome.2bit", gtf_file = "~/RiboseQC/gencode.vM27.annotation.gtf", scientific_name = "Mus.musculus", annotation_name = "vM27",export_bed_tables_TxDb = F,forge_BSgenome = T,create_TxDb = T)

but I run into the following error:

Creating the TxDb object --- Done! Fri Jul 2 17:56:03 2021 Extracting genomic regions ... Fri Jul 2 17:56:03 2021 Extracting ids and biotypes ... Fri Jul 2 17:58:32 2021 Error in loadFUN(x, seqname, ranges) : trying to load regions beyond the boundaries of non-circular sequence "chr4" Calls: prepare_annotation_files ... loadSubseqsFromStrandedSequence -> loadFUN -> loadFUN In addition: There were 50 or more warnings (use warnings() to see the first 50) Execution halted

Any ideas what might have gone wrong? Thanks!

zslastman commented 3 years ago

No problem. That looks to me like your gtf file is for a different version of the genome than the one in the fasta (a range in chr4 is asking for a coordinate that's too high) , I would look at the gencode page to confirm you have the right one.