ohlerlab / RiboseQC

Pipeline for Quality Control of Ribo-Seq data, selection of P-site offsets, and codon usage statistics.
15 stars 14 forks source link

Error in 'prepare_annotation_files' after last two commits #18

Open Jelinek-J opened 2 years ago

Jelinek-J commented 2 years ago

After last two commits in recent days, I'm not able to successfully run prepare_annotation_files method (each of the updates introduced another error):

common initialization:

# to have all up to date; gparted just to enlarge disc space, it should not affect anything else
$ sudo apt-get update; sudo apt-get upgrade; sudo apt-get autoremove; sudo apt-get install gparted

# prerequisities
$ sudo apt-get install r-base libxml2-dev libssl-dev libcurl4-openssl-dev pandoc
$ rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/faToTwoBit .
$ chmod 744 faToTwoBit

# download data to work with
$ wget  ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ wget  ftp://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
$ gunzip Homo_sapiens.GRCh38.*.gz
$ wget http://rpredictor.ms.mff.cuni.cz/test.bam # short excerpt to quick check

# preprocessing
$ ./faToTwoBit Homo_sapiens.GRCh38.dna.primary_assembly.fa Homo_sapiens.GRCh38.dna.primary_assembly.2bit

(current) version from 18. 11.:

> install.packages("devtools")
> library("devtools")
> install_github(repo = "ohlerlab/RiboseQC@22ad7736a1f914e2e3c794c84807be909e89ccef")
> library("RiboseQC")
> prepare_annotation_files(".", "Homo_sapiens.GRCh38.dna.primary_assembly.2bit", "Homo_sapiens.GRCh38.104.gtf", annotation_name="GRCh38", genome_seq="Homo_sapiens.GRCh38.dna.primary_assembly.fa")
Creating the TxDb object ... Fri Nov 26 13:38:48 2021
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Creating the TxDb object --- Done! Fri Nov 26 13:40:02 2021
Extracting genomic regions ... Fri Nov 26 13:40:02 2021
Extracting ids and biotypes ... Fri Nov 26 13:43:12 2021
Error in prepare_annotation_files(".", "Homo_sapiens.GRCh38.dna.primary_assembly.2bit",  :
  "transcript_id" %in% gtfdata$type is not TRUE
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

version from 15. 11.:

> install.packages("devtools")
> library("devtools")
> install_github(repo = "ohlerlab/RiboseQC@8aa0139878994b8541e98c7e538a229bc4028abf")
> library("RiboseQC")
> prepare_annotation_files(".", "Homo_sapiens.GRCh38.dna.primary_assembly.2bit", "Homo_sapiens.GRCh38.104.gtf", annotation_name="GRCh38", genome_seq="Homo_sapiens.GRCh38.dna.primary_assembly.fa")
Creating the TxDb object ... Fri Nov 26 13:38:21 2021
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Creating the TxDb object --- Done! Fri Nov 26 13:39:34 2021
Extracting genomic regions ... Fri Nov 26 13:39:34 2021
Extracting ids and biotypes ... Fri Nov 26 13:42:44 2021
Error in prepare_annotation_files(".", "Homo_sapiens.GRCh38.dna.primary_assembly.2bit",  :
  length(unique(gtfdata$transcript_id)) == n_transcripts is not TRUE
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

Older versions from last year worked correctly.

Jelinek-J commented 2 years ago

The last commit 22ad7736a1f914e2e3c794c84807be909e89ccef made a new mistake instead of repairing the older one, the previous version of https://github.com/ohlerlab/RiboseQC/blob/22ad7736a1f914e2e3c794c84807be909e89ccef/R/riboseqc.R#L2693 was correct.

The original problem is caused by that unique(gtfdata$transcript_id) from https://github.com/ohlerlab/RiboseQC/blob/22ad7736a1f914e2e3c794c84807be909e89ccef/R/riboseqc.R#L2692 contains an extra NA in comparison with unique(gtfdata$transcript_id) from https://github.com/ohlerlab/RiboseQC/blob/22ad7736a1f914e2e3c794c84807be909e89ccef/R/riboseqc.R#L2696 Possible solution on StackOverflow