tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0
80 stars 12 forks source link

SampCompDB error with bed file #132

Closed samir-watson closed 4 years ago

samir-watson commented 4 years ago

Hi, I am getting the following error when trying to use SampCompDB to save a bed file:

nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

I am using gencode.v33.transcripts.fa for transcript alignment and I am making a bed file using the gft2bed tool using the gencode.v33.annotation.gtf file

Im not sure what the problem is here as I am using the same gencode release (v33) to download both the gtf and the fasta file.

here is my code:

db = SampCompDB (
...     db_fn = "/home/samirwatson/faststorage/NAT10/basecalled/Nanocompore_results/out_SampComp.db",
...     fasta_fn = "/home/samirwatson/faststorage/NAT10/gencode.v33.transcripts.fa",
...     log_level= "warning",
...     bed_fn = "/home/samirwatson/faststorage/NAT10/gencode.v33.annotation.gtf.bed"
... )

here is the traceback:

Traceback (most recent call last): File "", line 5, in File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 108, in init self.results = self.__calculate_results(adjust=True) File "/home/samirwatson/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 166, in __calculate_results raise NanocomporeError("Some references are missing from the BED file provided") nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

tleonardi commented 4 years ago

Hi @samir-watson, there must be some mismatch in the transcript IDs between the fasta file and the BED file. You can check what ref_ids are used in the SampCompDB (by loading it without passing a BED file) and compare those to the ref_ids used in the BED file (col 4). Do they match?

The way I usually do it is to convert the Gencode/Ensemble GTF to bed (with bedparse) and then convert the BED to fasta with bedtools getfasta and fix the fasta headers with a regex:

bedtools getfasta -fi ${genome_fasta} -s -split -name -bed reference_transcriptome.bed -fo - | perl -pe 's/>(.+)\\([+-]\\)/>\$1/'

As a reference, you can also check the code of our nextflow pipeline

samir-watson commented 4 years ago

Hi @tleonardi,

I have had a good look at what was happening with and it seems to a problem with pipes in the fasta headers that are being carried over from the fasta I used during SampComp so when I compare the SampCompDB database to the bed file there is no longer a match. I will try and redo the alignment with the pipes and extra IDs removed and see if that works.

On another note I found a tiny issue in SampCompDB.py line 421, it says "red_if" and I think its meand to be ref_id.

Cheers, Samir

tleonardi commented 4 years ago

Hi @samir-watson, thanks for spotting the typo. It's getting fixed