openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0
365 stars 68 forks source link

[How to use with custom data] NCBI human gene model #265

Open damianosmel opened 2 years ago

damianosmel commented 2 years ago

Dear pyensembl team,

I would like to use pyensembl for the genemodel gff for human species download from NCBI.

As this is not from ensembl, I have followed the your mentioned guide. However, I could not in the end do "pyensembl install .."

My detailed steps were:

  1. convert the initial ncbi gff to gtf format agat_convert_sp_gff2gtf.pl --gff /path/to/GCF_000001405.39_GRCh38.p13_genomic.gff -o /path/to/ncbi_grch38.gtf > /path/to/gff2gff_conversion.log

  2. install this new genome model pyensembl install --reference-name GRCh38 --annotation-name NCBI --gtf "/path/to/ncbi_grch38.gtf"

However at the second step I get the attached log output and including the error:

raise ValueError(
ValueError: Column 'transcript_id' can't be primary key of table 'transcript' because it contains repeated values

pyensembl_install_ncbi.log

Grateful to your suggestion on how to resolve this error and so use the ncbi genome annotation :)

Many thanks! Damianos

damianosmel commented 2 years ago

Dear pyensembl team,

I have tried to re-install the ncbi genome model and the relevant fasta files.

to do so, I have done:

from pyensembl import Genome
ncbi_genome_model = Genome(reference_name='GRCh38',annotation_name='ncbi',gtf_path_or_url='GCF_000001405.39_GRCh38.p13_genomic.gtf.gz',transcript_fasta_paths_or_urls='GCF_000001405.39_GRCh38.p13_genomic.fna.gz',protein_fasta_paths_or_urls='GCF_000001405.39_GRCh38.p13_translated_cds.faa.gz',copy_local_files_to_cache=True,cache_directory_path='/home/path/to/.cache/pyensembl')
ncbi_genome_model.index()

Then when I would like to extract the coding sequence for a transcript using ncbi transcript id I get:

slc26a4_transcript = ncbi_genome_model.transcript_by_id("NM_000441.2")
print(slc26a4_transcript.coding_sequence)
# will output: None

But using ensembl annotation it works:

from pyensembl import EnsemblRelease
data = EnsemblRelease(106)
slc26a4_transcript_ensembl = data.transcript_by_id("ENST00000644269")
print(slc26a4_transcript_ensembl.coding_sequence)
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/damianos.melidis/.cache/pyensembl/GRCh38/ensembl106/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/damianos.melidis/.cache/pyensembl/GRCh38/ensembl106/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
# will output: ATGGCA...TGA (skipped the rest of the coding sequence for space)

The output of defining the genome based on ncbi annotation and indexing the annotation is attached. index_ncbi_model_in_code.log

I feel I had some progress by adding the transcript and protein sequence data, but still I can't get the same amount of information as using the ensembl.

Grateful to your ideas on what I am doing wrong here..

Many thanks, Damianos