jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

How to create edb from GTF and add protein data? #154

Closed zyh4482 closed 3 months ago

zyh4482 commented 4 months ago

I'm using the latest human release (112) from Ensembl, but AnnotationHub does not have a record for it yet, so I tried to build an EnsDb from the GTF file.
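
Roughly, I built the database like this (the GTF file name is a placeholder; as far as I understand, ensembldb can guess organism, genome build and Ensembl version from the standard Ensembl file name):

  library(ensembldb)

  ## create an SQLite EnsDb from the Ensembl GTF and load it
  dbfile <- ensDbFromGtf("Homo_sapiens.GRCh38.112.gtf.gz")
  edb <- EnsDb(dbfile)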

However, calling test <- proteinToGenome(testdb, edb) to map protein coordinates to genomic coordinates produced the following error:

Fetching CDS for 125718 proteins ... Error: AnnotationFilter classes: ProteinIdFilter are not supported by this EnsDb database. In addition: Warning message: In cleanColumns(x, unique(c(columns, "tx_id"))) : Column 'protein_id' is not present in the database and has been removed

The reason is probably that the GTF file from Ensembl does not contain protein data, but I don't know how to add that kind of information to it.

Could you please help me with it? Thanks.

zyh4482 commented 4 months ago

Sorry, I think I just found the relevant statement in your vignette:

Also note that protein annotations are usually not available in GTF or GFF files, thus, such annotations will not be included in the generated EnsDb database - protein annotations are only available in EnsDb databases created with the Ensembl Perl API (such as the ones provided through AnnotationHub or as Bioconductor packages).

But when I tried to use the Ensembl Perl API I failed, because installing the API is a disaster. I could only run it in Docker, but in that case I cannot call the Ensembl API from R. Is there another workaround?

zyh4482 commented 4 months ago

Another issue is about proteinToGenome.

When I tried to use an EnsDb obtained from AnnotationHub (release 111), it failed with the following error message:

Fetching CDS for 125718 proteins ... Error: subscript contains invalid names

Would you mind telling me how to fix it? I cannot find any malformed ID.

The following commands were used:

  library(ensembldb)
  library(IRanges)

  ## within-protein coordinates, named by Ensembl transcript ID
  tmpGR <- IRanges(start = data$start,
                   end = data$end,
                   names = data$ensembl_transcript_id)

  ## map to genomic coordinates, matching the IDs against the tx_id column
  tmp <- proteinToGenome(tmpGR, edb, idType = "tx_id")
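
I guess I could check which of my transcript IDs are unknown to the database with something like this (just a sketch):

  ## which of the supplied transcript IDs are not present in the EnsDb?
  in_db <- names(tmpGR) %in% keys(edb, keytype = "TXID")
  sum(!in_db)
  head(names(tmpGR)[!in_db])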

zyh4482 commented 4 months ago

Update

I changed the IDs to protein IDs, and somehow it works now.
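
For reference, the call now looks roughly like this (the column names of data are from my own table):

  ## within-protein coordinates, this time named by Ensembl protein ID;
  ## "protein_id" is the default idType of proteinToGenome()
  prtGR <- IRanges(start = data$start,
                   end = data$end,
                   names = data$ensembl_peptide_id)
  tmp <- proteinToGenome(prtGR, edb, idType = "protein_id")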

But my input data is quite large and proteinToGenome() is very time-consuming: it takes more than 10 hours...

Then I ran into another problem. My full list of transcript IDs was extracted via the Ensembl biomaRt API, and all of them have a protein and a coding sequence, but ensembldb somehow can't find some of those proteins. When I manually checked the dropouts, they are non-canonical proteins (e.g. using an ambiguous start codon). I don't understand why all the sequences and coordinates can be found on Ensembl, but ensembldb still can't find them. Maybe it is an issue with AnnotationHub?
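
To double-check, I suppose I could query the database for those IDs directly, something along these lines (missing_ids being my vector of dropout protein IDs):

  ## look up the dropout protein IDs in the EnsDb; zero rows returned would
  ## mean they are really absent from the database
  prt <- proteins(edb, filter = ProteinIdFilter(missing_ids),
                  return.type = "DataFrame")
  nrow(prt)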

Still, many thanks for this valuable tool!

jorainer commented 3 months ago

I did already generate the Ensembl 112 databases and uploaded them to AnnotationHub - but it seems there is some processing delay and it might take a bit until they become available. In the meantime I could manually share one of the databases with you if you tell me which species you need.

Then, I'm a bit confused by the analysis you are doing. You have Ensembl transcript identifiers and start and end positions within these transcripts that you would like to map to the genome - so why are you using proteinToGenome() and not transcriptToGenome()?
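
Just to illustrate the difference (the transcript IDs and coordinates below are made-up placeholders): transcriptToGenome() maps within-transcript (nucleotide) coordinates, while proteinToGenome() expects within-protein (amino acid) coordinates.

  ## within-transcript coordinates in nucleotides, named by transcript ID
  rng <- IRanges(start = c(2, 55), end = c(30, 60),
                 names = c("ENST00000xxxxxx", "ENST00000yyyyyy"))
  transcriptToGenome(rng, edb)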

zyh4482 commented 3 months ago

Glad to hear that! Thanks for your dedicated efforts!

Because in my previous step the transcript ID is used as the unique ID for each peptide sequence (the peptide ID could be used as well; I first thought it would make no difference, until I realized that I need this tool for the mapping). These peptide sequences are used for some analyses that return only residue coordinates, and those coordinates are the ones I need for downstream tasks. A little bit complicated, but it seems to be the only choice, I guess.

zyh4482 commented 3 months ago

I'm working on Homo sapiens. Thanks for your kind help! May I ask how I could get the database and how to load it manually?
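
I assume a manually shared database would just be the SQLite file, which I could then load directly - something like this (the file name is a placeholder)?

  library(ensembldb)

  ## load an EnsDb straight from a local SQLite file instead of AnnotationHub
  edb <- EnsDb("EnsDb.Hsapiens.v112.sqlite")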

guidohooiveld commented 3 months ago

I found this thread earlier this week since I was also checking the status of the Ensembl 112 databases (the latest GENCODE release is based on Ensembl v112).

@zyh4482 : I would just like to let you know that today I was able to access and download the human (and mouse) v112 EnsDb from the AnnotationHub.

@jorainer : Thank you once again for your continuous efforts to build and provide all of us with the up-to-date Ensembl databases through the AnnotationHub!

> library(AnnotationHub)
> library(ensembldb)
> library(AnnotationForge)
>
> ah <- AnnotationHub()
snapshotDate(): 2024-04-30
>
> EnsDb112 <- query(ah, c("EnsDb", "v112", "Homo sapiens"))
> EnsDb112
AnnotationHub with 1 record
# snapshotDate(): 2024-04-30
# names(): AH116860
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2024-04-30
# $title: Ensembl 112 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("112", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH116860"]]' 
> 
>
> EnsDb112 <- EnsDb112[["AH116860"]]
loading from cache
> EnsDb112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Fri May 17 11:25:50 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 64102.
| No. of transcripts: 215647.
|Protein data available.
> 

zyh4482 commented 3 months ago

@guidohooiveld Thank you so much!