build_annotations(genome = 'hg19', annotations = 'hg19_genes_promoters') returns promoters with no gene_id or symbol

kdkorthauer commented 7 years ago

Hi,

I am using the build_annotations function to fetch the promoter (<1kb) regions of hg19 genes using:

build_annotations(genome = 'hg19', annotations = 'hg19_genes_promoters')

This gives me a GRanges object with 82,960 regions, 9,528 of which have missing values for both gene_id and symbol. Presumably, the promoters with no Entrez gene ID or symbol associated with them are obtained from a larger set of genes than the one used to obtain gene labels. Is there a way to return an alternate gene ID (perhaps Ensembl?) associated with the promoters that do not have an Entrez ID or symbol?
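
For reference, a minimal sketch of how the unlabeled promoters can be counted. It assumes the annotation metadata has already been pulled into a plain data frame (for example with as.data.frame(S4Vectors::mcols(promoters))) and that it has tx_id, gene_id and symbol columns. The toy table and its IDs are made up for illustration and are not real annotatr output.

```r
# Toy stand-in for the promoter metadata; every value here is hypothetical.
promoter_meta <- data.frame(
  tx_id   = c("ucTOY001.1", "ucTOY002.1", "ucTOY003.1"),
  gene_id = c("1001", NA, "1003"),
  symbol  = c("GENE_A", NA, "GENE_C"),
  stringsAsFactors = FALSE
)

# Flag rows where both the Entrez gene ID and the symbol are missing.
unlabeled <- is.na(promoter_meta$gene_id) & is.na(promoter_meta$symbol)

# Number of unlabeled promoters and the transcript names they carry.
n_unlabeled  <- sum(unlabeled)
unlabeled_tx <- promoter_meta$tx_id[unlabeled]
```

On the real annotations, the same two lines applied to the metadata columns should give the count of regions lacking both identifiers.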

I am using annotatr_1.0.3 with R 3.3.1.

Thanks! Keegan

rcavalcante commented 7 years ago

Hello,

I use the TxDb.Hsapiens.UCSC.hg19.knownGene package to build the hg19 gene annotations, and this includes a more expansive list of transcripts (the larger set of genes you mentioned) that don't always have Entrez Gene IDs and gene symbols (as you discovered).

Your point of including alternative IDs is well taken. My guess would be that transcripts without Entrez IDs are less likely to have ENSEMBL Gene IDs, and more likely to have ENSEMBL Transcript IDs. Would including the ENSEMBL Transcript IDs be useful to you? Including an extra column for these IDs would be pretty easy to add in a future update.

For the time being, the UCSC Transcript names (e.g. uc057atz.1) might help you track down more information about the annotations, and the knownToEnsembl table from UCSC can take you from the UCSC Transcript Names to the ENSEMBL Transcript IDs.
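
The lookup could be sketched roughly as follows. This assumes the knownToEnsembl table has been downloaded from UCSC and read into a data frame with two columns, name (UCSC Transcript name) and value (ENSEMBL Transcript ID). The helper add_ensembl_tx and the toy IDs are hypothetical and only illustrate the join; they are not part of annotatr.

```r
# Hypothetical helper: look up ENSEMBL Transcript IDs for UCSC Transcript names.
# `known_to_ensembl` is assumed to have columns `name` (UCSC) and `value` (ENSEMBL).
add_ensembl_tx <- function(tx_id, known_to_ensembl) {
  # match() returns NA for names absent from the table, so those stay NA.
  known_to_ensembl$value[match(tx_id, known_to_ensembl$name)]
}

# Toy mapping table; the IDs are made up for illustration.
# A real table might be read with something like:
#   read.delim("knownToEnsembl.txt.gz", header = FALSE,
#              col.names = c("name", "value"), stringsAsFactors = FALSE)
known_to_ensembl <- data.frame(
  name  = c("ucTOY001.1", "ucTOY003.1"),
  value = c("ENST_TOY_0001", "ENST_TOY_0003"),
  stringsAsFactors = FALSE
)

# Transcript names as they would appear in the tx_id column of the annotations.
tx_id <- c("ucTOY001.1", "ucTOY002.1", "ucTOY003.1")
ensembl_tx <- add_ensembl_tx(tx_id, known_to_ensembl)
```

Note that the match is on the full transcript name, including the version suffix, so the table should come from the same UCSC release as the annotations.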

Please let me know if that's not quite what you had in mind.

Thanks for using annotatr! Raymond

PS Were you at the ITCR meeting at the Broad this past June? Your name and picture ring a bell.

kdkorthauer commented 7 years ago

Hi Raymond,

Thanks for your prompt response. That's a great point about the transcripts without Entrez IDs being less likely to have Ensembl Gene IDs. And I completely missed the column with UCSC Transcript names. With the knownToEnsembl table, I can get exactly what I need. I don't think it's necessary to provide Ensembl Transcript IDs since the UCSC ones are already there.

Thanks for the clarification and advice!

PS. Your name is also familiar to me... I was not at the ITCR meeting; perhaps we've crossed paths at a different meeting?

Best, Keegan

rcavalcante / annotatr #2