rcavalcante / annotatr

Package Homepage: http://bioconductor.org/packages/devel/bioc/html/annotatr.html Bug Reports: https://support.bioconductor.org/p/new/post/?tag_val=annotatr.

Annotations for just the main isoform of a gene? #29

Closed ben-laufer closed 5 months ago

ben-laufer commented 4 years ago

Hello,

I was wondering if it's possible to retrieve the annotations for just the main isoform of each gene in the genome and not all the other isoforms? I think this would help refine downstream enrichment testing, since otherwise there is a fair amount of overlap between features, and having this sort of filtering option would reduce it.

Below is the current call that I'd like to refine:

library(annotatr)
library(magrittr)  # provides %>%

genome <- "hg38"
annotations <- build_annotations(genome = genome, annotations = c(
        paste0(genome, "_basicgenes"),
        paste0(genome, "_genes_intergenic"),
        paste0(genome, "_genes_intronexonboundaries"),
        # Only request FANTOM enhancers for hg38/mm10; c() drops the NULL otherwise
        if (genome %in% c("hg38", "mm10")) paste0(genome, "_enhancers_fantom"))) %>%
    GenomeInfoDb::keepStandardChromosomes(pruning.mode = "coarse")

Thanks,

Ben

rcavalcante commented 4 years ago

Hi Ben,

Sorry for the delay. I understand the need for a more concise list of annotations, though I wonder how to choose a "main isoform", because it may change depending on the biological context.

One possible alternative would be to layer the isoforms on top of each other with some prioritization of annotations (e.g. promoter > 5'UTR > 3'UTR > exon > intron). I sort of wanted to avoid this, but I could see how it would be possible to allow the user to define it, rather than me making a decision someone will dislike.
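A minimal sketch of that prioritization idea, applied after annotation rather than inside build_annotations. It assumes the annotated regions have already been converted to a data.frame with an annot.type column (as data.frame() on the output of annotate_regions() gives). The region_id column, the toy rows, and the collapse_by_priority helper are hypothetical and not part of annotatr.

```r
# Toy stand-in for data.frame(annotate_regions(...)); only the columns used here.
annotated <- data.frame(
    region_id  = c("r1", "r1", "r1", "r2", "r2", "r3"),
    annot.type = c("hg38_genes_introns", "hg38_genes_exons", "hg38_genes_promoters",
                   "hg38_genes_introns", "hg38_genes_3UTRs", "hg38_genes_introns"),
    stringsAsFactors = FALSE
)

# User-defined priority, highest first (promoter > 5'UTR > 3'UTR > exon > intron).
priority <- paste0("hg38_genes_", c("promoters", "5UTRs", "3UTRs", "exons", "introns"))

# Hypothetical helper: keep one row per region, the one whose annot.type
# ranks highest in `priority`. Types missing from `priority` are dropped.
collapse_by_priority <- function(df, priority, id_col = "region_id") {
    rank <- match(df$annot.type, priority)
    keep <- !is.na(rank)
    df   <- df[keep, , drop = FALSE]
    rank <- rank[keep]
    df   <- df[order(df[[id_col]], rank), , drop = FALSE]
    df[!duplicated(df[[id_col]]), , drop = FALSE]
}

collapsed <- collapse_by_priority(annotated, priority)
```

Because the priority vector is passed in, the ordering stays the user's decision rather than a package default.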

Can you help me to better understand "main isoform"?

Thanks, Raymond

ben-laufer commented 4 years ago

Hi Raymond,

I can definitely see the challenge with defining a main isoform of a gene because, as you said, it can be tissue and context specific and thus there is no perfect approach. I was thinking maybe something along the lines of Matched Annotation from NCBI and EMBL-EBI (MANE) or even just some sort of summary for highest expression across a certain number of tissues from GTEx?
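To make the expression-based option concrete, here is a rough sketch that picks one transcript per gene by its median expression across tissues. The matrix values, transcript and gene names, and the choice of the median as the summary are all made up for illustration; this is not GTEx data and not anything annotatr does.

```r
# Toy transcript-by-tissue expression matrix (made-up, TPM-like values).
expr <- matrix(
    c(10, 12,  8,
       1,  0,  2,
       5, 30,  4,
       6,  7,  9),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("txA1", "txA2", "txB1", "txB2"),
                    c("tissue1", "tissue2", "tissue3"))
)

# Hypothetical transcript-to-gene mapping.
tx2gene <- c(txA1 = "geneA", txA2 = "geneA", txB1 = "geneB", txB2 = "geneB")

# Summarize each transcript across tissues, then keep the top transcript per gene.
summary_expr <- apply(expr, 1, median)
main_tx <- vapply(
    split(summary_expr, tx2gene[names(summary_expr)]),
    function(x) names(x)[which.max(x)],
    character(1)
)
```

The summary statistic matters: txB1 has the single highest value in one tissue, but the median favors txB2, which is more consistently expressed.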

I can also see the limitations to layering, but that could be something interesting to explore as well.

The enrichment testing I'm doing downstream of annotatr is relative to background regions, and I am finding that one dataset is de-enriched for all genic annotations, so I wanted to see how reducing the overlap would change this.

Anyways, these were just some possible enhancements I was curious about and I've really been enjoying using annotatr, so thanks for making such a great program.

Best,

Ben

rcavalcante commented 4 years ago

Hi Ben,

Sorry for the delay in following up. Thank you for the link to the MANE resource. The links to GTFs/GFFs will make this easier. I can see a path from those GTFs/GFFs, through GenomicFeatures::makeTxDbFromGFF, to the genic annotations you're proposing.
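A rough sketch of what that path could look like, assuming a MANE GTF has already been downloaded to a local file. The genic_from_gtf helper, its argument names, and the promoter window defaults are hypothetical and not part of annotatr. It also assumes a GenomicFeatures version that still exports makeTxDbFromGFF (recent Bioconductor releases move it to txdbmaker).

```r
# Hypothetical helper, not part of annotatr.
# gtf_path: path to a locally downloaded GTF (e.g. a MANE release).
genic_from_gtf <- function(gtf_path, upstream = 1000, downstream = 0) {
    # Build a transcript database from the GTF.
    txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_path, format = "gtf")

    # Pull out per-transcript genic features as GRanges / GRangesList objects.
    list(
        promoters = GenomicFeatures::promoters(txdb, upstream = upstream,
                                               downstream = downstream),
        exons     = GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE),
        introns   = GenomicFeatures::intronsByTranscript(txdb, use.names = TRUE),
        fiveUTRs  = GenomicFeatures::fiveUTRsByTranscript(txdb, use.names = TRUE),
        threeUTRs = GenomicFeatures::threeUTRsByTranscript(txdb, use.names = TRUE)
    )
}
```

Since a MANE GTF carries one selected transcript per gene, the features extracted this way would not overlap across isoforms of the same gene. Turning them into annotatr-style annotations (type names, gene symbols) is left out here.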

Related to another issue (#31), I was struck that a refactor of the annotation building code would help make the system more extensible (as I tried to outline for myself in #32).

Bioconductor has its next release on Monday (04/27), and with my workload I don't think I can accomplish this by then. However, give me a couple of weeks, and I think I'll be able to get this refactored and working.

Thanks, Raymond