rcavalcante / annotatr

Package Homepage: http://bioconductor.org/packages/devel/bioc/html/annotatr.html Bug Reports: https://support.bioconductor.org/p/new/post/?tag_val=annotatr.
26 stars 8 forks source link

Use AnnotationHub accession directly in lncRNA cases #35

Closed rcavalcante closed 3 years ago

rcavalcante commented 4 years ago

In build_lncrna_annots() we're doing something a bit roundabout by querying and then looking for the genome. This worked when there was only one genome == GRCh38 but now there are two, and this function isn't going to work so well.

Rohit-Satyam commented 3 years ago

Hi @rcavalcante

Can you lead with an example. Like if we wish to download the hg38 lncRNA coordinates how will we do that?

annotatr::build_lncrna_annots(genome = "hg38")

Error: 'build_lncrna_annots' is not an exported object from 'namespace:annotatr'

##also tried this way ###
annotatr::build_annotations(genome = 'hg38', annotations = 'hg38_lncrna_gencode')
snapshotDate(): 2020-04-27
Building lncRNA transcripts...
Error in .local(x, i, j = j, ...) : 'i' must be length 1

How can I download it? Please help

EDIT1: Okay so looking at the source code I realized that for hg19 you have only one lncRNA files but for hg38, you are querying annotation hub that returns two IDs

> ID
[1] "AH49559" "AH75123"

AH49559 sub-set of the main annotation files on the reference chromosomes. They contain only the lncRNA genes. Long non-coding RNA genes are considered the genes with any of those biotypes: 'processed_transcript', 'lincRNA', '3prime_overlapping_ncrna', 'antisense

AH75123 sub-set of the main annotation files on the reference chromosomes. They contain only the lncRNA genes. Long non-coding RNA genes are considered the genes with any of those biotypes: 'processed_transcript', 'lincRNA', '3prime_overlapping_ncrna', 'antisense', 'non_coding', 'sense_intronic' , 'sense_overlapping' , 'TEC' , 'known_ncrna'.

Maybe a good workaround could be to choose the latest release as default for lncRNA or use something like this that provided both ID and the information about the annotations to the users

print(GenomicRanges::mcols(query)[GenomicRanges::mcols(query)$genome == hub_genome, c('genome','description')])
        ID = readline()

The user then can enter their desired ID with which they wish to continue having read that description

EDIT2 For some reason the latest release could not be downloaded using

> ah[[ID[2]]]
loading from cache
Error: failed to load resource
  name: AH75123
  title: gencode.v31.long_noncoding_RNAs.gff3.gz
  reason: argument is of length zero
Rohit-Satyam commented 3 years ago

Can you help me what to use in annotation_class in build_ah_annots function for lncrna?

Rohit-Satyam commented 3 years ago

Okay so after so many hit and trials I was able to do it After looking at the source code and issue posted here and of course struggling for hours, the following code worked.

You should include this line in manual: ##Keep the format intact i.e. in case if you are building custom annotation from BED file or AH, keep annotation_class = 'custom'.


code = c('lncrna' = 'AH49559')
build_ah_annots(genome = 'hg38', ah_codes = code, annotation_class = 'custom')
print(annotatr_cache$list_env())

##print(annotatr_cache$list_env())
##[1] "hg38_custom_lncrna"
anno_lncrna = annotatr::build_annotations(genome = 'hg38', annotations = 'hg38_custom_lncrna')
rcavalcante commented 3 years ago

Again, I apologize for the late reply owing to the holidays.

I agree that the process of annotation creation from resources (and in general) could be more straightforward, and I appreciate the suggestions you've made.