Genome prefix must be the same for custom annotation

rcavalcante / annotatr

Package Homepage: http://bioconductor.org/packages/devel/bioc/html/annotatr.html Bug Reports: https://support.bioconductor.org/p/new/post/?tag_val=annotatr.

26 stars 8 forks source link

Genome prefix must be the same for custom annotation #26

Closed clersdom closed 4 years ago

clersdom commented 4 years ago

Hi,

I am trying to build a custom annotation for annotating lncRNAs, as below

lncRNAs_custom <- file.path("./","lncRNAs.bed") extraCols_lncRNA = c(symbol = 'character') custom_lncRNAs<- read_annotations(lncRNAs_custom, genome = "hg19", extraCols = extraCols_lncRNA, format = "bed", name = "lncRNAs")

head(custom_lncRNAs) GRanges object with 6 ranges and 5 metadata columns: seqnames ranges strand | id tx_id gene_id symbol type

| [1] chr1 84267199-84326229 * | lncRNAs:1 LINC01725:44 hg19_custom_lncRNAs [2] chr16 74226291-74249420 * | lncRNAs:2 lnc-ZFHX3-27:11 hg19_custom_lncRNAs annots_custom = c('hg19_cpgs', 'hg19_basicgenes','hg19_enhancers_fantom', 'custom_lncRNAs') All seems fine until I run the `build_annotations` command. I have got this error message: annotations = build_annotations(genome = 'hg19', annotations = annots_custom) Error in check_annotations(builtin_annotations) : Error: genome prefix on all annotations must be the same. I have checked in builtin_annotations, and I can see there are lncrnas: lncrna_codes = c("hg19_lncrna_gencode", "hg38_lncrna_gencode", "mm10_lncrna_gencode") Do I need to incorporate this somehow to `build_annotations`? I think the error may come from the "name" option when running `read_annotions`, but I can not make it work Many thanks, Clara

rcavalcante commented 4 years ago

Hi Clara,

I think the only issue is that in:

annots_custom = c('hg19_cpgs', 'hg19_basicgenes','hg19_enhancers_fantom', 'custom_lncRNAs')

You should replace custom_lncRNAs with hg19_custom_lncRNAs.

The interface for custom annotations is a little strange (I had some assistance from Bioconductor because they didn't agree with my initial implementation). In short, you don't actually assign the result of read_annotations() to a variable because the custom annotations are held in an object called annotatr_cache. Here is the example from the vignette:

## Use ENCODE ChIP-seq peaks for EZH2 in GM12878
## These files contain chr, start, and end columns
ezh2_file = system.file('extdata', 'Gm12878_Ezh2_peak_annotations.txt.gz', package = 'annotatr')

## Custom annotation objects are given names of the form genome_custom_name
read_annotations(con = ezh2_file, genome = 'hg19', name = 'ezh2', format = 'bed')

# To see the annotations you just created
print(annotatr_cache$get('hg19_custom_ezh2'))

# To see what's in annotatr_cache
print(annotatr_cache$list_env())

Let me know if that solves your problem.

Thanks, Raymond

clersdom commented 4 years ago

Hi Raymond,

It has worked now, thanks! I have checked in print(annotatr_cache$list_env()) [1] "hg19_custom_hg19_lncrna_gencode" "hg19_custom_lncrna"

So just replaced custom_lncRNAs with hg19_custom_lncrna as you have noted.

Thanks! Clara

rcavalcante commented 4 years ago

Wonderful, I'll go ahead and close this issue. I have also made a fix for the lncRNA URL in hg19 which was mentioned in #17.

Rohit-Satyam commented 4 years ago

Hi @rcavalcante I was trying to build lncRNA annotations using the following command: lncRNA = 'hg19_custom_lncRNAs'

anno_basic = build_annotations(genome = 'hg19', annotations = lncRNA)

However, it gives me the following error

Error: ‘hg19_custom_lncRNAs’ not in annotatr_cache

The hg19_lncrna_gencode is however working. Are the two same and is it fine to used gencode one in the place of custom set. If not how can I build annotations for custom lncRNA.

Also does 'hg19_cpgs', 'hg19_basicgenes', 'hg19_genes_intergenic', 'hg19_genes_intronexonboundaries' etc are for UCSC hg19?

rcavalcante commented 4 years ago

Hello,

The only way there could be a hg19_custom_lncRNAs annotation was if you created it yourself with the read_annotations() function. You would need a BED-like file that defined the lncRNA to your satisfaction.

The hg19_lncrna_gencode is the built-in resource that comes from: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz. You could customize the hg19_lncrna_gencode annotations by filtering out whichever class of lncRNAs you didn't need.

> lncrnas_gr = build_annotations(genome = 'hg19', annotations = 'hg19_lncrna_gencode')
Building lncRNA transcripts...
trying URL 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.long_noncoding_RNAs.gtf.gz'
Content type 'unknown' length 2330848 bytes (2.2 MB)
==================================================
> table(stringr::str_split_fixed(lncrnas_gr$id, ':', 2)[,1])

3prime_overlapping_ncrna                antisense                  lincRNA 
                      25                     9710                    11780 
                   miRNA                 misc_RNA     processed_transcript 
                      61                       16                      608 
         retained_intron                     rRNA           sense_intronic 
                     483                        4                      802 
       sense_overlapping                   snoRNA                    snRNA 
                     330                       72                        7

As to your second question, the hg19 genic annotations are based on UCSC. The CpGs are also taken UCSC via AnnotationHub ID AH5086.

Hope that helps. And in future, please create a new issue rather than adding a comment onto a closed issue that isn't directly related to your question.

Rohit-Satyam commented 4 years ago

Hi @rcavalcante . Thanks for your prompt rejoinder and my apologies for opening a discussion on a closed one. The gencode lncrna does give broad counts of lncRNA usingthe following command anno_lncrna = annotatr::build_annotations(genome = 'hg19', annotations = 'hg19_lncrna_gencode')

df_anno_lncRNA = annotate_regions( regions = dm_regions, annotations = anno_lncrna, ignore.strand = TRUE, quiet = FALSE) lncRNA = data.frame(df_anno_lncRNA) but if we wish to get more resolved annotation like if they are snRNA or lincRNA in spite of broad annotation term such as hg19_lncrna_gencode. Do I need to go and edit the downloaded gtf file?

rcavalcante commented 4 years ago

No, if you look in the id column, you will note that the particular type of lncRNA is contained therein.

> table(stringr::str_split_fixed(lncrnas_gr$id, ':', 2)[,1])

3prime_overlapping_ncrna                antisense                  lincRNA 
                      25                     9710                    11780 
                   miRNA                 misc_RNA     processed_transcript 
                      61                       16                      608 
         retained_intron                     rRNA           sense_intronic 
                     483                        4                      802 
       sense_overlapping                   snoRNA                    snRNA 
                     330                       72                        7

I made the decision to not enumerate these in the type column because there were so many, and I thought figures would get too crowded. However, you should be able to change the type column to your liking after building the lncRNA annotations.

lncrnas_gr = build_annotations(genome = 'hg19', annotations = 'hg19_lncrna_gencode')
subtypes = str_split_fixed(test$id, ':', 2)[,1]
lncrnas_gr$type = paste('hg19_lncrna', subtypes, sep = '_')

You can then see that the types are:

> unique(lncrnas_gr$type)
 [1] "hg19_lncrna_lincRNA"                 
 [2] "hg19_lncrna_miRNA"                   
 [3] "hg19_lncrna_antisense"               
 [4] "hg19_lncrna_retained_intron"         
 [5] "hg19_lncrna_sense_intronic"          
 [6] "hg19_lncrna_snRNA"                   
 [7] "hg19_lncrna_processed_transcript"    
 [8] "hg19_lncrna_snoRNA"                  
 [9] "hg19_lncrna_sense_overlapping"       
[10] "hg19_lncrna_3prime_overlapping_ncrna"
[11] "hg19_lncrna_misc_RNA"                
[12] "hg19_lncrna_rRNA"

And when you go to create any figures, you will use these annotations in the annotation_order parameter.

Hope that helps.