rcavalcante / annotatr

Package Homepage: http://bioconductor.org/packages/devel/bioc/html/annotatr.html Bug Reports: https://support.bioconductor.org/p/new/post/?tag_val=annotatr.
26 stars 8 forks source link

Build custom annotation for bovine genome #24

Closed Surajuvm closed 4 years ago

Surajuvm commented 5 years ago

Hi, I'm planning to use annotatr to annotate differentially methylated regions into bovine ARS-UCD1.2 genome assembly. With my limited knowledge (after reading the vignette) I understand that annotatr can be customized and used for annotating genome other than the ones already present in the package.

I would really appreciate if you could suggest me how to get started with this.

Thank you, Suraj

rcavalcante commented 5 years ago

Hi Suraj,

Thanks for your interest in annotatr.

The main thing you need for custom annotations is a bed file. This can be a simple BED3 file (chr, start, end), or it can be BED6+ with extra columns to define gene_id, symbol, tx_id. You don't necessarily need all three, but can have any combination of them.

Check out the particular documentation here for the read_annotations function, or do ?read_annotations in your R session after loading annotatr.

The vignette mentions how to reference your custom annotations after you build them, so give it a shot with one of your BED files. If you have trouble, I can try to help troubleshoot. Just let me know what version of annotatr you're using when/if the time comes.

Thanks, Raymond

Surajuvm commented 4 years ago

Hello Raymond,

Thank you for replying. I tried read_annotations and got the errors listed below. I would appreciate your help to fix this. I'm using annotatr -v1.8.0

library(annotatr) setwd("~/differential methylation/IL")

### Using 5UTR annotation bed file of ARS-UCD1.2 genome to create a custom annotation 

utr5_file <- read.table("5UTR", header = F)
utr5_file <- as.data.frame(utr5_file)
### head(utr5_file)
### V1        V2        V3                                      V4 V5 V6
### 1 chr1  67120159  67120252   NM_001014911_utr5_7_0_chr1_67120160_r  0  -
###  2 chr1 134552745 134553122 NM_001192829_utr5_15_0_chr1_134552746_r  0  -
###  3 chr1 109400972 109400974  NM_001075404_utr5_8_0_chr1_109400973_r  0  -
###   4 chr1 109413651 109413780  NM_001075404_utr5_9_0_chr1_109413652_r  0  -
###   5 chr1   6112135   6112446    NM_001102514_utr5_0_0_chr1_6112136_f  0  +
###   6 chr1  10231784  10231920   NM_001076796_utr5_0_0_chr1_10231785_f  0  +

read_annotations(con = utr5_file, name = "5UTR", genome = "Cow", format = "bed")

When I run the read_annotations code, I get the error Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘import’ for signature ‘"data.frame", "character", "missing"’

Please let me know if there is a quick fix for this.

Thank you, Suraj

rcavalcante commented 4 years ago

Hi Suraj,

From what you've shown me, it looks like you're passing a data.frame to the con argument of read_annotations() instead of the the path to a file. From the documentation of the function:

...
Usage:

     read_annotations(con, name, genome = NA, format, extraCols = character(),
       ...)

Arguments:

     con: A path, URL, connection or BEDFile object. See
          ‘rtracklayer::import.bed()’ documentation.
...

You should pass the path to the 5UTR file you want to use.

Thanks, Raymond

Surajuvm commented 4 years ago

Thank you, Raymond. I was able to build the custom annotation using the path to my annotation file. Now, I am having issues with reading the genomic regions. My differential methylation result looks like this

chr1 587399 + CG 0.00187391 1.01509e-05 0.0362424 66 20 96 8
chr1 1403687 + CG 0.00710123 1.00817e-05 0.0361172 23 18 28 28
chr1 7799358 + CG 0.00210249 1.12248e-06 0.0115237 57 40 38 14
chr1 10393220 - CG 0.00469904 7.67492e-06 0.0316831 13 13 58 41
chr1 10709931 + CG 0.00149387 1.2945e-05 0.0397307 58 47 94 91
chr1 16735688 + CG 0.00211751 2.07757e-05 0.0481334 22 21 33 15
chr1 16735689 - CG 0.00527209 2.07757e-05 0.0481334 27 10 93 64
chr1 18151072 - CG 0.00745319 4.57625e-06 0.0246173 23 19 35 6
chr1 22895661 - CG 0.00206324 8.76838e-08 0.00259897 45 13 80 64
chr1 22895813 + CG 0.00246523 3.22784e-08 0.00135523 24 0 25 10

It contains the chromosome number, position of specific cytosines, strand, CG context, three different p-values, total C in controls, mC in controls, total C in cases, mC in cases.

I tried the following code to read this file

dm_sites <- read_regions(con = "dm_IL.chr.adjusted.pval.FDR0.05.bed", genome = 'Cow', format = 'bed')

I got the error

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 11 elements

Is the error because my file contains only one site as opposed to start/end sites in bed files?\ Is there a way to fix this.

Thank you, Suraj

rcavalcante commented 4 years ago

Hi Suraj,

Yes, this is not a proper BED file (see here), which is what read_regions() and the function it wraps (rtracklayer::import()) expects.

You can use a tool like awk to duplicate the second column and insert it as the third, and get rid of the columns you don't really need. I don't know what the columns represent in this case, so that's up to you.

Raymond

Surajuvm commented 4 years ago

Thank you Raymond for pointing it out. I re-formatted my data and it looks like this now

CHR START   END STRAND  CONTEXT Pvalue
1   chr1    587399  587399  +   CG  0.0362424
2   chr1    1403687 1403687 +   CG  0.0361172
3   chr1    7799358 7799358 +   CG  0.0115237
4   chr1    10393220    10393220    -   CG  0.0316831
5   chr1    10709931    10709931    +   CG  0.0397307
6   chr1    16735688    16735688    +   CG  0.0481334
7   chr1    16735689    16735689    -   CG  0.0481334
8   chr1    18151072    18151072    -   CG  0.0246173
9   chr1    22895661    22895661    -   CG  0.00259897

With this, I ran the code

dm_sites_IL = read_regions(con = "dm_sites_IL_FDR.bed", genome = 'Cow', format = 'bed')

I got the following error

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got 'chr1'

Please let me know if there is a fix for this.

Thank you, Suraj

Surajuvm commented 4 years ago

Never mind. I figured this one out. The row numbers were causing errors.

Surajuvm commented 4 years ago

Hello Raymond,

I got the read_regions to work and created a GRanges data. My file looks like this

> print(dm_sites_IL)
GRanges object with 2125 ranges and 2 metadata columns:
         seqnames              ranges strand |        name     score
            <Rle>           <IRanges>  <Rle> | <character> <numeric>
     [1]     chr1       587400-587399      * |           + 0.0362424
     [2]     chr1     1403688-1403687      * |           + 0.0361172
     [3]     chr1     7799359-7799358      * |           + 0.0115237
     [4]     chr1   10393221-10393220      * |           - 0.0316831
     [5]     chr1   10709932-10709931      * |           + 0.0397307
     ...      ...                 ...    ... .         ...       ...
  [2121]    chr28   45768859-45768858      * |           - 0.0466007
  [2122]    chr28   45768915-45768914      * |           - 0.0466007
  [2123]    chr29     7628208-7628207      * |           + 0.0460903
  [2124]    chr29   20447590-20447589      * |           + 0.0109041
  [2125]     chrX 137871896-137871895      * |           +   0.03187
  -------
  seqinfo: 30 sequences from Cow genome; no seqlengths

I also made different annotation files

 print(annotatr_cache$list_env())
[1] "Cow_custom_3UTR"       "Cow_custom_5UTR"       "Cow_custom_CDS"        "Cow_custom_Downstream"
[5] "Cow_custom_Introns"    "Cow_custom_Promoter"

Next, I wanted to intersect my differential methylation file with the annotation files.

annots = c("Cow_custom_3UTR", "Cow_custom_5UTR" , "Cow_custom_CDS", "Cow_custom_Downstream", "Cow_custom_Introns", "Cow_custom_Promoter")

#### Build the annotations (a single GRanges object)
annotations = build_annotations(genome = 'Cow', annotations = annots)
####Intersect the regions we read in with the annotations
dm_annotated = annotate_regions(
    regions = dm_sites_IL,
    annotations = annotations,
    ignore.strand = TRUE,
    quiet = FALSE)

At this point, I get error

Annotating...
Error in annotate_regions(regions = dm_sites_IL, annotations = annotations,  : 
  No annotations intersect the regions.

It says that no annotations intersect with the regions in my differential methylation file; but I manually checked few of them and there are sites which are common.

I would appreciate your input here.

Thank you, Suraj

rcavalcante commented 4 years ago

Hi Suraj,

Good news that you got both the regions read in and the custom annotations built.

It would be helpful if you could provide some examples of where, upon your inspection, there should be overlaps but aren't.

Thanks, Raymond

Surajuvm commented 4 years ago

For example, this is a region of my differentially methylated sites

chr6    29591030    29591030    -   0.0177636
chr6    40082995    40082995    +   0.00793686
chr6    40156410    40156410    -   0.00143386
chr6    54845641    54845641    +   0.0395046
chr6    55069007    55069007    -   0.0404003
chr6    64595848    64595848    +   0.0295995
chr6    64595849    64595849    -   0.0295995
chr6    67577905    67577905    +   0.0495847
chr6    67578058    67578058    +   0.0495847

Let's track the 2nd row; chr6 40082995

This is a region of a annotation file for cds.

chr6    40082609    40082753    NM_001191516_cds_13_0_chr6_40082610_f   0   +
chr6    40082865    40083029    NM_001191516_cds_14_0_chr6_40082866_f   0   +
chr6    40083995    40084019    NM_001191516_cds_15_0_chr6_40083996_f   0   +
chr6    40087803    40087954    NM_001191516_cds_16_0_chr6_40087804_f   0   +
chr6    40091587    40091662    NM_001191516_cds_17_0_chr6_40091588_f   0   +
chr6    40093232    40093376    NM_001191516_cds_18_0_chr6_40093233_f   0   +
chr6    40098898    40099042    NM_001191516_cds_19_0_chr6_40098899_f   0   +
chr6    40100533    40100700    NM_001191516_cds_20_0_chr6_40100534_f   0   +

Here, the second row chr6 40082865 should intersect the above position present in my differentially methylated sites file.

I tried running my codes again, this time I got a different error.

Annotating...
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence",  : 
  sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX have incompatible genomes:
  - in 'x': Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow
  - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19

Thank you

rcavalcante commented 4 years ago

Hi,

Can you please provide all the code that you have run? Including reading the custom annotations, building the custom annotations, reading the regions, and then intersecting?

Somehow your annotations are given the genome of hg19 which is why the above error is given. You can't check for overlaps in two GRanges objects if their genomes are different.

As to the intersection problem, this is perplexing. The annotate_regions() function really just does:

intersections = GenomicRanges::findOverlaps(regions, annotations, minoverlap = minoverlap, ignore.strand = ignore.strand)

And here is a toy example with the exact example you gave:

test1 = GRanges(seqnames = 'chr6', ranges = IRanges(start = 40082995, end = 40082995), strand = '+')

test2 = GRanges(seqnames = 'chr6', ranges = IRanges(start = 40082865, end = 40083029), strand = '+')

findOverlaps(test1, test2, minoverlap = 1L) # 1L is the default

# Results in what we'd expect
# Hits object with 1 hit and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   -------
#   queryLength: 1 / subjectLength: 1

Did you perhaps change the minoverlap parameter in the call to annotate_regions()?

Thanks, Raymond

Surajuvm commented 4 years ago

Do you think it will be easier if I email you R script along with the subset of my sample files?

rcavalcante commented 4 years ago

Sure. rcavalca@umich.edu

rcavalcante commented 4 years ago

Hi Suraj,

I understand what the problem is, notice that after reading the regions in, the start of the range is after the end of the range, which isn't valid, but also doesn't throw an error. I also realize that the same is shown above when you showed me the output, I just missed it.

> dm_sites_IL <- read_regions(con = "dm_sites_IL_FDR.txt", genome = 'Cow', format = 'bed')
> dm_sites_IL
GRanges object with 2125 ranges and 2 metadata columns:
         seqnames              ranges strand |        name     score
            <Rle>           <IRanges>  <Rle> | <character> <numeric>
     [1]     chr1       587400-587399      * |           + 0.0362424
     [2]     chr1     1403688-1403687      * |           + 0.0361172
     [3]     chr1     7799359-7799358      * |           + 0.0115237
     [4]     chr1   10393221-10393220      * |           - 0.0316831
     [5]     chr1   10709932-10709931      * |           + 0.0397307
     ...      ...                 ...    ... .         ...       ...
  [2121]    chr28   45768859-45768858      * |           - 0.0466007
  [2122]    chr28   45768915-45768914      * |           - 0.0466007
  [2123]    chr29     7628208-7628207      * |           + 0.0460903
  [2124]    chr29   20447590-20447589      * |           + 0.0109041
  [2125]     chrX 137871896-137871895      * |           +   0.03187
  -------
  seqinfo: 30 sequences from Cow genome; no seqlengths

So now we come to the dreaded 0- or 1-based discussion which is a consistent annoyance. BED files are, according to their definition, 0-based. So the rtracklayer::import function adds 1 to the start of a range. Hence if start = end, you end up with the situation we're in.

I'm not sure if the file you started with is 0- or 1-based, but depending, after you read in the data you can do:

  1. If the original coordinates are 0-based,
    library(GenomicRanges
    end(dm_sites_IL) = end(dm_sites_IL) + 1
  2. If the original coordinates are 1-based,
    library(GenomicRanges
    start(dm_sites_IL) = start(dm_sites_IL) - 1

And you do the hokey pokey and you turn yourself around.

Sorry, I haven't encountered this particular issue in a while, so it wasn't top of mind. I tested your regions by doing both of the above and the problem appears resolved.

If this solves your problem as well, go ahead and close the issue.

Thanks, and sorry for the trouble, Raymond

Surajuvm commented 4 years ago

Thank you Raymond.