Closed Surajuvm closed 4 years ago
Hi Suraj,
Thanks for your interest in annotatr.
The main thing you need for custom annotations is a bed file. This can be a simple BED3 file (chr, start, end), or it can be BED6+ with extra columns to define gene_id
, symbol
, tx_id
. You don't necessarily need all three, but can have any combination of them.
Check out the particular documentation here for the read_annotations
function, or do ?read_annotations
in your R session after loading annotatr.
The vignette mentions how to reference your custom annotations after you build them, so give it a shot with one of your BED files. If you have trouble, I can try to help troubleshoot. Just let me know what version of annotatr you're using when/if the time comes.
Thanks, Raymond
Hello Raymond,
Thank you for replying. I tried read_annotations and got the errors listed below. I would appreciate your help to fix this. I'm using annotatr -v1.8.0
library(annotatr) setwd("~/differential methylation/IL")
### Using 5UTR annotation bed file of ARS-UCD1.2 genome to create a custom annotation
utr5_file <- read.table("5UTR", header = F)
utr5_file <- as.data.frame(utr5_file)
### head(utr5_file)
### V1 V2 V3 V4 V5 V6
### 1 chr1 67120159 67120252 NM_001014911_utr5_7_0_chr1_67120160_r 0 -
### 2 chr1 134552745 134553122 NM_001192829_utr5_15_0_chr1_134552746_r 0 -
### 3 chr1 109400972 109400974 NM_001075404_utr5_8_0_chr1_109400973_r 0 -
### 4 chr1 109413651 109413780 NM_001075404_utr5_9_0_chr1_109413652_r 0 -
### 5 chr1 6112135 6112446 NM_001102514_utr5_0_0_chr1_6112136_f 0 +
### 6 chr1 10231784 10231920 NM_001076796_utr5_0_0_chr1_10231785_f 0 +
read_annotations(con = utr5_file, name = "5UTR", genome = "Cow", format = "bed")
When I run the read_annotations code, I get the error Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘import’ for signature ‘"data.frame", "character", "missing"’
Please let me know if there is a quick fix for this.
Thank you, Suraj
Hi Suraj,
From what you've shown me, it looks like you're passing a data.frame
to the con
argument of read_annotations()
instead of the the path to a file. From the documentation of the function:
...
Usage:
read_annotations(con, name, genome = NA, format, extraCols = character(),
...)
Arguments:
con: A path, URL, connection or BEDFile object. See
‘rtracklayer::import.bed()’ documentation.
...
You should pass the path to the 5UTR file you want to use.
Thanks, Raymond
Thank you, Raymond. I was able to build the custom annotation using the path to my annotation file. Now, I am having issues with reading the genomic regions. My differential methylation result looks like this
chr1 587399 + CG 0.00187391 1.01509e-05 0.0362424 66 20 96 8
chr1 1403687 + CG 0.00710123 1.00817e-05 0.0361172 23 18 28 28
chr1 7799358 + CG 0.00210249 1.12248e-06 0.0115237 57 40 38 14
chr1 10393220 - CG 0.00469904 7.67492e-06 0.0316831 13 13 58 41
chr1 10709931 + CG 0.00149387 1.2945e-05 0.0397307 58 47 94 91
chr1 16735688 + CG 0.00211751 2.07757e-05 0.0481334 22 21 33 15
chr1 16735689 - CG 0.00527209 2.07757e-05 0.0481334 27 10 93 64
chr1 18151072 - CG 0.00745319 4.57625e-06 0.0246173 23 19 35 6
chr1 22895661 - CG 0.00206324 8.76838e-08 0.00259897 45 13 80 64
chr1 22895813 + CG 0.00246523 3.22784e-08 0.00135523 24 0 25 10
It contains the chromosome number, position of specific cytosines, strand, CG context, three different p-values, total C in controls, mC in controls, total C in cases, mC in cases.
I tried the following code to read this file
dm_sites <- read_regions(con = "dm_IL.chr.adjusted.pval.FDR0.05.bed", genome = 'Cow', format = 'bed')
I got the error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 11 elements
Is the error because my file contains only one site as opposed to start/end sites in bed files?\ Is there a way to fix this.
Thank you, Suraj
Hi Suraj,
Yes, this is not a proper BED file (see here), which is what read_regions()
and the function it wraps (rtracklayer::import()
) expects.
You can use a tool like awk
to duplicate the second column and insert it as the third, and get rid of the columns you don't really need. I don't know what the columns represent in this case, so that's up to you.
Raymond
Thank you Raymond for pointing it out. I re-formatted my data and it looks like this now
CHR START END STRAND CONTEXT Pvalue
1 chr1 587399 587399 + CG 0.0362424
2 chr1 1403687 1403687 + CG 0.0361172
3 chr1 7799358 7799358 + CG 0.0115237
4 chr1 10393220 10393220 - CG 0.0316831
5 chr1 10709931 10709931 + CG 0.0397307
6 chr1 16735688 16735688 + CG 0.0481334
7 chr1 16735689 16735689 - CG 0.0481334
8 chr1 18151072 18151072 - CG 0.0246173
9 chr1 22895661 22895661 - CG 0.00259897
With this, I ran the code
dm_sites_IL = read_regions(con = "dm_sites_IL_FDR.bed", genome = 'Cow', format = 'bed')
I got the following error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got 'chr1'
Please let me know if there is a fix for this.
Thank you, Suraj
Never mind. I figured this one out. The row numbers were causing errors.
Hello Raymond,
I got the read_regions to work and created a GRanges data. My file looks like this
> print(dm_sites_IL)
GRanges object with 2125 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr1 587400-587399 * | + 0.0362424
[2] chr1 1403688-1403687 * | + 0.0361172
[3] chr1 7799359-7799358 * | + 0.0115237
[4] chr1 10393221-10393220 * | - 0.0316831
[5] chr1 10709932-10709931 * | + 0.0397307
... ... ... ... . ... ...
[2121] chr28 45768859-45768858 * | - 0.0466007
[2122] chr28 45768915-45768914 * | - 0.0466007
[2123] chr29 7628208-7628207 * | + 0.0460903
[2124] chr29 20447590-20447589 * | + 0.0109041
[2125] chrX 137871896-137871895 * | + 0.03187
-------
seqinfo: 30 sequences from Cow genome; no seqlengths
I also made different annotation files
print(annotatr_cache$list_env())
[1] "Cow_custom_3UTR" "Cow_custom_5UTR" "Cow_custom_CDS" "Cow_custom_Downstream"
[5] "Cow_custom_Introns" "Cow_custom_Promoter"
Next, I wanted to intersect my differential methylation file with the annotation files.
annots = c("Cow_custom_3UTR", "Cow_custom_5UTR" , "Cow_custom_CDS", "Cow_custom_Downstream", "Cow_custom_Introns", "Cow_custom_Promoter")
#### Build the annotations (a single GRanges object)
annotations = build_annotations(genome = 'Cow', annotations = annots)
####Intersect the regions we read in with the annotations
dm_annotated = annotate_regions(
regions = dm_sites_IL,
annotations = annotations,
ignore.strand = TRUE,
quiet = FALSE)
At this point, I get error
Annotating...
Error in annotate_regions(regions = dm_sites_IL, annotations = annotations, :
No annotations intersect the regions.
It says that no annotations intersect with the regions in my differential methylation file; but I manually checked few of them and there are sites which are common.
I would appreciate your input here.
Thank you, Suraj
Hi Suraj,
Good news that you got both the regions read in and the custom annotations built.
It would be helpful if you could provide some examples of where, upon your inspection, there should be overlaps but aren't.
Thanks, Raymond
For example, this is a region of my differentially methylated sites
chr6 29591030 29591030 - 0.0177636
chr6 40082995 40082995 + 0.00793686
chr6 40156410 40156410 - 0.00143386
chr6 54845641 54845641 + 0.0395046
chr6 55069007 55069007 - 0.0404003
chr6 64595848 64595848 + 0.0295995
chr6 64595849 64595849 - 0.0295995
chr6 67577905 67577905 + 0.0495847
chr6 67578058 67578058 + 0.0495847
Let's track the 2nd row; chr6 40082995
This is a region of a annotation file for cds.
chr6 40082609 40082753 NM_001191516_cds_13_0_chr6_40082610_f 0 +
chr6 40082865 40083029 NM_001191516_cds_14_0_chr6_40082866_f 0 +
chr6 40083995 40084019 NM_001191516_cds_15_0_chr6_40083996_f 0 +
chr6 40087803 40087954 NM_001191516_cds_16_0_chr6_40087804_f 0 +
chr6 40091587 40091662 NM_001191516_cds_17_0_chr6_40091588_f 0 +
chr6 40093232 40093376 NM_001191516_cds_18_0_chr6_40093233_f 0 +
chr6 40098898 40099042 NM_001191516_cds_19_0_chr6_40098899_f 0 +
chr6 40100533 40100700 NM_001191516_cds_20_0_chr6_40100534_f 0 +
Here, the second row chr6 40082865 should intersect the above position present in my differentially methylated sites file.
I tried running my codes again, this time I got a different error.
Annotating...
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence", :
sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX have incompatible genomes:
- in 'x': Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow, Cow
- in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19
Thank you
Hi,
Can you please provide all the code that you have run? Including reading the custom annotations, building the custom annotations, reading the regions, and then intersecting?
Somehow your annotations are given the genome of hg19 which is why the above error is given. You can't check for overlaps in two GRanges objects if their genomes are different.
As to the intersection problem, this is perplexing. The annotate_regions()
function really just does:
intersections = GenomicRanges::findOverlaps(regions, annotations, minoverlap = minoverlap, ignore.strand = ignore.strand)
And here is a toy example with the exact example you gave:
test1 = GRanges(seqnames = 'chr6', ranges = IRanges(start = 40082995, end = 40082995), strand = '+')
test2 = GRanges(seqnames = 'chr6', ranges = IRanges(start = 40082865, end = 40083029), strand = '+')
findOverlaps(test1, test2, minoverlap = 1L) # 1L is the default
# Results in what we'd expect
# Hits object with 1 hit and 0 metadata columns:
# queryHits subjectHits
# <integer> <integer>
# [1] 1 1
# -------
# queryLength: 1 / subjectLength: 1
Did you perhaps change the minoverlap
parameter in the call to annotate_regions()
?
Thanks, Raymond
Do you think it will be easier if I email you R script along with the subset of my sample files?
Sure. rcavalca@umich.edu
Hi Suraj,
I understand what the problem is, notice that after reading the regions in, the start of the range is after the end of the range, which isn't valid, but also doesn't throw an error. I also realize that the same is shown above when you showed me the output, I just missed it.
> dm_sites_IL <- read_regions(con = "dm_sites_IL_FDR.txt", genome = 'Cow', format = 'bed')
> dm_sites_IL
GRanges object with 2125 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] chr1 587400-587399 * | + 0.0362424
[2] chr1 1403688-1403687 * | + 0.0361172
[3] chr1 7799359-7799358 * | + 0.0115237
[4] chr1 10393221-10393220 * | - 0.0316831
[5] chr1 10709932-10709931 * | + 0.0397307
... ... ... ... . ... ...
[2121] chr28 45768859-45768858 * | - 0.0466007
[2122] chr28 45768915-45768914 * | - 0.0466007
[2123] chr29 7628208-7628207 * | + 0.0460903
[2124] chr29 20447590-20447589 * | + 0.0109041
[2125] chrX 137871896-137871895 * | + 0.03187
-------
seqinfo: 30 sequences from Cow genome; no seqlengths
So now we come to the dreaded 0- or 1-based discussion which is a consistent annoyance. BED files are, according to their definition, 0-based. So the rtracklayer::import
function adds 1 to the start of a range. Hence if start = end, you end up with the situation we're in.
I'm not sure if the file you started with is 0- or 1-based, but depending, after you read in the data you can do:
library(GenomicRanges
end(dm_sites_IL) = end(dm_sites_IL) + 1
library(GenomicRanges
start(dm_sites_IL) = start(dm_sites_IL) - 1
And you do the hokey pokey and you turn yourself around.
Sorry, I haven't encountered this particular issue in a while, so it wasn't top of mind. I tested your regions by doing both of the above and the problem appears resolved.
If this solves your problem as well, go ahead and close the issue.
Thanks, and sorry for the trouble, Raymond
Thank you Raymond.
Hi, I'm planning to use annotatr to annotate differentially methylated regions into bovine ARS-UCD1.2 genome assembly. With my limited knowledge (after reading the vignette) I understand that annotatr can be customized and used for annotating genome other than the ones already present in the package.
I would really appreciate if you could suggest me how to get started with this.
Thank you, Suraj