Closed Tim-Yu closed 2 years ago
Hi Tim, thanks for posting this problem. It seems to affect those input GRanges
objects that have a mix of ranges of width 1 and > 1 and are not ordered. I have submitted a fix to the current devel (2.7.2) and release (2.6.1) versions of the package, which will become available in the next 24/48 hours. In the meantime, you can check that the problem doesn't show up if you order your GRanges
object first, i.e., in your example you do:
UORFrange_conservation <- gscores(phast,sort(UORFrange))
UORFrange_conservation_subest_455 <- gscores(phast,sort(UORFrange)[121000:121454])
UORFrange_conservation_subest_10455 <- gscores(phast,sort(UORFrange)[111000:121454])
UORFrange_conservation_subest_50455 <- gscores(phast,sort(UORFrange)[71000:121454])
The fixed versions will check whether the input is unordered and would internally order the input and return the result in the original requested order.
Hi Robert,
Thanks for this quick reply! I think it may still be related to the GRanges size? The sorted GRanges are still giving different scores as shown below. Both the single GRanges and short GRange queries seem to be correct according to the results using bwtool.
UORFrange_sorted <- sort(UORFrange)
UORFrange_conservation_subest_1 <- gscores(phast,UORFrange_sorted[71000:72000])
UORFrange_conservation_subest_2 <- gscores(phast,UORFrange_sorted[71000:100000])
UORFrange_conservation_subest_3 <- gscores(phast,UORFrange_sorted[71000:121454])
gscores(phast,sort(UORFrange)[71000])
head(UORFrange_conservation_subest_1,n = 20)
head(UORFrange_conservation_subest_2,n = 20)
head(UORFrange_conservation_subest_3,n = 20)
GRanges object with 1 range and 6 metadata columns:
seqnames ranges strand | enstranscID type Gene uorf_id conservation default
<Rle> <IRanges> <Rle> | <character> <character> <character> <character> <numeric> <numeric>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.0529412 0.415556
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
GRanges object with 20 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type Gene uorf_id conservation default
<Rle> <IRanges> <Rle> | <character> <character> <character> <character> <numeric> <numeric>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.05294118 0.415556
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF Ggt1 ENSMUST00000131565_82 0.00666667 0.415556
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF Ggt1 ENSMUST00000134503_67 0.04310345 0.381633
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.40000000 0.381633
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.00000000 0.381633
... ... ... ... . ... ... ... ... ... ...
[16] chr10 75573611-75573636 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.0133333 0.107692
[17] chr10 75573611-75573636 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.2055556 0.107692
[18] chr10 75573611-75573636 + | ENSMUST00000143226 nORF Ggt1 ENSMUST00000143226_61 1.0000000 0.107692
[19] chr10 75573611-75573636 + | ENSMUST00000124259 nORF Ggt1 ENSMUST00000124259_56 0.0000000 0.107692
[20] chr10 75573611-75573748 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.6666667 0.034058
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
GRanges object with 20 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type Gene uorf_id conservation default
<Rle> <IRanges> <Rle> | <character> <character> <character> <character> <numeric> <numeric>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.05294118 0.0431034
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF Ggt1 ENSMUST00000131565_82 0.00666667 0.0431034
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF Ggt1 ENSMUST00000134503_67 0.04310345 0.0222222
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.40000000 0.0222222
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.00000000 0.0000000
... ... ... ... . ... ... ... ... ... ...
[16] chr10 75573611-75573636 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.0133333 0.00909091
[17] chr10 75573611-75573636 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.2055556 0.79090909
[18] chr10 75573611-75573636 + | ENSMUST00000143226 nORF Ggt1 ENSMUST00000143226_61 1.0000000 0.47797619
[19] chr10 75573611-75573636 + | ENSMUST00000124259 nORF Ggt1 ENSMUST00000124259_56 0.0000000 0.46848485
[20] chr10 75573611-75573748 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.6666667 0.46848485
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
GRanges object with 20 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type Gene uorf_id conservation default
<Rle> <IRanges> <Rle> | <character> <character> <character> <character> <numeric> <numeric>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.05294118 0.1135802
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF Ggt1 ENSMUST00000131565_82 0.00666667 0.0566667
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF Ggt1 ENSMUST00000134503_67 0.04310345 0.0000000
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.40000000 0.1686869
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.00000000 0.0000000
... ... ... ... . ... ... ... ... ... ...
[16] chr10 75573611-75573636 + | ENSMUST00000151212 nORF Ggt1 ENSMUST00000151212_49 0.0133333 0.000000
[17] chr10 75573611-75573636 + | ENSMUST00000128886 nORF Ggt1 ENSMUST00000128886_61 0.2055556 0.000000
[18] chr10 75573611-75573636 + | ENSMUST00000143226 nORF Ggt1 ENSMUST00000143226_61 1.0000000 0.000000
[19] chr10 75573611-75573636 + | ENSMUST00000124259 nORF Ggt1 ENSMUST00000124259_56 0.0000000 0.159524
[20] chr10 75573611-75573748 + | ENSMUST00000145928 oORF Ggt1 ENSMUST00000145928_89 0.6666667 0.159524
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
I have tried to troubleshoot myself, and I found that a 'smaller' GRanges will give a correct result, but the 'smaller' seems not to be a fixed number. Therefore, it might due to the probability of some pattern showing up in my data is adding up when the number of the ranges is getting high.
Best wishes,
Tim
Hi, using the last fixed version I don't see these differences:
library(GenomicRanges)
load("gscores_output_compare.RData")
UORFrange <- GRanges(UORFrange)
UORFrange_sorted <- sort(UORFrange)
library(GenomicScores)
phast <- getGScores("phastCons60way.UCSC.mm10")
UORFrange_conservation <- gscores(phast,UORFrange_sorted)
UORFrange_conservation_subest_1 <- gscores(phast,UORFrange_sorted[71000:72000])
UORFrange_conservation_subest_2 <- gscores(phast,UORFrange_sorted[71000:100000])
UORFrange_conservation_subest_3 <- gscores(phast,UORFrange_sorted[71000:121454])
gscores(phast,UORFrange_sorted[71000])
GRanges object with 1 range and 6 metadata columns:
seqnames ranges strand | enstranscID type
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF
Gene uorf_id conservation default
<character> <character> <numeric> <numeric>
[1] Ggt1 ENSMUST00000145928_89 0.0529412 0.415556
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
head(UORFrange_conservation_subest_1)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF
[6] chr10 75568722-75568784 + | ENSMUST00000125770 nORF
Gene uorf_id conservation default
<character> <character> <numeric> <numeric>
[1] Ggt1 ENSMUST00000145928_89 0.05294118 0.415556
[2] Ggt1 ENSMUST00000131565_82 0.00666667 0.415556
[3] Ggt1 ENSMUST00000134503_67 0.04310345 0.381633
[4] Ggt1 ENSMUST00000151212_49 0.40000000 0.381633
[5] Ggt1 ENSMUST00000128886_61 0.00000000 0.381633
[6] Ggt1 ENSMUST00000125770_66 0.51538462 0.296825
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
head(UORFrange_conservation_subest_2)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF
[6] chr10 75568722-75568784 + | ENSMUST00000125770 nORF
Gene uorf_id conservation default
<character> <character> <numeric> <numeric>
[1] Ggt1 ENSMUST00000145928_89 0.05294118 0.415556
[2] Ggt1 ENSMUST00000131565_82 0.00666667 0.415556
[3] Ggt1 ENSMUST00000134503_67 0.04310345 0.381633
[4] Ggt1 ENSMUST00000151212_49 0.40000000 0.381633
[5] Ggt1 ENSMUST00000128886_61 0.00000000 0.381633
[6] Ggt1 ENSMUST00000125770_66 0.51538462 0.296825
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
head(UORFrange_conservation_subest_3)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand | enstranscID type
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr10 75568722-75568766 + | ENSMUST00000145928 oORF
[2] chr10 75568722-75568766 + | ENSMUST00000131565 oORF
[3] chr10 75568722-75568770 + | ENSMUST00000134503 nORF
[4] chr10 75568722-75568770 + | ENSMUST00000151212 nORF
[5] chr10 75568722-75568770 + | ENSMUST00000128886 nORF
[6] chr10 75568722-75568784 + | ENSMUST00000125770 nORF
Gene uorf_id conservation default
<character> <character> <numeric> <numeric>
[1] Ggt1 ENSMUST00000145928_89 0.05294118 0.415556
[2] Ggt1 ENSMUST00000131565_82 0.00666667 0.415556
[3] Ggt1 ENSMUST00000134503_67 0.04310345 0.381633
[4] Ggt1 ENSMUST00000151212_49 0.40000000 0.381633
[5] Ggt1 ENSMUST00000128886_61 0.00000000 0.381633
[6] Ggt1 ENSMUST00000125770_66 0.51538462 0.296825
-------
seqinfo: 66 sequences (1 circular) from Genome Reference Consortium GRCm38 genome
Additionally, if I check for the first 1000 values, they are all identical:
try1 <- head(UORFrange_conservation_subest_1$default, 1000)
try2 <- head(UORFrange_conservation_subest_2$default, 1000)
try3 <- head(UORFrange_conservation_subest_3$default, 1000)
max(abs(try1-try2), na.rm=TRUE)
[1] 0
max(abs(try2-try3), na.rm=TRUE)
[1] 0
max(abs(try1-try3), na.rm=TRUE)
[1] 0
Thanks! This update fixed everything. It was me manually sorting GRanges somehow not doing right. After updating, everything is going the right way.
Hi GenomicScores team,
I hope all is well.
I was having fun using your package which is really handy for quick conservation queries, and I found something unusual where when I check the gscores on a large GRanges, the results are not consistent as shown below. The column conservation is when I queried the whole GRange data.
I have attached my demo data here, please have a check. And many thanks for your reading!
Best wishes,
Tim