Unable to read remote bed.bgz

bschilder commented 2 years ago

Hello,

I'm trying to query subsets of Roadmap bed.gz files. However there seems to be some issues here:

Reprex

Code

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
res <- rtracklayer::import(URL)

## same error when using `which`
gr <- GenomicRanges::GRanges("chr4:14737349-16737284")
res <- rtracklayer::import(URL, which=gr)

Console output

Error in .new_IRanges_from_start_end(start, end) : 
  'start' or 'end' cannot contain NAs
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :

Traceback

Error in .new_IRanges_from_start_end(start, end) : 
'start' or 'end' cannot contain NAs
11.
stop(wmsg("'start' or 'end' cannot contain NAs"))
10.
.new_IRanges_from_start_end(start, end)
9.
.new_IRanges(start = start, end = end, width = width)
8.
IRanges(bed$thickStart + 1L, thickEnd)
7.
.local(con, format, text, ...)
6.
import(con, ...)
5.
import(con, ...)
4.
import(FileForFormat(con), ...)
3.
import(FileForFormat(con), ...)
2.
rtracklayer::import(URL)
1.
rtracklayer::import(URL)

Session info

``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.4 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] utf8_1.2.2 reticulate_1.25 R.utils_2.12.0 [4] tidyselect_1.1.2 RSQLite_2.2.16 AnnotationDbi_1.58.0 [7] htmlwidgets_1.5.4 grid_4.2.1 BiocParallel_1.30.3 [10] XGR_1.1.8 munsell_0.5.0 codetools_0.2-18 [13] interp_1.1-3 DT_0.24 colorspace_2.0-3 [16] OrganismDbi_1.38.1 Biobase_2.56.0 filelock_1.0.2 [19] knitr_1.40 supraHex_1.34.0 rstudioapi_0.14 [22] stats4_4.2.1 DescTools_0.99.45 MatrixGenerics_1.8.1 [25] GenomeInfoDbData_1.2.8 bit64_4.0.5 echoconda_0.99.7 [28] basilisk_1.8.1 vctrs_0.4.1 generics_0.1.3 [31] xfun_0.32 biovizBase_1.44.0 BiocFileCache_2.4.0 [34] R6_2.5.1 GenomeInfoDb_1.32.3 AnnotationFilter_1.20.0 [37] bitops_1.0-7 cachem_1.0.6 reshape_0.8.9 [40] DelayedArray_0.22.0 assertthat_0.2.1 BiocIO_1.6.0 [43] scales_1.2.1 nnet_7.3-17 rootSolve_1.8.2.3 [46] gtable_0.3.0 lmom_2.9 ggbio_1.44.1 [49] ensembldb_2.20.2 rlang_1.0.4 echodata_0.99.12 [52] splines_4.2.1 lazyeval_0.2.2 rtracklayer_1.57.0 [55] dichromat_2.0-0.1 hexbin_1.28.2 checkmate_2.1.0 [58] BiocManager_1.30.18 yaml_2.3.5 reshape2_1.4.4 [61] GenomicFeatures_1.48.3 ggnetwork_0.5.10 backports_1.4.1 [64] Hmisc_4.7-1 RBGL_1.72.0 tools_4.2.1 [67] ggplot2_3.3.6 ellipsis_0.3.2 RColorBrewer_1.1-3 [70] proxy_0.4-27 BiocGenerics_0.42.0 Rcpp_1.0.9 [73] plyr_1.8.7 base64enc_0.1-3 progress_1.2.2 [76] zlibbioc_1.42.0 purrr_0.3.4 RCurl_1.98-1.8 [79] basilisk.utils_1.8.0 prettyunits_1.1.1 rpart_4.1.16 [82] deldir_1.0-6 S4Vectors_0.34.0 SummarizedExperiment_1.26.1 [85] ggrepel_0.9.1 cluster_2.1.4 fs_1.5.2 [88] crul_1.2.0 magrittr_2.0.3 data.table_1.14.2 [91] echotabix_0.99.7 dnet_1.1.7 openxlsx_4.2.5 [94] mvtnorm_1.1-3 ProtGenerics_1.28.0 matrixStats_0.62.0 [97] patchwork_1.1.2 hms_1.1.2 evaluate_0.16 [100] XML_3.99-0.10 jpeg_0.1-9 readxl_1.4.1 [103] IRanges_2.30.1 gridExtra_2.3 compiler_4.2.1 [106] biomaRt_2.52.0 tibble_3.1.8 crayon_1.5.1 [109] R.oo_1.25.0 htmltools_0.5.3 echoannot_0.99.7 [112] tzdb_0.3.0 Formula_1.2-4 tidyr_1.2.0 [115] expm_0.999-6 Exact_3.1 DBI_1.1.3 [118] dbplyr_2.2.1 MASS_7.3-58.1 rappdirs_0.3.3 [121] boot_1.3-28 Matrix_1.4-1 readr_2.1.2 [124] piggyback_0.1.4 cli_3.3.0 R.methodsS3_1.8.2 [127] parallel_4.2.1 igraph_1.3.4 GenomicRanges_1.48.0 [130] pkgconfig_2.0.3 GenomicAlignments_1.32.1 dir.expiry_1.4.0 [133] RCircos_1.2.2 foreign_0.8-82 osfr_0.2.8 [136] xml2_1.3.3 XVector_0.36.0 stringr_1.4.1 [139] VariantAnnotation_1.42.1 digest_0.6.29 graph_1.74.0 [142] httpcode_0.3.0 Biostrings_2.64.1 rmarkdown_2.16 [145] cellranger_1.1.0 htmlTable_2.4.1 gld_2.6.5 [148] restfulr_0.0.15 curl_4.3.2 Rsamtools_2.12.0 [151] rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-159 [154] jsonlite_1.8.0 BSgenome_1.64.0 fansi_1.0.3 [157] pillar_1.8.1 lattice_0.20-45 GGally_2.1.2 [160] KEGGREST_1.36.3 fastmap_1.1.0 httr_1.4.4 [163] survival_3.4-0 glue_1.6.2 zip_2.2.0 [166] png_0.1-7 bit_4.0.4 Rgraphviz_2.40.0 [169] class_7.3-20 stringi_1.7.8 blob_1.2.3 [172] latticeExtra_0.6-30 memoise_2.0.1 dplyr_1.0.9 [175] e1071_1.7-11 ape_5.6-2 ```

Many thanks in advance, Brian

bschilder commented 2 years ago

It seems that the issue only occurs when the file is remote. Downloading the entire file first does technically circumvent the issue, though this approach isn't ideal:

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"

tmp <- file.path(tempfile(),basename(URL))
dir.create(dirname(tmp),showWarnings = FALSE, recursive = TRUE)
utils::download.file(url = URL, destfile = tmp)
res <- rtracklayer::import(tmp)

Screenshot 2022-09-04 at 15 56 36

bschilder commented 1 year ago

@lawremi @sanchit-saini, this error is still persisting. Do you know what might be happening here?

bschilder commented 1 year ago

@hpages I know you're not the primary maintainer for this repo, but I was wondering if you nevertheless have any ideas about this. Thank you!

hpages commented 1 year ago

Hi @bschilder , I didn't investigate much but this is how rtracklayer::import() reads this remote BED file:

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
con1 <- gzcon(url(URL, open="rb"), text=TRUE)
bedNames <- c("chrom", "start", "end", "name", "score", "strand", "thickStart", "thickEnd", "itemRgb")
bedClasses <- c("character", "integer", "integer", "character", "numeric", "character", "integer", "integer", "character")
bed1 <- read.table(con1, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
# Warning message:
# In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
#   number of items read is not a multiple of the number of columns
colnames(bed1) <- bedNames

Major problem is that bed1 only has 1127 rows!

dim(bed1)
# [1] 1127    9

Minor problem is that the last row has an unexpected NA in the thickEnd column:

tail(bed1)
#      chrom   start     end     name score strand thickStart thickEnd
# 1122 chr10 8459800 8462200   5_TxWk     0   <NA>    8459800  8462200
# 1123 chr10 8462200 8462400    7_Enh     0   <NA>    8462200  8462400
# 1124 chr10 8462400 8477200 15_Quies     0   <NA>    8462400  8477200
# 1125 chr10 8477200 8483600    7_Enh     0   <NA>    8477200  8483600
# 1126 chr10 8483600 8491600   5_TxWk     0   <NA>    8483600  8491600
# 1127 chr10 8491600 8492800    7_Enh     0   <NA>     849160       NA
#          itemRgb
# 1122     0,100,0
# 1123   255,255,0
# 1124 255,255,255
# 1125   255,255,0
# 1126     0,100,0
# 1127

Compare this to what rtracklayer::import() does to read the file locally:

con2 <- gzfile("E099_15_coreMarks_dense.bed.bgz", open="r")
bed2 <- read.table(con2, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
colnames(bed2) <- bedNames

dim(bed2)
# [1] 265577      9

anyNA(bed2$thickEnd)
# [1] FALSE

So it looks to me that read.table() is somehow broken on a connection obtained with gzcon(url(URL, open="rb"), text=TRUE).

bschilder commented 1 year ago

Hi @hpages thanks for taking a look at this, this definitely gives me some clues.

I actually found the below approach does a decent job. Of course, this is just a temporary solution as it still has to download the whole file first, which is less efficient than making a query. But at least this approach consistently reads in all 265,577 rows , and can handle situations where there's some NAs.

path <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"

tmp <- file.path(tempdir(), basename(path))
        ## Only download it if you have to
        if(!file.exists(tmp)){
            utils::download.file(url = path, 
                                 destfile = tmp) 
        } 
 dat <- data.table::fread(text=readLines(con = tmp))
gr <- GenomicRanges::makeGRangesFromDataFrame(df = dat, seqnames.field = "V1", start.field = "V2", end.field = "V3", keep.extra.columns = TRUE)

Minor problem is that the last row has an unexpected NA in the thickEnd column:

rtracklayer seems to get hung up on the NA. This doesn't seem to be an issue with creating the GRanges object manually (see above example), so i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error? @sanchit-saini

hpages commented 1 year ago

i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?

That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.

One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in rtracklayer::import().

FWIW I adopted that approach in GenomeInfoDb::getChromInfoFromNCBI() last year (see https://github.com/Bioconductor/GenomeInfoDb/commit/58c8310ca45791c8ca5217df3598be9f23cb3168). Works a lot better now!

bschilder commented 1 year ago

i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?

That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.

Ah ok, I see what you mean. Very strange that the NA gets introduced somehow.

Screenshot 2022-10-26 at 10 03 47

One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in rtracklayer::import(). FWIW I adopted that approach in GenomeInfoDb::getChromInfoFromNCBI() last year (see Bioconductor/GenomeInfoDb@58c8310). Works a lot better now!

Nice! Thanks for sharing this. @sanchit-saini @lawremi something along these lines would be a great feature forrtracklayer

lawremi / rtracklayer