ropensci / jqr

R interface to jq
https://docs.ropensci.org/jqr

protection stack overflow errors on medium(?)-sized inputs #76

Open mmuurr opened 5 years ago

mmuurr commented 5 years ago

When using jqr on largish inputs, I'm finding frequent protection stack overflow errors. Here's a simple reprex with 100,000 input strings (on my machine):

```
> foo <- replicate(100e3, sprintf('{"a":"A","b":"B","c":"C"}')) %>%
    jqr::jq("{a,b,c}")
Error: protect(): protection stack overflow
```

With 10,000 inputs, no error:

```
> replicate(10e3, sprintf('{"a":"A","b":"B","c":"C"}')) %>%
    jqr::jq("{a,b,c}") %>%
    str()
 'jqson' chr [1:10000] "{\"a\":\"A\",\"b\":\"B\",\"c\":\"C\"}" "{\"a\":\"A\",\"b\":\"B\",\"c\":\"C\"}" "{\"a\":\"A\",\"b\":\"B\",\"c\":\"C\"}" ...
```

Any ideas?

Session Info

```
Session info -----------------------------------------------------------
 setting  value
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin16.7.0
 ui       unknown
 language (EN)
 collate  en_US.UTF-8
 tz       America/Denver
 date     2019-01-18

Packages ---------------------------------------------------------------
 package       * version     date       source
 assertthat      0.2.0       2017-04-11 CRAN (R 3.5.1)
 backports       1.1.2       2017-12-13 CRAN (R 3.5.1)
 base          * 3.5.1       2018-07-03 local
 bindr           0.1.1       2018-03-13 CRAN (R 3.5.1)
 bindrcpp        0.2.2       2018-03-29 CRAN (R 3.5.1)
 broom           0.5.0       2018-07-17 CRAN (R 3.5.1)
 cellranger      1.1.0       2016-07-27 CRAN (R 3.5.1)
 cli             1.0.0       2017-11-05 CRAN (R 3.5.1)
 colorspace      1.3-2       2016-12-14 CRAN (R 3.5.1)
 compiler        3.5.1       2018-07-03 local
 craftsy.utils   1.3.1       2018-08-29 local
 crayon          1.3.4       2017-09-16 CRAN (R 3.5.1)
 datasets      * 3.5.1       2018-07-03 local
 devtools        1.13.6      2018-06-27 CRAN (R 3.5.1)
 digest          0.6.18      2018-10-10 cran (@0.6.18)
 dplyr         * 0.7.6       2018-06-29 CRAN (R 3.5.1)
 forcats       * 0.3.0       2018-02-19 CRAN (R 3.5.1)
 ggplot2       * 3.0.0       2018-07-03 CRAN (R 3.5.1)
 glue            1.3.0       2018-07-17 CRAN (R 3.5.1)
 graphics      * 3.5.1       2018-07-03 local
 grDevices     * 3.5.1       2018-07-03 local
 grid            3.5.1       2018-07-03 local
 gtable          0.2.0       2016-02-26 CRAN (R 3.5.1)
 haven           1.1.2       2018-06-27 CRAN (R 3.5.1)
 hms             0.4.2       2018-03-10 CRAN (R 3.5.1)
 httr            1.3.1       2017-08-20 CRAN (R 3.5.1)
 jqr             1.1.0.9100  2019-01-18 Github (ropensci/jqr@4a43703)
 jsonlite        1.5         2017-06-01 CRAN (R 3.5.1)
 lattice         0.20-35     2017-03-25 CRAN (R 3.5.1)
 lazyeval        0.2.1       2017-10-29 CRAN (R 3.5.1)
 lubridate       1.7.4       2018-04-11 CRAN (R 3.5.1)
 magrittr      * 1.5         2014-11-22 CRAN (R 3.5.1)
 memoise         1.1.0       2017-04-21 CRAN (R 3.5.1)
 methods       * 3.5.1       2018-07-03 local
 modelr          0.1.2       2018-05-11 CRAN (R 3.5.1)
 munsell         0.5.0       2018-06-12 CRAN (R 3.5.1)
 nlme            3.1-137     2018-04-07 CRAN (R 3.5.1)
 pillar          1.3.0       2018-07-14 CRAN (R 3.5.1)
 pkgconfig       2.0.2       2018-08-16 CRAN (R 3.5.1)
 plyr            1.8.4       2016-06-08 CRAN (R 3.5.1)
 purrr         * 0.2.5       2018-05-29 CRAN (R 3.5.1)
 R6              2.2.2       2017-06-17 CRAN (R 3.5.1)
 Rcpp            0.12.18     2018-07-23 CRAN (R 3.5.1)
 readr         * 1.1.1       2017-05-16 CRAN (R 3.5.1)
 readxl          1.1.0       2018-04-20 CRAN (R 3.5.1)
 rlang           0.2.2       2018-08-16 CRAN (R 3.5.1)
 rstudioapi      0.7         2017-09-07 CRAN (R 3.5.1)
 rvest           0.3.2       2016-06-17 CRAN (R 3.5.1)
 scales          1.0.0       2018-08-09 CRAN (R 3.5.1)
 stats         * 3.5.1       2018-07-03 local
 stringi         1.2.4       2018-07-20 CRAN (R 3.5.1)
 stringr       * 1.3.1       2018-05-10 CRAN (R 3.5.1)
 tibble        * 1.4.2       2018-01-22 CRAN (R 3.5.1)
 tidyr         * 0.8.1       2018-05-18 CRAN (R 3.5.1)
 tidyselect      0.2.4       2018-02-26 CRAN (R 3.5.1)
 tidyverse     * 1.2.1       2017-11-14 CRAN (R 3.5.1)
 tools           3.5.1       2018-07-03 local
 utils         * 3.5.1       2018-07-03 local
 withr           2.1.2       2018-03-15 CRAN (R 3.5.1)
 xml2            1.2.0       2018-01-24 CRAN (R 3.5.1)
```
sckott commented 5 years ago

Thanks for the report @mmuurr

We came upon this recently, see https://github.com/ropensci/geojson/issues/36

The answer is essentially that you're pushing too much data in at once; the workaround is to push it in smaller chunks. Is that possible in your case?

@jeroen With this example, the 100K-element JSON input works fine with jq on the CLI, so is there anything we can do to change this? If not, maybe we can help users split JSON into chunks and then re-combine the results. That would work when, as in the example above, each element of the vector is valid JSON on its own, but it's not so easy otherwise.
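
A minimal sketch of that split/re-combine idea (untested at scale; the 10,000 chunk size is just the value known to work above, not a derived limit):

```r
json <- replicate(100e3, '{"a":"A","b":"B","c":"C"}')

# split into chunks of at most 10,000 strings, run jq on each chunk,
# then flatten the per-chunk results back into one character vector
chunks <- split(json, ceiling(seq_along(json) / 10e3))
out <- unlist(lapply(chunks, jqr::jq, "{a,b,c}"), use.names = FALSE)
```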

mmuurr commented 5 years ago

@sckott, yeah, I'm working around it for now by chunking the inputs (e.g. via readr::read_lines_chunked), but I thought I'd raise the issue for awareness :-)
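
For reference, the pattern looks roughly like this (a sketch assuming the JSON strings live one per line in a hypothetical input.jsonl file; readr's ListCallback collects each chunk's result):

```r
library(readr)

# run jq over the file 10,000 lines at a time
res <- read_lines_chunked(
  "input.jsonl",  # hypothetical file: one JSON document per line
  ListCallback$new(function(x, pos) jqr::jq(x, "{a,b,c}")),
  chunk_size = 10e3
)
out <- unlist(res, use.names = FALSE)  # recombine the per-chunk results
```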

sckott commented 5 years ago

Glad you can break it up. We'll see what Jeroen says.

mmuurr commented 3 years ago

Hi there, just a polite re-surfacing of this issue, which I've run into again. Breaking long JSON (character) vectors into chunks works just fine, and writing a simple wrapper that does the chunking and then recombines the jqr results is indeed relatively easy. I'm wondering, though:

  1. Is there any guidance on the appropriate chunk size, in number of strings (i.e. vector length)?
  2. Or should the chunking instead be determined by the total byte size of each chunk, which adds some (albeit small) complexity to the wrapper?

Also, should such a wrapper be integrated directly into jqr? (If so, I'd be happy to take a first stab at it and open a PR, though I'll pass on that effort if y'all don't believe it should be part of the package.)

And if there's no built-in chunking wrapper, should jqr at least catch that specific error and give the user some advice (i.e. "hey user, try chunking")?
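
A sketch of what that catch-and-advise behavior could look like (the wrapper name and message text are made up, not jqr API):

```r
# hypothetical wrapper: re-throw the protection stack overflow with
# advice to chunk the input
jq_safely <- function(json, program) {
  tryCatch(
    jqr::jq(json, program),
    error = function(e) {
      if (grepl("protection stack overflow", conditionMessage(e))) {
        stop("Input too large for a single jq() call; ",
             "try splitting it into smaller chunks.", call. = FALSE)
      }
      stop(e)
    }
  )
}
```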

sckott commented 3 years ago

Thanks @mmuurr - sorry for the delay on this.

It makes sense that the threshold is around 10K; see
https://github.com/stedolan/jq/blob/9b51a0852a0f91fbc987f5f2b302ff65e22f6399/src/parser.c#L1692 (via https://github.com/stedolan/jq/issues/1054 and https://github.com/stedolan/jq/issues/1041).

I think a wrapper belongs here in the package.

Byte size does seem like it would be more appropriate.

Can you send a PR? We can discuss from there.
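
To seed that discussion, maybe something along these lines (a hypothetical helper, not a settled API; the greedy bucketing means a chunk can overshoot max_bytes by at most one element):

```r
# hypothetical helper: group inputs so each chunk's total byte size is
# roughly max_bytes, run jq per chunk, then recombine the results
jq_chunked <- function(json, program, max_bytes = 1e6) {
  bytes <- nchar(json, type = "bytes")
  chunk_id <- cumsum(bytes) %/% max_bytes  # greedy bucketing by cumulative size
  out <- lapply(split(json, chunk_id), jqr::jq, program)
  unlist(out, use.names = FALSE)
}

# e.g. jq_chunked(json, "{a,b,c}", max_bytes = 5e6)
```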

DataStrategist commented 9 months ago

I am also experiencing this error when I feed more than 50K JSON strings into jq. I can chunk the data, of course, but it's a bit disruptive in my case.