Closed dylanbeaudette closed 3 years ago
Possibly related, missing "OSD site" data result in a very strange data.frame
when read-back via soilDB::get_OSD('ACADEMY', result = 'json')
I think the missing OSD site data could be relatively easily fixed here... https://github.com/ncss-tech/SoilKnowledgeBase/blob/d7a55060d2f89522d2eaf6e35225c696c050c35c/R/parseOSD_functions.R#L209-L215
by using similar handlign of "site-data"
as is done for "hz-data"
in case when there are no columns in result
As far as the JSON itself, could we not store empty strings as placeholders for columns that had no parseable data? This would make many of the JSON files larger but should at least preserve the column names.
Reopening b/c only addressed the site-level stuff so far.
As far as the JSON itself, could we not store empty strings as placeholders for columns that had no parseable data? This would make many of the JSON files larger but should at least preserve the column names.
A good idea. Either way (NA
as string | all-NA columns stripped) we will have to use some extra steps when processing the JSON files in bulk. Encoding NA
as strings seems to work fine with mixed data (numeric, character, logical)--I wasn't sure if all types would be converted to character in the process. Might be simpler than attempting to fix structural problems with data.table::rbindlist()
or similar.
I'd like to test the effects of `toJSON(..., na = 'string') as applied to the horizon data a little more.
My preference would be to store all variables in the JSON as strings. Then get_OSD()
can handle type conversion and NA-handling. It is vectorized and with all OSD JSON having a complete structure the "fill
" argument to rbindlist()
should not be needed.
I tested na = 'string'
https://github.com/ncss-tech/SoilKnowledgeBase/pull/37 and it will work fine once get_OSD is set up to handle the resulting data.
Any reason to store numeric / logical as strings? It seems like the encoding / decoding of NA by to/fromJSON does this for us.
Wasn't sure, so had to try.
library(jsonlite)
x <- data.frame(
a = c(1:10, NA),
b = c(NA, letters[1:10]),
c = c(rep(TRUE, 5), NA, rep(FALSE, 5))
)
toJSON(x, pretty = TRUE)
fromJSON(toJSON(x, pretty = TRUE))
z <- fromJSON(toJSON(x, pretty = TRUE, na = 'string'))
str(z)
summary(z)
I looked at this... not quite right. Am I missing something?
x <- data.frame(
a = c(1:10, NA),
b = c(NA, letters[1:10]),
c = c(rep(TRUE, 5), NA, rep(FALSE, 5))
)
x2 <- jsonlite::fromJSON(jsonlite::toJSON(x, na = 'string', pretty = TRUE))
all.equal(x, x2)
#> [1] "Component \"b\": 'is.NA' value mismatch: 0 in current 1 in target"
That is expected: we only have to manually decode NA
in character class columns. NA
is correctly encoded/decoded for free. News to me.
I guess my concern was about cases where columns that would be numeric (say, pH) come back as all NA. This would be interpreted as logical, not numeric? Can we ensure that types will be stable when get_OSD() is combining results from multiple JSON files that may have varying patterns of NA?
Ah, I see. Good point. There is no way to suggest the kind of NA
.
We would have to try the entire pile to know for sure, but the following suggests that real data take precedence over NA
of unspecified type:
library(jsonlite)
x <- data.frame(
a = c(1:10, NA),
b = c(NA, letters[1:10]),
c = c(rep(TRUE, 5), NA, rep(FALSE, 5)),
d = NA
)
y <- data.frame(
a = c(1:10, NA),
b = NA,
c = c(rep(TRUE, 5), NA, rep(FALSE, 5)),
d = 1
)
z1 <- fromJSON(toJSON(x, pretty = TRUE, na = 'string'))
z2 <- fromJSON(toJSON(y, pretty = TRUE, na = 'string'))
z <- rbind(z1, z2)
str(z)
summary(z)
# order doesn't seem to matter
all.equal(str(z1, z2), str(z2, z1))
Looks good to me, and it seems that data.table::rbindlist()
has same behavior as rbind()
. I am not sure why the action didn't run on #37 before, but I manually triggered so we should see results of na = 'string'
on whole set soon https://github.com/ncss-tech/SoilKnowledgeBase/actions/runs/1308803220
I believe with the updates to use na = 'string'
and handling of "NA" and types in soilDB::get_OSD()
this issue is now resolved.
library(soilDB)
x <- get_OSD(c("AABAB", "ACADEMY", "ARPATUTU", "CANISROCKS", "CONEJO", "MUSICK", "STANISLAUS"))
#> Loading required namespace: jsonlite
do.call('rbind', x$HORIZONS) |> str()
#> 'data.frame': 44 obs. of 16 variables:
#> $ name : chr "Oi" "A1" "A2" "Bw1" ...
#> $ top : int 0 3 10 30 48 56 112 0 15 30 ...
#> $ bottom : int 3 10 30 48 56 112 155 15 30 51 ...
#> $ dry_hue : chr NA "10YR" "10YR" "10YR" ...
#> $ dry_value : int NA 5 6 6 7 7 7 4 3 4 ...
#> $ dry_chroma : int NA 4 4 4 2 3 2 3 3 3 ...
#> $ moist_hue : chr NA "10YR" "10YR" "10YR" ...
#> $ moist_value : int NA 3 4 4 5 5 5 3 3 3 ...
#> $ moist_chroma : int NA 4 4 4 3 4 1 2 2 4 ...
#> $ texture_class: chr NA "silt loam" "silt loam" "silty clay loam" ...
#> $ cf_class : chr NA NA NA NA ...
#> $ pH : num NA NA 5.4 5.4 5.6 5.4 5.6 NA NA NA ...
#> $ pH_class : chr NA "moderately acid" "strongly acid" "strongly acid" ...
#> $ distinctness : chr "abrupt" "clear" "clear" "abrupt" ...
#> $ topography : chr "smooth" "smooth" "wavy" "wavy" ...
#> $ narrative : chr "Oi--0 to 1 inch; slightly decomposed deciduous leaves and twigs; abrupt smooth boundary." "A1--1 to 4 inches; dark yellowish brown (10YR 3/4) medial silt loam, yellowish brown (10YR 5/4) dry; moderate m"| __truncated__ "A2--4 to 12 inches; dark yellowish brown (10YR 4/4) medial silt loam, light yellowish brown (10YR 6/4) dry; mod"| __truncated__ "Bw1--12 to 19 inches; dark yellowish brown (10YR 4/4) silty clay loam, light yellowish brown (10YR 6/4) dry; fe"| __truncated__ ...
I did a bunch of testing yesterday afternoon--all good. Thanks.
It seem that
jsonLite::toJSON()
automatically drops columns which are all NA. This is news to me and the obvious solutions (null
andna
arguments) don't work. Settingna = 'string'
causes NA to be encoded as characterNA
.Example from an OSD that has no coarse fraction modifiers (
cf_class
is allNA
).This isn't a show-stopping bug, but it does mean that 1-at-a-time processing of the current JSON files may give slightly different structured output at each iteration; especially when there are typos or formatting errors.