Closed brownag closed 4 years ago
Great idea. I don't think that all that many people are using aqp::unique
, it doesn't come up often in my own work, and there might only be 2 cases where I have used it in documentation. So--totally fine with breaking-style changes to support aqp::union
.
Thinking about this some more, it would be simpler if the input were always a list of SPCs, possibly generated by aqp::split
or collated from different sources. We could do some magic by which the first argument is either SPC or list of SPC. I kind of like that option.
Again, this a good time / way to break the current implementation of aqp::unique
. More ideas:
ignore.list
: chances are that it will be simpler to list those variables that matter vs. exclusion of all that don'tFortunately, this will find most use in the evaluation of copies returned by fetchNASIS
, SDA_query
, fetchKSSL
and so on: e.g. we are removing exact copies.
One thing that we may have to do for the SPC or list of SPC as input: aqp::split
must be able to perform an "identity" split, e.g. no grouping factor (#121). I might implement that right now.
Thanks @cmrobinson07 and @smroecker for looping me in on the Texas mollic epipedon discussions. My review of the needs and topics there has led me back to an aqp
issue I made a while back. More on that elsewhere.
Here is approximately the function that I wanted this issue, #129, to address: lunique
lunique
takes a list of SPCs -- e.g. from batch application of an SDA query or fetchKSSL
or similar -- and "cleans" it such that it will successfully aqp::union
.
# UPDATED (2020/05/29)
# experimental: new function to clean list input where dupes exist (that prevent union)
lunique <- function(l) {
# calculate profile IDs for each SPC element in l
l.pid <- lapply(l, function(x) {
if(!inherits(x, 'SoilProfileCollection')) {
return(NA)
}
profile_id(x)
})
# keep track of all-NA and length (number of profiles per set)
l.na <- as.logical(unlist(lapply(l.pid, function(x) all(is.na(x)))))
l.n <- lapply(l.pid, length)
# make data frame of profile ID and input list index
df <- data.frame(pid = do.call('c', l.pid),
idx = do.call('c', lapply(1:length(l.n),
function(n) rep(n, l.n[n]))))
# squash non-uniques in combined data.frame, based on profile ID
df <- df[order(df$pid),]
d.out <- do.call('rbind', lapply(split(df, df$pid), function(d) {
# note diagnostics on d can be helpful to ID type of non-uniqueness
return(d[1,])
}))
# split back into "original" list elements
d.split <- split(d.out, d.out$idx)
l.out <- lapply(1:length(l.na), function(k) {
if(!l.na[k]) {
needle <- l.pid[[k]]
haystack <- d.split[[as.character(k)]]
ids <- which(needle %in% haystack$pid)
if(length(ids) == 0)
return(NA)
return(l[[k]][ids, ])
} else {
return(NA)
}
})
return(l.out)
}
I tested it out on SDA component data retrieved via fetchSDA
by areasymbol for the entire state of texas.
The "test" is essentially this:
all(unique(do.call('c', lapply(res, profile_id))) == profile_id(spc))
where res is the list of SPCs result (returned from lapply(..., fetchSDA(...))
and spc
is the SPC after calling union(lunique(res))
I implemented a slightly-more-fortified version than the one I first posted along with some fairly rigorous tests. The first version I posted would not handle degenerate cases of all-NA handling.
aae0a5e
Now, instead of erroring out, it re-inserts NA
into the correct positions, so the input and output list elements (and length) match -- containing only the unique values in their original group.
I am not sure the name is good -- but it is kind of fun to (try) to say -- and it seems like a useful operation to be able to do. If needed could move to soilDB.
- do we include special slots (I think so)
Yes -- I will look into this
- do we include diagnostic features (maybe not)
Mmm.. indeed this becomes difficult. There could be just about anything in the diagnostic / restriction slots.
- maybe flip the logic on
ignore.list
: chances are that it will be simpler to list those variables that matter vs. exclusion of all that don't
Well, the heart of my suggestion was to flip the logic on the default that we already have -- which is what you suggested.
With the work you have done to vectorize fetchKSSL, and fixing union to work with possibly NULL
results, and the above lunique
method for cleaning a list... I suppose I can relatively implement complete.unique
if I need it using the above code.
I disagree that it is easier to specify the variables I want to consider for things like component or pedon data from NASIS or SDA or KSSL... but I do agree that in most cases a reasonable determination of uniqueness can be determined from a subset of the data.
If I need it I will use this:
complete.unique <- function(p, ignore.list=c("")) {
p[aqp::unique(p, vars=names(p)[!names(p) %in% ignore.list,]
}
This type of thing still does need to be fixed
library(aqp)
data(sp5)
sp5$newhzid <- rev(1:nrow(sp5))
hzidname(sp5) <- "newhzid"
hzidname(unique(sp5, vars = c("soil","hzID")))
EDIT: apparently was having trouble copying and pasting and reading
hzidname(aqp::unique(sp5, vars = c("id","hzID"))) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘hzidname’ for signature ‘"integer"’
We should also consider importing rbind.data.table(..., fill = TRUE)
from data.table
to replace plyr::rbind.fill
in aqp::union
https://gist.github.com/brownag/b58ce5a090f65ff50a5b58a5284578b8
If this sounds good, I could add rbind
to the list of imports in aqpdf
and make the replacement in union
...
I have a lot of commits off the main branch to merge into mine, so maybe we can hold off and just know that the following is possible and probably superior to plyr.
EDIT: add rbindlist
to benchmark -- as it is actually a direct replacement for do.call('rbind.fill', ...)
and has a fill
argument of its own!
This is rbind-filling two (200 column x 1000 row) data.tables/frames.
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 5.401874 5.752023 7.052155 5.865167 6.045043 22.52242 100
# as.data.table 9.842593 10.114694 11.703388 10.313812 11.054234 29.69442 100
# data.table::rbindlist 5.419195 5.691052 7.322580 5.866808 6.099170 24.41400 100
# data.table::rbindlist + coercion 5.447201 5.662593 7.680818 5.867926 6.308414 23.63164 100
# plyr 49.178042 49.744839 56.202575 50.105359 50.856851 418.81422 100
@brownag what?
library(aqp)
data(sp5)
sp5$newhzid <- rev(1:nrow(sp5))
hzidname(sp5) <- "newhzid"
hzidname(unique(sp5, vars = c("soil","hzID")))
That cannot work because the current aqp::unique
returns an integer index to unique profiles within a collection. The error is expected.
Let's split this thread into union / unique topics.
We should also consider importing
rbind.data.table(..., fill = TRUE)
fromdata.table
to replaceplyr::rbind.fill
inaqp::union
Good call. I'm 100% for replacement of plyr
and reshape
with data.table
. Lets do this after the merge of aqpdf
branch.
Related issues:
Deleted my previous comment. It is confusing and my code was confused. I was only trying to point out that changing what is perceived as typical behavior causes confusion -- and wound up being the guinea pig.
See PR for aqpdf to bring in the initial data.table support. we can add rbindlist etc. and tag the other issues. This issue is closed.
I think we should have a (not default) option to take the first instance of a unique profile ID in the presence of conflicts within the elements of input list to
aqp::union
. The default behavior of returning an error with non-unique should be retained.I come across this when trying to union specific combinations of MLRAs obtained via
fetchKSSL
-- but there are other instances where there can be collisions in profile ID or reasons why duplication might need to be reduced by some means.To do this I think we should implement changes to
aqp::unique
(or a new method) to run checks on a single SPC, or a list of SPCs (i.e. before unioning), and return either a unique SPC or a summary of the conflicts.The result would ideally be an SPC which would allow it to be used in
magrittr
pipelines. One option toaqp::unique
could simply run some set of uniqueness diagnostics and return them in adata.frame
.Default result I think should be something like this (using existing code) to return pedons with completely unique data:
Key change is rather than a vector of IDs, by some means we should be able to return an SPC, or list of SPCs, containing just the first instances found. This would be first instance across all lists, in case of list input -- essentially "cleaning" the list for
aqp::union
.The columns specified in an
ignore.list
argument would be excluded fromvars
before hashing individual profiles to remove them from thee uniqueness constraint.So for instance, if you wanted to deal with copied components you could drop the uniqueness constraints on
idname
andhzidname
(i.e.coiid
andchiid
) -- and just look for matching data.There may be some other columns that would have to be dropped in practice.
I know this flips the unique logic around a bit -- but this is something that has been in the back of my mind a while.