ncss-tech / aqp

Algorithms for Quantitative Pedology
http://ncss-tech.github.io/aqp/
54 stars 14 forks source link

aqp::union and aqp::unique enhancements #129

Closed brownag closed 4 years ago

brownag commented 4 years ago

I think we should have a (not default) option to take the first instance of a unique profile ID in the presence of conflicts within the elements of input list to aqp::union. The default behavior of returning an error with non-unique should be retained.

I come across this when trying to union specific combinations of MLRAs obtained via fetchKSSL -- but there are other instances where there can be collisions in profile ID or reasons why duplication might need to be reduced by some means.

To do this I think we should implement changes to aqp::unique (or a new method) to run checks on a single SPC, or a list of SPCs (i.e. before unioning), and return either a unique SPC or a summary of the conflicts.

The result would ideally be an SPC which would allow it to be used in magrittr pipelines. One option to aqp::unique could simply run some set of uniqueness diagnostics and return them in a data.frame.

Default result I think should be something like this (using existing code) to return pedons with completely unique data:

complete.unique <- function(p, ignore.list=c("")) {
 p[aqp::unique(p, vars=names(p)[!names(p) %in% ignore.list,]
}

Key change is rather than a vector of IDs, by some means we should be able to return an SPC, or list of SPCs, containing just the first instances found. This would be first instance across all lists, in case of list input -- essentially "cleaning" the list for aqp::union.

The columns specified in an ignore.list argument would be excluded from vars before hashing individual profiles to remove them from thee uniqueness constraint.

So for instance, if you wanted to deal with copied components you could drop the uniqueness constraints on idname and hzidname (i.e. coiid and chiid) -- and just look for matching data.

There may be some other columns that would have to be dropped in practice.

I know this flips the unique logic around a bit -- but this is something that has been in the back of my mind a while.

dylanbeaudette commented 4 years ago

Great idea. I don't think that all that many people are using aqp::unique, it doesn't come up often in my own work, and there might only be 2 cases where I have used it in documentation. So--totally fine with breaking-style changes to support aqp::union.

Thinking about this some more, it would be simpler if the input were always a list of SPCs, possibly generated by aqp::split or collated from different sources. We could do some magic by which the first argument is either SPC or list of SPC. I kind of like that option.

dylanbeaudette commented 4 years ago

Again, this a good time / way to break the current implementation of aqp::unique. More ideas:

Fortunately, this will find most use in the evaluation of copies returned by fetchNASIS, SDA_query, fetchKSSL and so on: e.g. we are removing exact copies.

One thing that we may have to do for the SPC or list of SPC as input: aqp::split must be able to perform an "identity" split, e.g. no grouping factor (#121). I might implement that right now.

brownag commented 4 years ago

Thanks @cmrobinson07 and @smroecker for looping me in on the Texas mollic epipedon discussions. My review of the needs and topics there has led me back to an aqp issue I made a while back. More on that elsewhere.

Here is approximately the function that I wanted this issue, #129, to address: lunique

lunique takes a list of SPCs -- e.g. from batch application of an SDA query or fetchKSSL or similar -- and "cleans" it such that it will successfully aqp::union.

# UPDATED (2020/05/29)
#  experimental: new function to clean list input where dupes exist (that prevent union)
lunique <- function(l) {

  # calculate profile IDs for each SPC element in l
  l.pid <- lapply(l, function(x) {
    if(!inherits(x, 'SoilProfileCollection')) {
      return(NA)
    }
    profile_id(x)
  })

  # keep track of all-NA and length (number of profiles per set)
  l.na <- as.logical(unlist(lapply(l.pid, function(x) all(is.na(x)))))
  l.n <- lapply(l.pid, length)

  # make data frame of profile ID and input list index
  df <- data.frame(pid = do.call('c', l.pid),
                   idx = do.call('c', lapply(1:length(l.n),
                                             function(n) rep(n, l.n[n]))))

  # squash non-uniques in combined data.frame, based on profile ID
  df <- df[order(df$pid),]
  d.out <- do.call('rbind', lapply(split(df, df$pid), function(d) {
    # note diagnostics on d can be helpful to ID type of non-uniqueness
    return(d[1,])
  }))

  # split back into "original" list elements
  d.split <- split(d.out, d.out$idx)
  l.out <- lapply(1:length(l.na), function(k) {
    if(!l.na[k]) {
      needle <- l.pid[[k]]
      haystack <- d.split[[as.character(k)]]
      ids <- which(needle %in% haystack$pid)
      if(length(ids) == 0)
        return(NA)
      return(l[[k]][ids, ])
    } else { 
      return(NA)
    }
  })
  return(l.out)
}

I tested it out on SDA component data retrieved via fetchSDA by areasymbol for the entire state of texas.

The "test" is essentially this:

all(unique(do.call('c', lapply(res, profile_id))) == profile_id(spc))

where res is the list of SPCs result (returned from lapply(..., fetchSDA(...)) and spc is the SPC after calling union(lunique(res))

brownag commented 4 years ago

I implemented a slightly-more-fortified version than the one I first posted along with some fairly rigorous tests. The first version I posted would not handle degenerate cases of all-NA handling.

aae0a5e

Now, instead of erroring out, it re-inserts NA into the correct positions, so the input and output list elements (and length) match -- containing only the unique values in their original group.

I am not sure the name is good -- but it is kind of fun to (try) to say -- and it seems like a useful operation to be able to do. If needed could move to soilDB.

brownag commented 4 years ago
  • do we include special slots (I think so)

Yes -- I will look into this

  • do we include diagnostic features (maybe not)

Mmm.. indeed this becomes difficult. There could be just about anything in the diagnostic / restriction slots.

  • maybe flip the logic on ignore.list: chances are that it will be simpler to list those variables that matter vs. exclusion of all that don't

Well, the heart of my suggestion was to flip the logic on the default that we already have -- which is what you suggested.

With the work you have done to vectorize fetchKSSL, and fixing union to work with possibly NULL results, and the above lunique method for cleaning a list... I suppose I can relatively implement complete.unique if I need it using the above code.

I disagree that it is easier to specify the variables I want to consider for things like component or pedon data from NASIS or SDA or KSSL... but I do agree that in most cases a reasonable determination of uniqueness can be determined from a subset of the data.

If I need it I will use this:

complete.unique <- function(p, ignore.list=c("")) {
 p[aqp::unique(p, vars=names(p)[!names(p) %in% ignore.list,]
}
brownag commented 4 years ago

This type of thing still does need to be fixed

library(aqp)
data(sp5)
sp5$newhzid <- rev(1:nrow(sp5))
hzidname(sp5) <- "newhzid"
hzidname(unique(sp5, vars = c("soil","hzID")))

EDIT: apparently was having trouble copying and pasting and reading

hzidname(aqp::unique(sp5, vars = c("id","hzID"))) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘hzidname’ for signature ‘"integer"’

brownag commented 4 years ago

We should also consider importing rbind.data.table(..., fill = TRUE) from data.table to replace plyr::rbind.fill in aqp::union

https://gist.github.com/brownag/b58ce5a090f65ff50a5b58a5284578b8

If this sounds good, I could add rbind to the list of imports in aqpdf and make the replacement in union...

I have a lot of commits off the main branch to merge into mine, so maybe we can hold off and just know that the following is possible and probably superior to plyr.

EDIT: add rbindlist to benchmark -- as it is actually a direct replacement for do.call('rbind.fill', ...) and has a fill argument of its own!

This is rbind-filling two (200 column x 1000 row) data.tables/frames.

# Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
# data.table  5.401874  5.752023  7.052155  5.865167  6.045043  22.52242   100
# as.data.table  9.842593 10.114694 11.703388 10.313812 11.054234  29.69442   100
# data.table::rbindlist  5.419195  5.691052  7.322580  5.866808  6.099170  24.41400   100
# data.table::rbindlist + coercion  5.447201  5.662593  7.680818  5.867926  6.308414  23.63164   100
# plyr 49.178042 49.744839 56.202575 50.105359 50.856851 418.81422   100
dylanbeaudette commented 4 years ago

@brownag what?

library(aqp)
data(sp5)
sp5$newhzid <- rev(1:nrow(sp5))
hzidname(sp5) <- "newhzid"
hzidname(unique(sp5, vars = c("soil","hzID")))

That cannot work because the current aqp::unique returns an integer index to unique profiles within a collection. The error is expected.

Let's split this thread into union / unique topics.

dylanbeaudette commented 4 years ago

We should also consider importing rbind.data.table(..., fill = TRUE) from data.table to replace plyr::rbind.fill in aqp::union

Good call. I'm 100% for replacement of plyr and reshape with data.table. Lets do this after the merge of aqpdf branch.

Related issues:

brownag commented 4 years ago

Deleted my previous comment. It is confusing and my code was confused. I was only trying to point out that changing what is perceived as typical behavior causes confusion -- and wound up being the guinea pig.

See PR for aqpdf to bring in the initial data.table support. we can add rbindlist etc. and tag the other issues. This issue is closed.