ncss-tech / aqp

Algorithms for Quantitative Pedology
http://ncss-tech.github.io/aqp/
54 stars 14 forks source link

think carefully before masking base functions #144

Closed dylanbeaudette closed 3 years ago

dylanbeaudette commented 4 years ago

I'm thinking that it may be unwise to cause the following:

The following objects are masked from ‘package:base’:

    split, union
brownag commented 4 years ago

I think it is fine because they are properly set up as generics for S4 classes.

dylanbeaudette commented 4 years ago

ok

brownag commented 4 years ago

While it is "fine" -- I do agree it is possibly unwise -- it is a little hairy given that we have defined additional arguments that are not SoilProfileCollections that slightly differ from the base function definitions... I did look into this somewhat when I had to sort out the S4 documentation for these things.

Both split and union have several possible conflicts in base/other packages -- I am not sure whether you were getting at "can we" or "should we" -- the latter is more philosophical and relates to the naming issue #140

dylanbeaudette commented 4 years ago

I'm more concerned with should.

brownag commented 4 years ago

... which should be discussed in the other issue probably. Can you elaborate? I don't see an inherent issue other than that it generates a message on load.

brownag commented 4 years ago

Possibly relevant: https://stackoverflow.com/questions/28321655/r-creating-a-generic-function-for-split-from-package-base-in-the-global-env

There are probably safer ways we can handle these masking -- and possibly ways to avoid the messages while retaining names -- if we are clever.

brownag commented 4 years ago

https://stackoverflow.com/questions/26963900/generalizing-three-dots-argument-dispatch-s4-methods-for-argument-set-i

brownag commented 4 years ago

Also, I was wrong to be so definitive in my first response. As you said, we should think carefully about this.

filter and split are "fine" in that they are unambiguous generics for SoilProfileCollections.

union is a problem because it takes a list as an argument. Not a SoilProfileCollection. So we are not dispatching it via S4. It has a very different method signature from the other union and it will mask it in an ambiguous way. So, the user would have to use base::union if aqp is loaded. This is not ideal at all.

r.bind.SoilProfileCollection kindof makes/made more sense (currently deprecated but functional). Usually we have to put things into a list anyway. I can see the value of having both abilities -- however when we cant be sure we are passing a SPC we probably shouldn't mask any names. The functionality of union (and split) is sufficiently different from both of the base methods. Indeed, aqp::filter is quite a bit different from stats::filter and is more like base::subset IMO. These are all good fodder for discussion of renaming to something that will not mask, or will have behavior similar to base but for SPCs.

brownag commented 4 years ago

See the changes I needed to make for documentation / dispatch of plot -- relevant to my above comments about changing the default arguments that are used in a generic. It is one thing to add arguments, but it is another to remove them. For plot, we need to define a y argument ... because that is what the generic contains.

We can put the documentation to the main Rd file where plotSPC is defined. Now ?plot shows the {aqp} method, if loaded.

plothelp

https://github.com/ncss-tech/aqp/commit/0b5ee26597df93d4bf0ba0313d8fe37c2b40764f

dylanbeaudette commented 4 years ago

Thanks for working on this. Chalk this up to another major set of decisions: generic (possible overloaded) function names vs. something like unionSPC. Conflicts with other packages aren't as much of an issue, but conflicts with base R seems like a recipe for confusion.

brownag commented 4 years ago

I recently added trunc,SoilProfileCollection-method as a wrapper method for glomApply(spc, function(p) return(c(constant_top, constant_bottom)), truncate=TRUE). It produces no messages about masking -- the generic is left unaltered.

Since the original base::trunc includes ... in the generic method signature, I can add whatever arguments I want to the S4 version of the method (for SPCs). The same cannot be said for our current usage of filter, union and split. I think w.e should consider changing these names entirely. I am inclined to fix filter since it has never been on CRAN. base::subset has the correct functional connotation and method signature (x, ...)

dylanbeaudette commented 4 years ago

Thanks for trunc, I can think of several use cases.

As for re-naming, I don't mind using the SPC suffix for most of these short function names: plotSPC, unionSPC, filterSPC, splitSPC. Extending subset is another option, your call as you wrote the back-end code. Once we decide, I'd like to deprecate subsetProfiles as there isn't really any reason to have that much overlapping functionality.

brownag commented 4 years ago

The base::filter is for linear filtering on time series. The reason to use filter in an aqp function name at all is b/c "a lot" of R users "think of" it as the verb that does subsetting using logical expressions. In base, the verb is subset. "Under the hood" filter and the other methods that use NSE use rlang .... but that is no reason to use dplyr verbs per se. We can have filterSPC as an alias, rather than adding filter for the first time on CRAN and deprecating it immediately.

Names in base like min max and unique are major conveniences for SPCs -- and they essentially do what their verb name says as best they can for SPCs. They can be implemented via S4 without changing or affecting default generic -- and don't issue messages on load -- is that the bar we set for whether "SPC" gets added to a function or not? Seems like a decent practical limitation to use to guide naming of the fundamental generics that work with the SPC class.

brownag commented 4 years ago

Side note: I wish aqp::unique was more easily usable with %>% -- if it returned an SPC object and not a numeric vector that would be possible. However, it returns a numeric index.

This is relevant to the discussion because we can add an argument. The base::unique generic includes ... ... however I think that the argument should be to use the current behavior and new default should be a SPC result. I failed to explain/convince you of this in #129. I am curious how much work it would be to fix that downstream (e.g. add an argument to return IDs, or refactor) the stuff for SoilWeb etc. where the MD5 hashes are used.

dylanbeaudette commented 4 years ago

The base::filter is for linear filtering on time series. The reason to use filter in an aqp function name at all is b/c "a lot" of R users "think of" it as the verb that does subsetting using logical expressions. In base, the verb is subset. "Under the hood" filter and the other methods that use NSE use rlang .... but that is no reason to use dplyr verbs per se. We can have filterSPC as an alias, rather than adding filter for the first time on CRAN and deprecating it immediately.

I'm fine either way, even though I periodically (ha! get it?) use base::filter. I've already started writing filter(x, ...) in my code, but not enough time to set precedent. You call on this one. If we keep filter, then adding filterSPC is a fine backup plan in case we need to revert. Good idea.

Names in base like min max and unique are major conveniences for SPCs -- and they essentially do what their verb name says as best they can for SPCs. They can be implemented via S4 without changing or affecting default generic -- and don't issue messages on load -- is that the bar we set for whether "SPC" gets added to a function or not? Seems like a decent practical limitation to use to guide naming of the fundamental generics that work with the SPC class.

Good idea. This means that plot,SoilProfileCollection will become plotSPC. Filter is kind of a special case, but in general I'd like to avoid masking base functions when we cannot match the original argument pattern of generic functions.

dylanbeaudette commented 4 years ago

Side note: I wish aqp::unique was more easily usable with %>% -- if it returned an SPC object and not a numeric vector that would be possible. However, it returns a numeric index.

True, but a lot has changed since the original motivation / implementation of aqp::unique.

This is relevant to the discussion because we can add an argument. The base::unique generic includes ... ... however I think that the argument should be to use the current behavior and new default should be a SPC result. I failed to explain/convince you of this in #129. I am curious how much work it would be to fix that downstream (e.g. add an argument to return IDs, or refactor) the stuff for SoilWeb etc. where the MD5 hashes are used.

There was no failure, rather the discussion on unique was put on the back-burner while we figured out aqp::split and aqp::union. At this point I am fine with breaking unique in favor of returning an SPC. I've got a reasonable list of places where I use this function and I think that I can safely work around the changes. Moving that discussion to #159.

brownag commented 4 years ago

Good idea. This means that plot,SoilProfileCollection will become plotSPC. Filter is kind of a special case, but in general I'd like to avoid masking base functions when we cannot match the original argument pattern of generic functions.

I think plot,SoilProfileCollection is fine actually since I added the y argument to our usage of it. It does nothing presently. This is how raster did it. And of course plotSPC already exists.

brownag commented 4 years ago

I'm fine either way, even though I periodically (ha! get it?) use base::filter. I've already started writing filter(x, ...) in my code, but not enough time to set precedent. You call on this one. If we keep filter, then adding filterSPC is a fine backup plan in case we need to revert. Good idea.

I am suggesting that I define subset,SoilProfileCollection-method and filterSPC. Whether we leave filter defined and deprecated for the first time we publish to CRAN is up to you -- but I was suggesting we not do that. Dump filter altogether is what I am leaning towards. Or I could do nothing and allow it to warn about the re-definition of the base R filter, as dplyr does every time it loads.

This issue may not have even started if it weren't for the fact that we get messages every time we load the aqp package on its own. If we adhere to the rule I stated above where we do not ever re-define the generics of base methods, we will only see those messages when we have non-base packages with same names (e.g. dplyr). I can look into what I can do to avoid even those messages -- as they may be very similar to the plot,SoilProfileCollection case above.

dylanbeaudette commented 4 years ago

Good points, and I agree with your suggestions. There will always be some noise at package load time, esp. with the popularity of incantations like library(tidyverse).

brownag commented 4 years ago

I have not forgotten about union. I think I have wavered somewhat in my resolve for tackling some of these... so rallying some last minute ideas.

The methods that can be dispatched via S4 for SoilProfileCollections are "safe" -- so split is OK -- but union still remains. Really, what we call aqp::union does not behave sufficiently like base::union to even call it something like unionSPC.

In https://github.com/ncss-tech/aqp/tree/union I took a crack at resolving this and propose a new syntax.

I propose we:

brownag commented 4 years ago

Tagging #168; moving some above commentary to draft PR

brownag commented 4 years ago

I am not sure how commonly folks run into masking issues with aqp, but was closing some browser tabs and thought I would add a link to this discussion.

Might be worth working up a brief aqp-specific tutorial describing ideas in this article as it pertains to common packages. In R 3.6.0+ we can use the conflicts.policy options to have more control over effects of library/require calls.

https://developer.r-project.org/Blog/public/2019/03/19/managing-search-path-conflicts/index.html

These are practical base R solutions to the "inevitability" of masking -- and highlight the difference between masking that is intentional (e.g. dplyr replaces union, setdiff etc with methods that work generically in place of base equivalents) versus unexpected (two packages that were not 'designed' to co-exist -- e.g. aqp and dplyr).

It also highlights that by default the generics that are handled by S4 are not considered conflicts unless in strict mode. While there are packages that manage conflicts, as author says this behavior should be lightweight and part of base R. It would be great if, as he suggests, these things could be defined somehow in the package DESCRIPTION.

dylanbeaudette commented 4 years ago

Thanks for offering ideas and several possible solutions.

  • remove all of the arguments to aqp::union other than the list. they are not needed. spatial result can revert to default for non-conformal case, and the na.rm argument is rarely used and simply triggers an error anyway. The method argument is interesting but not developed, and alluded to a variety of gaps in SoilProfileCollection methods. These could be addressed as punion pintersects psetdiff psetequal etc. analogous to the {data.table} set operations e.g. funion. This could warrant a heavier use of {digest} -> aqp::unique / lunique

I like the idea of removing all arguments to aqp::union until we figure out what to call the replacement. I'd like to talk briefly about the path we want to take forward: p-**** to build off related set operations (that is appealing) vs. more connotative and possible conflicting words like combine. I don't know which is more effective, but the set operation approach is growing on me.

  • deprecate aqp::union in next CRAN release, replacing it "officially" with pbindlist (or some to-be-determined name)

Yes please.

  • the name pbindlist draws from data.table::rbindlist the workhorse of the method -- we are "profile" binding rather than "row"

Yeah, I get that but the name isn't appealing to me. The logic is sound but I think that a more connotative name will help adoption.

  • remove or un-deprecate rbind.SoilProfileCollection, since there is no reason IMO to not support ... expansion aside from potential for misuse (do.call) but otherwise it has been deprecated since 2018 and it is time to go

I think that we should remove it.

  • until removal change the default arguments of aqp::union to match generic of base::union

Yes, thanks.

  • aqp::union still "works" except it now issues a deprecation message (and possible IDE warnings about missing y default argument due to borrowing the base R generic)

Good idea so as not to break things just yet.

brownag commented 3 years ago

I want to add to this that the choice to mask stats::filter has driven me nuts since I implemented it -- I don't like the noise on loading. And I also do not like that it is a kinda nuanced to use with {dplyr}. I would really like to encourage aqp::subset which is "safe" and an appropriate {base} verb. \

brownag commented 3 years ago

Proposed changes in #203--coupled with removal of some of our older previously deprecated functions cough aqp::union--should allow us to close this issue.