ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

Create a set_packageId() function #294

Open cboettig opened 4 years ago

cboettig commented 4 years ago

Following up on #293, proposal to resolve results in two separate tasks: removing the creation of setting id on write_eml(), and adding a separate function to set_packageId(), and separately, possibly to modify validation error message specifically for the packageId() case to provide a nice clear error message on this (perhaps not needed but could ease the transition since this is a breaking change?)

I'm a bit unsure on the preferred syntax for set_packageId(). In the set_packageId branch I've added a mockup with the syntax like this:

my_eml <- set_packageId(my_eml, packageId = "abc123", system = "internal")

All fields have defaults, so you can create a new eml object with package id using:

my_eml <- set_packageId()`

But I'm surprised just how many interface questions this raises. Most set_* methods take only the fields to be added and construct a list fragment, they don't take an argument for the full existing eml list and return the full eml object, but maybe that's okay here since these are top-level attributes?

Or it might be better to call this set_eml(), creating a top-level constructor and not take an existing eml list object as an argument, e.g. something that looks like:

my_eml <- set_eml(title = ...., creator = ...., packageId = ..., system = ....)

The function would probably need an assertion to make sure that the system is set manually if the packageId is set manually too.

Thoughts @amoeba @mbjones others?

amoeba commented 4 years ago

Thanks for raising this. I'm really not sure and wanna chew on it.

Re. set_: When I first saw you use set_* years? ago, I thought it would take my EML and return modified EML. I think that's semantically nearest to set. This changes the behavior of other set_ functions though. It allows code like eml %>% set_title("Foo") %>% set_abstract("Bar") -> eml. Just my 2c there.

Also: Another way to make set_packageId() work would be my_eml$packageId <- set_packageId(), which is a bit odd but matches the current implicit interface on the set_* functions.

I think @jeanetteclark and @dmullen17 could offer some good insight here too.

jeanetteclark commented 4 years ago

I agree that having set_packageId return a full eml object is a little awkward. However, since the very beginning of my use of the EML package I have wanted a top level helper to create a minimally valid EML document without fussing around with list (this was an even worse problem in the S4 version of EML!). Because of that, I like the set_eml option outlined above the best.

cboettig commented 4 years ago

Thanks @amoeba ! I believe the other set_* methods are identical to the methods in the original S4 version of package (pre emld), where they were essentially object constructors for every complexType object (since we had class definitions for each of those). Since that meant things like attributeList were valid objects with there own methods, so I think in S4 methods world that made some sense (though strictly following convention the constructor should have been just the object name, attributeList(), etc, and obviously we never wrote constructors for all the 100+ complexTypes). In swapping in emld in the backend we tried to change as little as possible int the original package API (which was probably a bit silly since almost any real-world use code needed to use some S4 accessors and thus broke anyway, as you all know all too well!) anyway, old history.

I agree with you :100: though that the pipe structure you outline seems much more natural (and echoes a more functional programming style). I am very tempted to switch all the other methods to that. Downsides are that this would obviously be a breaking change, so I'm not sure it's worth it.

Also, most complexTypes don't have a set_ method, and are intended to be just built as lists, so it might confuse users to realize that a lot of practical EML couldn't be constructed entirely in a pipe chain at this time.

So I guess I'm still leaning to set_eml() proposal for now, but still undecided and would love more input.

amoeba commented 4 years ago

The backwards compatibility point is a big one and I think its worth avoiding such a breakage change as I describe. I think set_eml() is a fair compromise at this point.

cboettig commented 4 years ago

@amoeba @jeanetteclark Would you be interested in a quick PR defining a set_eml() method along these lines? I think it should also be added to the example in README, and then write_eml() should drop the behavior of automatically adding a packageId (but possibly gain a warning about creating one with set_eml() if no packageId exists).

Thanks!

jeanetteclark commented 4 years ago

Hey @cboettig I can try to give this a shot next week