ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

Consider adding a warning to `write_eml` generates random `packageId` and `system` values #293

Open amoeba opened 4 years ago

amoeba commented 4 years ago

Stemming from #292,

Right now, when an eml object is written to disk with write_eml without a packageId, write_eml fills in packageId with a UUID and sets system to uuid:

https://github.com/ropensci/EML/blob/6c2911c4001a60e9a838059ec3c0c8fd7018f6a2/R/write_eml.R#L27-L31

In #292, this created some understandable confusion for @scelmendorf because it meant that eml_validate's behavior was inconsistent depending on whether the object was written to disk first or not.

In #292, @cboettig wrote:

I think the best fix would be that write_eml should throw a warning if no packageId is found, and explain that it is adding an ID automatically. I'd love a PR for that if we have consensus on that.

Also in #292, @mobb wrote:

The temporary packageId and system are fine (as inserted by write_eml). However, users should be aware that this is happening, so I recommend a message to the screen if write_eml inserts these. For those who control their packageIds, the msg will help remind them to assign it.

I agree with the above and it seems we have a consensus but I wanted to file a standalone issue for discussion in case it was needed. Open to comments or suggested wording of the warnings/messages.

I propose:

  1. If packageId is not set when write_eml is called, issue a warning.
  2. The warning will have the text: "packageId generated automatically because it was not already set. See ?packageId for more information."
  3. Create a man page at ?packageId with more information. TBD. It'd explain how to set packageId and system and how to come up with sensible values.
mbjones commented 4 years ago

While I fully appreciate the motivation for providing a packageId automatically, I am still uncomfortable with doing so in write_eml, which is meant to simply be a serialization of an in-memory object. Side-effects that aren't in user control often cause problems, and I think what is written to disk should be round-trippable with the in-memory object.

Instead of changing content, couldn't we simply provide warnings for any validation errors (including a missing packageId) so the user will know that their document doesn't validate when it is written to disk? This would allow folks that want to export a partial document for further downstream processing to do so, while still helping to prompt when required fields are missing.

cboettig commented 4 years ago

@mbjones Yeah, I'm with you :100: on that.

I think we should also add a helper routine to add_packageId() that could default to the current UUID method to avoid the possible complexity associated with this task. That would be far more transparent than doing it on write.

Perhaps we can also add some additional logic into the validator which would report that the EML was valid except for a missing packageId (possibly with a suggestion on how to add one?, e.g.)

EML file is valid except for a missing packageId. You can create one now with `add_packageId()`

I've always been a bit annoyed at the standard XML validation errors not being as user-friendly or comprehensive as they should be.

Thoughts?

mbjones commented 4 years ago

Sounds great!