ropensci / datapack

An R package to handle data packages
https://docs.ropensci.org/datapack

Is a method to determine sysmeta formatId, mediaType needed? #77

Open gothub opened 7 years ago

gothub commented 7 years ago

When a script creates a SystemMetadata object (i.e. when a DataObject is created), the sysmeta formatId must be specified.

Is it advisable to have a method that automatically determines the formatId from the file extension or file contents? This is an old problem with known issues, such as reliability.

Is it advisable to have an automated way to determine formatId, or to rely on the user determining and specifying this?

mbjones commented 7 years ago

@amoeba has a function that tries to guess the formatId. It's nice when it works. It doesn't always work, so we've been discussing whether no default is better than an incorrect guess. Let's discuss further. Maybe Jeanette and Jesse have thoughts on this too.

amoeba commented 7 years ago

Yeah, guess_format_id uses a hard-coded map between D1 format IDs and file extensions: https://github.com/NCEAS/arcticdatautils/blob/master/R/util.R#L79. I threw in a custom routine for NetCDF files that uses the metadata to guess the specific NetCDF version but otherwise things are based on file extension alone.
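For illustration, an extension-based lookup of this kind could be sketched roughly as follows. This is not the actual guess_format_id code: the function name `guess_format_by_extension`, the `format_map` table, and its entries are assumptions made up for the example, and the NetCDF routine is left out.

```r
# Hypothetical extension-to-formatId table; entries are illustrative only
# and are not copied from arcticdatautils.
format_map <- c(
  csv = "text/csv",
  txt = "text/plain",
  pdf = "application/pdf"
)

# Guess a format ID from the file extension alone.
# Returns NA instead of a default when the extension is missing or unknown,
# so the caller can decide whether to stop or ask the user.
guess_format_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  unname(format_map[ext])
}

guess_format_by_extension("observations.CSV")
guess_format_by_extension("README")
```

Returning NA rather than a fallback value keeps the "no default versus incorrect guess" decision with the caller. It also shows the main weakness of the approach: an `.xml` file could be any EML version, so the extension alone cannot pick the right ID.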

There are limitations and even major issues:

From a user perspective, I have been told the guessing is nice but I don't personally feel like it's really necessary. If the format ID isn't guessed, I think giving users a useful mechanism in R to find the available values would be needed. e.g.,

> magicUploadFunction(my_path)
Error: You must specify the format_id argument when using magicUploadFunction. Run `formatsList()` to see a list of possible values.
> formatsList()
format_id                               Name        Type
eml://ecoinformatics.org/eml-2.0.0      EML 2.0.0   METADATA
eml://ecoinformatics.org/eml-2.1.0      EML 2.1.0   METADATA
eml://ecoinformatics.org/eml-2.1.1      EML 2.1.1   METADATA
text/csv                                CSV         DATA
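
A rough sketch of how that behaviour might look in R is below. Everything here is hypothetical: `formats_list()` and `upload_with_format()` are stand-in names, the table is hard-coded from the four rows in the example output, and no upload actually takes place.

```r
# Hypothetical stand-in for a formats listing, hard-coded from the
# example output; a real version would query the available formats.
formats_list <- function() {
  data.frame(
    format_id = c("eml://ecoinformatics.org/eml-2.0.0",
                  "eml://ecoinformatics.org/eml-2.1.0",
                  "eml://ecoinformatics.org/eml-2.1.1",
                  "text/csv"),
    name = c("EML 2.0.0", "EML 2.1.0", "EML 2.1.1", "CSV"),
    type = c("METADATA", "METADATA", "METADATA", "DATA"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical upload wrapper that refuses to guess the format.
upload_with_format <- function(path, format_id) {
  if (missing(format_id)) {
    stop("You must specify the format_id argument. ",
         "Run `formats_list()` to see a list of possible values.",
         call. = FALSE)
  }
  if (!format_id %in% formats_list()$format_id) {
    stop("Unknown format_id: ", format_id, call. = FALSE)
  }
  # A real implementation would create the DataObject and upload it here.
  invisible(list(path = path, format_id = format_id))
}
```

Calling `upload_with_format("my_data.csv")` without a format stops with the hint to run `formats_list()`, while `upload_with_format("my_data.csv", "text/csv")` passes validation.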