ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

Feedback / Improving `set_attributes()` function #151

Closed cboettig closed 5 years ago

cboettig commented 8 years ago

The creating EML vignette shows the set_attributes() function in action, and I think it's generally the right idea but could use some feedback.

I like that it operates on a data.frame as an argument (and, though not shown in the vignette, this is the same data.frame returned by parsing an EML document and calling get_attributes(eml, join=TRUE), which has a nice symmetry). However, it does feel a bit weird that we expect the data.frame to have a very specific structure (e.g. columns must be named attributeName, attributeDefinition, formatString, etc.). I do wonder if it would make more sense to have each of these columns as an explicit argument to set_attributes(). Having them in a data.frame does ensure they are all the same length, etc., but listing them separately would be easier to document.

In either case, it would still be difficult to know what fields are required, particularly as this depends on the kind of attribute (e.g. formatString column is only needed if documenting dateTime attributes). Any advice on improving this interface?

cgries commented 8 years ago

I really like the data frame for documenting attributes. I would even include the column classes in that attribute description data.frame.

I can see shipping the package with a formatted csv file for this data frame containing all the column headings needed to properly document attributes, so that people can fill it out in Excel. Or, even better, have a function that creates it pre-populated with the column names and maybe the R version of the column classes if the data are loaded as a data frame as well. I realize the column classes may have to be changed manually later. Such a csv file would give a great structure for what to put in and especially how many attributes to document, which is very hard when you have more than ~10 columns in a table.
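A minimal sketch of such a helper in base R. The function name attribute_template is hypothetical and not part of the EML package; the column names are the ones mentioned in this thread, and the example data are made up.

```r
# Hypothetical helper (not part of the EML package): build an empty
# attribute-metadata template from a data frame that is already loaded.
attribute_template <- function(df) {
  data.frame(
    attributeName = names(df),
    # keep only the first class, e.g. "POSIXct" for c("POSIXct", "POSIXt")
    columnClasses = unname(vapply(df, function(x) class(x)[1], character(1))),
    attributeDefinition = NA_character_,  # to be filled in by hand
    formatString = NA_character_,         # only needed for dateTime attributes
    stringsAsFactors = FALSE
  )
}

# Made-up example data
example_data <- data.frame(
  sampledate = as.Date(c("2014-01-01", "2014-01-02")),
  wtemp = c(4.1, 4.3),
  flag_wtemp = c("D", NA),
  stringsAsFactors = FALSE
)

template <- attribute_template(example_data)

# Write the template out so it can be completed in a spreadsheet
write.csv(template, file.path(tempdir(), "attribute_template.csv"), row.names = FALSE)
```

The rows come out in the same order as the columns of the data, and the guessed classes may still need manual correction (for example, turning "character" into "factor" for coded columns).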

cboettig commented 8 years ago

Great idea! Yes, I think an Excel template is probably the perfect way to indicate what those columns should be.

cgries commented 8 years ago

In the final EML the attributes are being sorted alphabetically by attributeName. While in theory that shouldn't matter, I think it would be better if they stayed in the sequence in which they appear in the data table and in the data frame.

cgries commented 8 years ago

Sorry, I should have been a little more clear here. When I get the attributeList back from set_attributes, the attributes are sorted alphabetically by attributeName.

mbjones commented 8 years ago

Uh, sorting is probably a bug. EML requires attribute definitions be in the same order as the columns in the file.
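One way to check this before generating EML is to compare the attribute table against the header of the data file. A minimal base R sketch, assuming the metadata table has an attributeName column as in the vignette; the data and definitions here are invented for illustration.

```r
# Toy data table and attribute metadata (illustrative values only)
dat <- data.frame(sampledate = "2014-01-01", wtemp = 4.1, flag_wtemp = "D")
attributes <- data.frame(
  attributeName = c("flag_wtemp", "sampledate", "wtemp"),  # alphabetical order
  attributeDefinition = c("data flag", "sampling date", "water temperature"),
  stringsAsFactors = FALSE
)

# TRUE only if the attributes are listed in the same order as the data columns
in_order <- identical(attributes$attributeName, names(dat))

# Reorder the metadata rows to follow the column order of the data
attributes <- attributes[match(names(dat), attributes$attributeName), ]
```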

cboettig commented 8 years ago

Good catch. get_attributes will now return a data.frame whose rows are always in the order in which the attributes appear in attributeList. I believe set_attributes has always written the attributes in the order given.

cboettig commented 8 years ago

Just added a template. Still a user would probably require documentation to know what fields apply to what measurementScales. Not sure how to address that, but maybe by including an example? Do we need a template for the factors definitions table as well?

One thing about creating an Excel template for the set_attributes table: because most columns only apply to one data type (numeric data, factors, dates, etc) this table can have lots of missing values. It might be easier to define each type in a separate table, but that leaves open the issue of attribute order that @mbjones points out.

Originally eml_get had the option to return separate tables by unit type, which resulted in tables that were both less sparse and smaller (both in width and length) to fit on a screen. While it's obvious what column each refers to based on the attributeName key, this doesn't preserve a sense of order of the attributes so I've dropped that.

maelle commented 8 years ago

FWIW, I am using such a template for creating my first eml. I have a relational database, so I added a column "table" at the beginning, and when I read the file in R I filter relevant information for each table. It looks this way

attributes_ebam <- filter(attributes, table == "ebam")
attributes_ebam <- set_attributes(attributes_ebam)

datatable_ebam <- new("dataTable",
                 entityName = "ebam",
                 entityDescription = "eBAM measurements",
                 physical = set_physical("ebam"),
                 attributeList = attributes_ebam)

But now that I think of it, I could loop over all different tables. Maybe next time, this time I wanted to take my time. :smile:
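The loop mentioned here could start from base R's split(), which keeps rows in their original order within each group. This is a sketch under assumptions: the table names and attribute values are invented, and the call to set_attributes() is left as a comment because its arguments depend on the installed EML version.

```r
# Made-up attribute metadata covering two tables
attributes <- data.frame(
  table = c("ebam", "ebam", "weather"),
  attributeName = c("time", "pm25", "temperature"),
  attributeDefinition = c("measurement time", "PM2.5 concentration", "air temperature"),
  stringsAsFactors = FALSE
)

# One data frame of attribute metadata per table, without the helper "table" column
by_table <- lapply(
  split(attributes, attributes$table),
  function(x) x[, names(x) != "table", drop = FALSE]
)

# Each element could then be passed to set_attributes(), as in the snippet above:
# attribute_lists <- lapply(by_table, set_attributes)
```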

I don't mind having many empty cells. I found it much easier to have all attributes of all tables in the same file. I have another file with custom units. Having them in a second sheet of the same Excel file sounds smarter!

cboettig commented 8 years ago

@masalmon Thanks. By the way, the physical element is designed to indicate how the data is actually stored. By default I have it create documentation for a 'standard' csv file, & ?set_physical shows some of the options for non-standard csv/tsv/text formats. However, relational databases have their own corner of physical, which you can create manually with a lot of new commands but which doesn't have a helper function yet (#103). See https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml-physical.html

@mbjones Could you point us to some examples of EML documenting some SQL database or such?

mbjones commented 8 years ago

@cboettig Relational databases are a bit odd in that most RDBMS do not have an inherent ordering of attributes -- their physical format is generally both proprietary and opaque (and often internally represents the same data in multiple redundant ways, such as in both data structures and indices). I have not seen EML records describing native relational databases in their binary physical format, with the exception of MS Access database files. These too are opaque, and so the ordering of columns is somewhat arbitrary. In these cases, I would recommend using externallyDefinedFormat/formatName to provide the well-known name of the relational format, and then use the normal dataTable and attributeList to describe the logical structure. EML does not provide a mechanism to map between the logical model of a relational table and the proprietary physical format -- this would really be out of scope. Attribute ordering only really applies to files with textFormat descriptions such as CSV files that can be openly read.

In most cases, particularly for archiving, it would be better to export a proprietary relational file to open CSV or another open format, and describe that before storing the package in a repository.

maelle commented 8 years ago

@mbjones so the physical information is repeated for each table?

cboettig commented 8 years ago

Yes, I believe that's correct -- if each table is in a different CSV file, say, then we need a physical node giving the name / location of each file.

maelle commented 8 years ago

@cboettig ok, thanks, then I'll add information about the database itself elsewhere (& anyway I build it using csv files)

cgries commented 8 years ago

@cboettig the attributes in my EML file are still sorted alphabetically. I wonder if I am getting the latest updates. The readme file seems to have two different places to install from: install.packages("EML", repos = c("http://packages.ropensci.org", "https://cran.rstudio.com")) or devtools::install_github("cboettig/EML")

The latter doesn't work for me, but it works if I use devtools::install_github("ropensci/EML")

but is that the latest version? Sorry for still being somewhat github illiterate.

cboettig commented 8 years ago

@cgries Argh, that's my mistake! yes, your last line is correct,

devtools::install_github("ropensci/EML")

will always install the latest version. The previous option,

install.packages("EML", repos = c("http://packages.ropensci.org", "https://cran.rstudio.com"))

is just there because it doesn't require devtools. It is almost as recent, but is based on nightly builds, so if I've just pushed a bug fix earlier that day, then it won't include that fix yet.

Sorry about that sorting behavior. Definitely try with the latest version, and if it still happens, can you paste some code I can run locally to reproduce what you are seeing? I'll try to track down the bug.

cgries commented 8 years ago

@cboettig here is the code:

attributes <- read.csv("sparkling2014Metadatawtemp.csv", header = TRUE, sep = ",", quote = "\"", as.is = TRUE)

# get the column classes into a vector as required by the set_attribute function
col_classes <- attributes[,"columnClasses"]

# take that column out of the meta data frame again
attributes$columnClasses <- NULL

# add code definitions for the flag column - format for readability
flag_wtemp <- c(D = "Sensor malfunction produced bad values set to missing",
                H = "Data suspect values outside of expected range",
                L = "Non standard routine followed")

# turn them into a data frame with three columns - attributeName, code, definition
factors <- rbind(
  data.frame(
    attributeName = "flag_wtemp",
    code = names(flag_wtemp),
    definition = unname(flag_wtemp)
  )
)

attributeList <- set_attributes(attributes, factors, col_classes)

and then the attributes come out sorted.

I am attaching the meta data file, just change the .txt back to .csv, it wouldn't let me upload a .csv file. sparkling2014Metadatawtemp.txt

cboettig commented 8 years ago

@cgries Thanks! Should be fixed now.

cgries commented 8 years ago

@cboettig yes!! thank you so much, it's working great!

cboettig commented 5 years ago

looks like this was resolved.