ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

Allow all of the entities in the metadata to be ordered #316

Open laijasmine opened 3 years ago

laijasmine commented 3 years ago

When we use read_eml, it groups the entities by type into lists. As a result, users editing metadata with this package currently cannot specify the order of all the entities (spatialVector, dataTable, otherEntity, etc.) relative to one another.

We can modify the order within a group (for example, all of the dataTables):

datatable 1, datatable 2, otherEntity 1, otherEntity 2

However, if you want the different entity types interspersed, that is not possible:

datatable 1, otherEntity 1, datatable 2, otherEntity 2

mbjones commented 3 years ago

Thanks, @laijasmine. Just to clarify, the issue here is that the schema model in the EML XSD for dataset is:

dataset = ...a bunch of fields... (dataTable | spatialRaster | spatialVector | storedProcedure | view | otherEntity)* ... bunch more fields...

So the entities can come in any order, and that order is under author control. The R EML package should not be reordering these elements, as the order may have meaning or utility to the author. Other XML tools in other languages routinely preserve this order, and this seems to be an implementation issue, likely in the emld JSON handling. So I am flagging this as a bug.

amoeba commented 3 years ago

Looks like something we can fix. emld uses lists, which do allow duplicate names in the same list. For example, I can mix and match entity types within my document and see that R preserves the order:

> names(my_eml$dataset)
[1] "title"       "creator"     "contact"     "dataTable"   "otherEntity" "dataTable"  
[7] "otherEntity"

but when serialized, two things happen:

1. The entity types are grouped together, breaking user-specified order.

2. Element names like <otherEntity.1> are used for the extra entity types.

The routine we use in emld appears to perform a schema-aware sort before serializing, probably to give users the power to specify elements in an order other than schema order but still get valid EML. So I think a change here has a cost, and we should weigh it against the benefit. @cboettig do you have any preference here?
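To make the grouping effect concrete, here is a minimal base R sketch. It is not emld's actual routine: `schema_order` is a simplified, hypothetical stand-in for the schema-derived ordering, and the list values are placeholders. It only illustrates how a stable sort keyed on element name can yield schema-ordered output while collapsing interleaved entity types into groups.

```r
# Hypothetical dataset list with interleaved entity types (placeholder values).
dataset <- list(
  title       = "Example",
  dataTable   = "datatable 1",
  otherEntity = "otherEntity 1",
  creator     = "A. Author",
  dataTable   = "datatable 2",
  otherEntity = "otherEntity 2",
  contact     = "A. Author"
)

# Simplified, assumed element order; the real EML schema has many more fields.
schema_order <- c("title", "creator", "contact", "dataTable", "otherEntity")

# Rank each element by the position of its name in schema_order, then sort.
# order() leaves ties in their original relative order, so the sort is stable.
rank   <- match(names(dataset), schema_order)
sorted <- dataset[order(rank)]

# Element names after sorting, and the entity values in their new order.
sorted_names  <- names(sorted)
entity_values <- unlist(
  sorted[sorted_names %in% c("dataTable", "otherEntity")],
  use.names = FALSE
)
print(sorted_names)
print(entity_values)
```

In this sketch the user may list elements in any order and still get schema order back, which is the convenience described above; the price is that the two dataTable entries end up adjacent, ahead of both otherEntity entries.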

cboettig commented 3 years ago

Yup, I'm not sure there's a straightforward solution here. emld is using a JSON-LD model, which gives explicitly semantic interpretation to EML elements. Like any semantic data, all meaning must be explicit and not implied by ordering -- the JSON-LD operations being used (expansion, compaction) do not guarantee order, only semantics. emld enforces the ordering constraints imposed by the schema, so the order it produces will always be technically valid, but it does not allow users to encode information in ordering.

This semantic assertion being made by emld is of course technically incorrect, and creates several edge cases that fail (docbook-based markup being another, and some special edge cases in XML namespaces are probably yet another) without special handling. Arguably the package shouldn't be mapping into JSON-LD at all. But I think as a mental model, the user gains quite a bit of simplicity.

As @amoeba notes, this has the nice side effect that a user doesn't need to remember the right ordering; though of course a smart enough tool should be able to keep that feature while still preserving the user's ordering. The question in my mind is then how one goes about capturing the semantically meaningful ordering explicitly in the JSON representation, so that it can be preserved.
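One possible answer to that question, sketched in base R under stated assumptions: record each entity's original position as an explicit field before any order-agnostic processing, then sort on that field when serializing. The field name `position` and both helper functions are hypothetical and are not part of emld; the grouping step merely simulates a processor that does not guarantee order.

```r
# Tag every element of a list with its original index (hypothetical helper).
tag_positions <- function(x) {
  Map(function(value, i) list(value = value, position = i), x, seq_along(x))
}

# Restore the original order from the explicit position field (hypothetical helper).
restore_order <- function(x) {
  pos <- vapply(x, function(el) el$position, integer(1))
  x[order(pos)]
}

entities <- list(
  dataTable   = "datatable 1",
  otherEntity = "otherEntity 1",
  dataTable   = "datatable 2",
  otherEntity = "otherEntity 2"
)

tagged <- tag_positions(entities)

# Simulate an order-agnostic step that groups elements by name.
grouped <- tagged[order(names(tagged))]

# Because the position is explicit data, the interleaving can be recovered.
restored        <- restore_order(grouped)
restored_values <- vapply(restored, function(el) el$value, character(1), USE.NAMES = FALSE)
print(restored_values)
```

The design choice here is to treat order as ordinary metadata rather than as a property of the container, which fits the "all meaning must be explicit" principle; whether such a field could survive JSON-LD expansion and compaction in emld is not something this sketch establishes.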

Not to dismiss the issue, but I'm not quite sure I understand the use case(s) for controlling the order. I don't tend to think of the raw XML as being a 'display' format, and it seems like it should be possible to control the display (in, say, an HTML rendering) independent of the order. In my experience, it is usually better to express the relationship between, say, datatable 1 and otherEntity 1 more explicitly; having it merely implied by the ordering can be fragile, even when working purely in XML-based tools.

amoeba commented 3 years ago

Thanks @cboettig.

"Not to dismiss the issue, but I'm not quite sure I understand the use case(s) for controlling the order..."

The purpose here, as I understand it, is to control the display/rendering of the data in the XML within a web context (on our data catalog). Is that about right, @laijasmine? If you have an example of a dataset where you'd like to see the entity types interspersed amongst each other, it might help here too.

So I think you're right in pointing out that there are more appropriate ways than tweaking the serialization order to get the desired effect. Unfortunately, we don't have something that seems like the right fit to do this at the moment.

mbjones commented 3 years ago

Thanks @cboettig. My take on this is that, by using JSON-LD as our internal representation, we have a lossy intermediate format that drops some of the explicit semantics that are present in XML. In XML parsers, repeating elements are always preserved in document order (although attributes are explicitly not). This is the basis on which XPath predicates work, as they have positional selectors (e.g., the second creator child //dataset/creator[2], the last creator child //dataset/creator[last()], or even the next-to-last child of dataset //dataset/*[last()-1]) that are by definition dependent on the ordering of repeating elements. If repeating elements could be reordered arbitrarily by a processor, then those XPath processors would return arbitrary results and would be useless. So, by going through JSON-LD, we are losing data, and I think the intermediate JSON-LD representation needs to be supplemented to preserve document order just as a compliant XML parser would.
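The positional selectors mentioned above can be tried directly with the xml2 package, which works on the XML itself rather than on a JSON-LD model. This is a small illustration on a made-up document, assuming xml2 is installed; the element contents are placeholders and the fragment is not valid EML.

```r
library(xml2)  # assumes the xml2 package is installed

# Made-up document with repeated and interleaved children (not valid EML).
doc <- read_xml(
  "<dataset>
     <creator>A</creator><creator>B</creator><creator>C</creator>
     <dataTable>datatable 1</dataTable>
     <otherEntity>otherEntity 1</otherEntity>
     <dataTable>datatable 2</dataTable>
     <otherEntity>otherEntity 2</otherEntity>
   </dataset>"
)

# Positional predicates depend on the document order of repeated elements.
second_creator <- xml_text(xml_find_first(doc, "//dataset/creator[2]"))
last_creator   <- xml_text(xml_find_first(doc, "//dataset/creator[last()]"))
next_to_last   <- xml_text(xml_find_first(doc, "//dataset/*[last()-1]"))

# Node sets come back in document order, so the interleaving is visible.
entity_names <- xml_name(
  xml_find_all(doc, "//dataset/dataTable | //dataset/otherEntity")
)

print(c(second_creator, last_creator, next_to_last))
print(entity_names)
```

If a processor regrouped the entity elements before serializing, the `*[last()-1]` query would select a different node, which is the kind of data loss described above.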

cboettig commented 3 years ago

@mbjones yes, that's precisely what I was trying to say, too. I think the JSON-LD format is explicitly lossy, and though you could probably still hack around it, the JSON-LD representation would really need to be supplanted for the package to behave like a compliant XML parser.

I guess I was arguably trying to suggest that this bug is a feature, as it doesn't allow you to encode information in ordering even though XML allows that (e.g., for comparison, NeXML explicitly cites being order-independent to be consistent with semantic principles, even though it's also based on XML -- just because order is inherent to XML does not mean it must be respected by an XML-based standard). Of course EML doesn't make such assertions, but nor does it define an explicit meaning that is supposed to be implied by the order. In general, I think there's value in making the implicit relationship suggested above more explicit in metadata, and I think there is value in separating issues of presentation/appearance from issues of meaning. All of which is to say that emld was a particularly opinionated take on working with EML that breaks away from some XML-only aspects very intentionally, and maybe wasn't the best choice for what is intended as a more general tool in the EML package (though there already are xml2 and XML for working directly with XML following XML rules). I also do think it would be convenient if there were an official JSON serialization of EML, just as there now is for NeXML.

I actually don't think it would be that hard to do a list-based EML without the emld dependency, and might be much cleaner and lighter overall as well.