trellis-ldp / trellis

Trellis is a platform for building scalable Linked Data applications
https://www.trellisldp.org
Apache License 2.0

User Provided PROV Metadata #488

Closed lgleim closed 5 years ago

lgleim commented 5 years ago

I love that Trellis automatically collects provenance data w.r.t. the creation and modification history of its managed resources. However, I currently do not see a "correct" way to store user-provided provenance data, such as additional information about the Used resources, which other Entities a given entity was derived from, whether it is a new revision of another entity that previously had a different identifier, etc. Since the code base already contains definitions for the entire PROV vocabulary, I was wondering whether you might be interested in accepting a pull request that enables user-supplied PROV provenance information. I am also currently evaluating what the "correct" way to do this would be, since I am not aware of any standard w.r.t. providing provenance information to an LDP server or even any REST endpoint. As far as that goes, I am only aware of the PROV-AQ standard.

ajs6f commented 5 years ago

Hi, @lgleim!

I am not aware of any standard w.r.t. providing provenance information to an LDP server

You're not missing anything-- there is none, because none is needed. If you'd like to store user-provided provenance information in your LDP resource, just put those triples in with the rest of the RDF. Trellis offers the additional ability (not an LDP function) to allow you to customize the framework's generation of immutable (audit) triples, but that requires coding an AuditService implementation.

Or do you mean something else?

lgleim commented 5 years ago

If you'd like to store user-provided provenance information in your LDP resource, just put those triples in with the rest of the RDF.

Does this also work for Non-RDF Sources though?

Implementing an extended AuditService is exactly what I first had in mind, especially given that org.trellisldp.api.ResourceService and org.trellisldp.api.AuditService may be implemented independently of each other for the Persistence Layer and thus end up in different storage systems. EDIT: I think I misunderstood this one from the documentation. It now seems quite clear from the AuditService interface that it in fact does not connect to backend storage on its own.

Given the Memento/TimeGate implementation, my understanding is that all records are effectively immutable anyway, correct?

acoburn commented 5 years ago

Does this also work for Non-RDF Sources though?

Yes, Non-RDF resources each have an RDF description (follow the rel="describedby" Link header), and you can place arbitrary RDF there.
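To make the "follow the rel="describedby" Link header" step concrete, here is a minimal sketch of how a client might pull the description URL out of the Link headers of a response. The class name, method name and example URLs are hypothetical, and the parsing is deliberately simplified (it assumes no commas inside quoted parameter values), so treat it as an illustration rather than a complete Link header parser.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescribedByLink {
    // One link-value, e.g. <https://example.org/x>; rel="describedby"
    // Simplification: parameters are assumed not to contain commas.
    private static final Pattern LINK = Pattern.compile("<([^>]*)>\\s*;([^,]*)");
    private static final Pattern REL = Pattern.compile("rel\\s*=\\s*\"?([^\";]+)\"?");

    /** Returns the target of the first link whose rel includes "describedby". */
    public static Optional<String> describedBy(List<String> linkHeaders) {
        for (String header : linkHeaders) {
            Matcher link = LINK.matcher(header);
            while (link.find()) {
                Matcher rel = REL.matcher(link.group(2));
                if (!rel.find()) {
                    continue;
                }
                // A rel parameter may carry several space-separated relation types.
                for (String type : rel.group(1).trim().split("\\s+")) {
                    if (type.equalsIgnoreCase("describedby")) {
                        return Optional.of(link.group(1));
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

The returned URL is where the user-provided PROV triples would then be written; which HTTP method to use for that update (for example PUT or PATCH) depends on the server and is not shown here.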

Given the Memento/TimeGate implementation, my understanding is that all records are effectively immutable anyway, correct?

In a sense, yes, Memento resources are immutable, but there are some distinctions that one can make between the immutability of Mementos and the immutability of audit records: first, mementos have a time resolution of 1 second (as per the specification) and audit records don't have that restriction. Second, a deployable trellis system may choose to enable/disable either or both of these services, which may affect which data is available.

The way I tend to think about the two systems is that the audit stream can tell you who modified a record at a particular moment in time while the memento system can give you snapshots of the resource at particular time intervals. Taking those two together, one could write code that would allow one to find which triples were added/modified/deleted by which user and at what time.

ajs6f commented 5 years ago

I would add that a Trellis instance can add arbitrary information (in RDF) to the audit/immutable data, including information that never appeared as part of the resource (or in the case of NonRDFs, its description). That's not at all true of Mementos-- they are supposed to be only what the RDF of that resource (or presumably, its description) was at that time.

ajs6f commented 5 years ago

It now seems quite clear from the AuditService interface that it in fact does not connect to backend storage on its own.

That's quite true, but it is also true that you can choose to implement different storage for mutable (user-provided) and immutable (Trellis-provided) RDF. trellis-cassandra does just that, with entirely different data layouts, and you could certainly use two persistence services separated in any way you want. The split comes in ResourceService, where some methods deal with immutable and others with mutable information. Hopefully, the details are clear from the Javadocs, and if not, holler!

lgleim commented 5 years ago

First of all thank you very much for the quick and detailed responses!

Yes, Non-RDF resources each have an RDF description (follow the rel="describedby" Link header), and you can place arbitrary RDF there.

Again, a few (too many) questions just to make sure I am not getting this wrong:

  1. While the LDP Primer provides such a (non-normative) example, this is not actually standardized behavior, correct?
  2. Supplying this metadata requires at least one additional round trip (create Non-RDF, get Response with rel="describedby" Link header, post metadata to that URI, wait for response)?
  3. There will still be a distinct immutable audit record (if enabled), independent of the description?
  4. The description is in no way enforced to be an RDF Resource (c.f. https://www.w3.org/TR/ldp/#link-relation-describedby)?

In a sense, yes, Memento resources are immutable, but there are some distinctions that one can make between the immutability of Mementos and the immutability of audit records: first, mementos have a time resolution of 1 second (as per the specification) and audit records don't have that restriction.

  1. If the update rate is higher than 1 record per second (e.g. 1 record/ms), which record would be retrieved by the memento identifier? (probably the one closest to the provided timestamp as required by the standard?)
  2. Does the trellis API provide a way to retrieve Mementos at a granularity finer than what the Accept-Datetime request header allows (e.g. Thu, 31 May 2007 20:35:00 GMT)? Maybe something like https://example.org/repository/resource?version=1508889600734.012345? Or are they not actually stored in this case?

I would add that a Trellis instance can add arbitrary information (in RDF) to the audit/immutable data, including information that never appeared as part of the resource (or in the case of NonRDFs, its description).

So this should conceptually always happen in the AuditService, correct? Or is there also justification to do this in the ResourceService?

That's not at all true of Mementos-- they are supposed to be only what the RDF of that resource (or presumably, its description) was at that time.

This completely makes sense to me. As such it also makes sense to keep Provenance information separate. However:

  1. The Prefer: return=representation; include="http://www.trellisldp.org/ns/trellis#PreferAudit" Headers are completely custom though and the retrieval process not standardized anywhere, correct?
  2. How do you envision rel="describedby"-type metadata and Trellis Audit Records to play together in the best case and does this differ for RDF / Non-RDF Resources?

The split comes in ResourceService, where some methods deal with immutable and others with mutable information. Hopefully, the details are clear from the Javadocs, and if not, holler!

Thanks so much for the pointer! :)

acoburn commented 5 years ago

While the LDP Primer provides such a (non-normative) example, this is not actually standardized behavior, correct?

This is part of the LDP specification

Supplying this metadata requires at least one additional round trip (create Non-RDF, get Response with rel="describedby" Link header, post metadata to that URI, wait for response)?

Yes. LDP can be a bit "chatty", which is where HTTP/2 comes in handy.

There will still be a distinct immutable audit record (if enabled), independent of the description?

Yes.

The description is in no way enforced to be an RDF Resource (c.f. https://www.w3.org/TR/ldp/#link-relation-describedby)?

The LDP specification does not require this, but that is how it is implemented in Trellis (i.e. that any description for a NonRDFSource is an RDFSource)

If the update rate is higher than 1 record per second (e.g. 1 record/ms), which record would be retrieved by the memento identifier? (probably the one closest to the provided timestamp as required by the standard?)

This is, again, an implementation decision. The existing implementations store mementos at a 1 second resolution because the sub-second rounding issues turn out to be really confusing to HTTP clients. In this sense, if two mementos are written in the same second, the implementation would decide which one is kept.

Does the trellis API provide a way to retrieve Mementos at a granularity finer than what the Accept-Datetime request header allows (e.g. Thu, 31 May 2007 20:35:00 GMT)? Maybe something like https://example.org/repository/resource?version=1508889600734.012345? Or are they not actually stored in this case?

The version parameter used to support microsecond precision, but the interaction ended up being more confusing to HTTP clients. It is largely a question of this: given a java.time.Instant, what URL is generated to correspond to that value? At present, the Instant is truncated to second precision and the URL is generated from that, but this could definitely be revisited.
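As a rough illustration of the truncation described above, the sketch below maps a java.time.Instant to a version URL at second precision. The class and method names are hypothetical, and the ?version=<epoch seconds> format is an assumption borrowed from the example URL in the question, not necessarily the format Trellis generates.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MementoUrl {
    /**
     * Builds a version URL from an Instant, discarding sub-second precision.
     * The "?version=" query format is assumed for illustration only.
     */
    public static String versionUrl(String resourceUrl, Instant time) {
        Instant truncated = time.truncatedTo(ChronoUnit.SECONDS);
        return resourceUrl + "?version=" + truncated.getEpochSecond();
    }
}
```

With this mapping, two writes that land in the same second produce the same URL, which is why an implementation has to decide which of them the URL refers to.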

So this should conceptually always happen in the AuditService, correct? Or is there also justification to do this in the ResourceService?

One could do this in the ResourceService, but I would argue that, if a client sends a particular RDF graph to the server, the resource server should accept that graph as the resource without adding additional information. I do make an exception for LDP types in that (via configuration) one can have a resource service add in the particular LDP type of the resource as a triple on GET requests, but that sort of thing is purely an implementation decision.

The Prefer: return=representation; include="http://www.trellisldp.org/ns/trellis#PreferAudit" Headers are completely custom though and the retrieval process not standardized anywhere, correct?

Yes, that is pure invention, though it's also entirely allowable under the definition of the various specifications. Effectively, it makes use of existing extension mechanisms to define these sorts of extensions.
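For reference, a request that asks for the audit triples via this Prefer extension could be built roughly as follows, using the JDK's java.net.http client (Java 11+). The header value is the one quoted above; the class name, the resource URL passed in, and the choice of text/turtle as the Accept type are assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AuditRequest {
    // Preference IRI quoted in the discussion above.
    static final String PREFER_AUDIT = "http://www.trellisldp.org/ns/trellis#PreferAudit";

    /** Builds (but does not send) a GET request asking for audit triples. */
    public static HttpRequest build(String resourceUrl) {
        return HttpRequest.newBuilder(URI.create(resourceUrl))
                .header("Prefer", "return=representation; include=\"" + PREFER_AUDIT + "\"")
                .header("Accept", "text/turtle") // assumed serialization
                .GET()
                .build();
    }
}
```

The request can then be sent with java.net.http.HttpClient; a server that does not understand the preference is free to ignore it, which is what makes this kind of extension safe under the Prefer header's semantics.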

How do you envision rel="describedby"-type metadata and Trellis Audit Records to play together in the best case and does this differ for RDF / Non-RDF Resources?

For example: given a series of Mementos (M1, M2 and M3) and a series of corresponding audit records (A1, A2, A3), if there were three triples added between M1 and M2, one should be able to infer that the agent listed in A2 was the one who made those changes. For NonRDF resources, a client can examine the two resources by downloading them and comparing the bits. If they have changed, between, say M2 and M3, one can infer that the agent described in A3 was the one who made that change.
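The inference described above can be sketched as a plain set difference between two Memento snapshots. This is a minimal illustration, not Trellis code: triples are modelled as opaque strings (for example N-Triples lines), the class and method names are hypothetical, and blank nodes, which need graph-aware comparison, are ignored.

```java
import java.util.Set;
import java.util.TreeSet;

public class MementoDiff {
    /** Triples present in the later snapshot but not in the earlier one. */
    public static Set<String> added(Set<String> earlier, Set<String> later) {
        Set<String> result = new TreeSet<>(later);
        result.removeAll(earlier);
        return result;
    }

    /** Triples present in the earlier snapshot but not in the later one. */
    public static Set<String> removed(Set<String> earlier, Set<String> later) {
        Set<String> result = new TreeSet<>(earlier);
        result.removeAll(later);
        return result;
    }
}
```

Calling added(m1, m2) and removed(m1, m2) gives the changes between M1 and M2, which can then be attributed to the agent named in the audit record A2.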

ajs6f commented 5 years ago

Just a few follow-ups:

The existing implementations store mementos at a 1 second resolution because the sub-second rounding issues turn out to be really confusing to HTTP clients.

This is not true of trellis-cassandra. It stores all Mementos and if asked for a Memento when more than one would satisfy, normally chooses the most recent. That's a bit arbitrary, but I wanted to at least record all the possibly-interesting info.
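One way to picture the "store everything, return the most recent" policy is the sketch below. It is not the trellis-cassandra implementation; the class and method names are hypothetical, and it assumes stored Memento times are kept in a sorted set at full precision while requests arrive at second precision.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.NavigableSet;
import java.util.Optional;

public class MementoSelector {
    /**
     * Picks the most recent stored time that falls before the end of the
     * requested second. If nothing was stored in that second, this falls
     * back to the latest earlier time, or is empty if there is none.
     */
    public static Optional<Instant> select(NavigableSet<Instant> stored, Instant requested) {
        Instant nextSecond = requested.truncatedTo(ChronoUnit.SECONDS).plusSeconds(1);
        // lower() returns the greatest element strictly less than the argument, or null.
        return Optional.ofNullable(stored.lower(nextSecond));
    }
}
```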

So this should conceptually always happen in the AuditService, correct? Or is there also justification to do this in the ResourceService?

@acoburn was relatively mild in his response: I would go further. I think you should always add immutable tuples in the AuditService. I think adding them in the ResourceService could be very, very confusing over the long haul.