w3c / tdm-reservation-protocol

Repository of the Text and Data Mining Reservation Protocol Community Group
https://www.w3.org/community/tdmrep/

Comment on 'Proposal based on http headers' #10

Closed claudiotubertini closed 3 years ago

claudiotubertini commented 3 years ago

The proposal is a good starting point but I'd like to add a few lines on what in that document is called 'license'. To set the context I refer to the use cases in https://w3c.github.io/poe/ucr/ and to the ODRL modeling in https://w3c.github.io/odrl/bp/.

TDM-a tells us what the publisher wants to do, while TDM-b gives us the URL where we hope to find further specifications. The information resource (web page) that we have just downloaded with those response headers (TDM-a and TDM-b) may contain many licensed objects: images, texts, videos, etc. The page with the license will detail all aspects using ODRL modeling, referring to every asset by its URI. We can follow this route, but a response header is collected only after we have found a promising page. Very often we have to answer questions such as "give me photos of Rome that I can publish in commercial applications" (see https://w3c.github.io/odrl/bp/). In this case it seems that we have to search all license pages and find the correct one. The burden will be on the application and it should not bother us, but we must be prepared for the license information to come before the response header, not the other way round. The response header approach may be of some use, but I doubt it will be central to our problem. It is more likely to be useful in a world of static text documents than in content-meshing applications.

giuliamarangoni commented 3 years ago

Dear Claudio, I'm not sure I fully understand your point, but your message made me think about a couple of issues. You said:

TDM-a tells us what the publisher wants to do, while TDM-b gives us the URL where we hope to find further specifications. The information resource (web page) that we have just downloaded with those response headers (TDM-a and TDM-b) may contain many licensed objects: images, texts, videos, etc. The page with the license will detail all aspects using ODRL modeling, referring to every asset by its URI.

To my understanding, the proposal based on HTTP headers would allow associating a TDM declaration (and a license, if available) with any web resource, and thus also, at a granular level, with individual objects in the HTML page. So, in principle, you may have, for any downloaded object in the HTML page, an HTTP response with TDM-a and TDM-b. In that case, the license would come after, and not before, the HTTP response header.
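To make this per-resource reading concrete, here is a minimal Python sketch of how a TDM agent might read the declaration from the response headers of a single resource. It is only an illustration under stated assumptions: `TDM-a` and `TDM-b` are the placeholder header names used in this thread, the mapping of `1` to "reserved" and `0` to "not reserved" is assumed, and `TdmDeclaration`, `read_tdm_headers` and the example.org URL are hypothetical.

```python
from typing import Mapping, NamedTuple, Optional


class TdmDeclaration(NamedTuple):
    reserved: Optional[bool]    # None when no usable TDM-a header was sent
    license_url: Optional[str]  # None when no TDM-b header was sent


def read_tdm_headers(headers: Mapping[str, str]) -> TdmDeclaration:
    """Extract the TDM declaration from ONE resource's response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    raw = lowered.get("tdm-a")
    # Assumption: "1" means rights reserved, "0" means not reserved.
    # Any other value is treated as "no declaration".
    reserved = {"1": True, "0": False}.get(raw) if raw is not None else None

    return TdmDeclaration(reserved, lowered.get("tdm-b") or None)


# Each object fetched for a page has its own response, hence its own declaration.
page = read_tdm_headers({"TDM-a": "1", "TDM-b": "https://example.org/license"})
image = read_tdm_headers({"Content-Type": "image/png"})
print(page)
print(image)
```

With this reading, the declaration found on the HTML page says nothing about the image: the agent has to look at the image's own response.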

I'm not sure whether the current HTTP-based solution would support your example. Indeed, the TDM-a value says whether TDM rights are reserved or not for the resource to which the HTTP response refers. In your example, the TDM declaration (TDM-a) referring to one web resource (the HTML page) is associated (via TDM-b) with a TDM license for multiple resources. I think this case would probably be better handled by applying the TDM declaration at the level of the individual HTTP responses for the individual objects nested in the HTML page.

Otherwise, if both approaches were supported, there would be a risk of ambiguity in the semantics of the TDM-a value. When the value is 0 or 1, how could the TDM agent know whether the TDM declaration returned in the HTTP response of a web page applies only to the HTML content (i.e. to one resource) or also to the individual objects contained in it (i.e. to multiple resources)?

Anyway, I hope that we can talk about it at the next meeting. Giulia

claudiotubertini commented 3 years ago

You are mostly right, and you are also right in the last lines, where you fear a risk of ambiguity. Let's take an example. You can open any web page you like, for example https://www.nltk.org/book/ch01.html, with the browser's developer tools. I enclose a screenshot (Screenshot from 2021-04-01 20-08-56). You can see that you download a bunch of files, each with its own HTTP header, but we are interested in what we find inside the page, perhaps where there is a long citation from Moby Dick. The author is the owner of the main text, but there may be excerpts owned by others, so we have to look at the details of the content in the page. I want to add that I find the proposal based on HTTP headers very good in terms of progressiveness, but I believe it cannot be exhaustive for the many cases we can find out there. P.S. There is a good chance that I do not understand the legal context framed by the EU regulation. In that case, try to be as forgiving as you can.
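The distinction can be sketched in Python. The snippet below is only an illustration: the HTML string, the example.org URLs and the `EmbeddedResourceCollector` class are hypothetical, and the list of tags is deliberately incomplete. It collects the embedded objects that a browser would fetch separately, each with its own response headers, and shows that a quotation inside the page text never appears among them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceCollector(HTMLParser):
    """Collect absolute URLs of objects that are fetched separately."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Only a few common embedding tags are handled in this sketch.
        if tag in ("img", "script", "video", "audio", "source", "iframe"):
            attr = "src"
        elif tag == "link":
            attr = "href"
        else:
            return
        value = dict(attrs).get(attr)
        if value:
            # Resolve relative references against the page URL.
            self.urls.append(urljoin(self.base_url, value))


html = (
    '<link rel="stylesheet" href="/style.css">'
    "<p>Call me Ishmael.</p>"  # quoted text: part of the page, no URI of its own
    '<img src="images/whale.png">'
)
collector = EmbeddedResourceCollector("https://example.org/book/ch01.html")
collector.feed(html)
print(collector.urls)
```

The stylesheet and the image can each carry their own headers, while the quoted sentence is covered only by whatever is declared for the page as a whole.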

llemeurfr commented 3 years ago

Dear @claudiotubertini, your last comment is about the legal context of the EU regulation. I can only say that the scope is wide: "lawfully accessible works and other subject matter" can be virtually anything. In the context of the Web, we scoped it as "Web resources", and we defined what we mean by this in our vocabulary.

This means that we did not envisage working on sections, parts or fragments of Web resources; the Web resource (the content you can fetch with an HTTP GET) is the unit of work. Trying to break the Web resource into smaller pieces would, in my opinion, greatly complicate the solution we are looking for, both for rightsholders and for TDM Agents.

claudiotubertini commented 3 years ago

Thank you for your clear answer. I must admit I hadn't noticed that the "web resource", identified by a slash URI as opposed to a hash URI, should be the atomic element with which we have to work.
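As a small illustration of the slash versus hash distinction, the following Python sketch reduces any URI to the web resource that an HTTP GET actually retrieves, since the fragment is never sent to the server. The helper name `web_resource_uri` and the `#moby-dick` fragment are hypothetical.

```python
from urllib.parse import urldefrag


def web_resource_uri(uri: str) -> str:
    """Return the URI without its fragment, i.e. the fetchable resource."""
    return urldefrag(uri).url


# A hash URI pointing inside the page maps to the same unit of work
# as the page itself.
whole_page = web_resource_uri("https://www.nltk.org/book/ch01.html")
fragment = web_resource_uri("https://www.nltk.org/book/ch01.html#moby-dick")
print(whole_page == fragment)
```

Under this reading, a TDM declaration attaches to the defragmented URI, and anything addressed only by a fragment inherits it.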