solid / specification

Solid Technical Reports
https://solidproject.org/TR/
MIT License

Content of Turtle and RDFa documents should be wholly and entirely preserved #342

Open TallTed opened 2 years ago

TallTed commented 2 years ago

Originally discussed by @RubenVerborgh, @kjetilk, and @TallTed in https://github.com/solid/specification/issues/301#issuecomment-955206910 and preceding

[@TallTed] I am rather weary of explaining this over and over, and hearing back, "we don't care about your idea of what data is."

[@RubenVerborgh] I do care about your idea; but we need some pointers to back that up. And insights into the consequences of adopting that viewpoint versus another; I see many things we'd not be able to do, not really gains. I don't know any sources that consider RDF whitespace and comments to be data.

If they need to be data, then the MIME type should not be RDF, but rather text/plain or similar (or maybe even better application/octet-stream to avoid encoding issues), and then data will be preserved.

[@TallTed] then non-RDF contents of Turtle and RDFa files MUST be preserved and these files MUST be treated as ldp:NonRDFSource.

[@RubenVerborgh] A citation would be helpful.

My understanding of the matter is that a content type determines what transformations are allowed with preservation of semantics. This is not unique to RDF at all. Minified JSON carries the same meaning as the unminified version.

(@TallTed Should you want to continue the discussion, could you please open another issue as per @kjetilk's suggestion?)

TallTed commented 2 years ago

There is no citation for you, @RubenVerborgh, because those MUSTs are not taken from RFC, REC, or the like. They do, however, flow from the understanding that I believe was commonly shared among the LDP WG when writing the LDP REC, upon which Solid at least claimed to be based at one time, even if Solid is now dis-claiming that basis.

One of Solid's early claims, if not promises, was to be a filesystem-ish datastore, which could be backed by an RDF store and could support SPARQL over the stored data, but both SPARQL and RDF features were considered a heavy-lift, so these were not promised.

As an author of RDFa and Turtle documents, I consider the HTML content in the former and the whitespace and comments in the latter to be important -- else I would not have created RDFa at all, nor added whitespace and comments to the latter to make human processing of these files easier.

In other words, if I did not care about the non-RDF content of these documents, I would have chosen a more "pure" RDF media type that did not support comments and/or author-applied whitespace -- which might include JSON-LD, which certainly does not support comments, and does not promise to maintain whitespace.

Still, even with more "pure" RDF documents, I would not expect them to be broken down into their component RDF and only that saved in an RDF store without clear warning and user choice. I would generally expect the documents to be preserved as documents, complete with all their metadata (creation date in particular, but also modification date, and any other metadata supported by and transported from their origin to the Solid store).

To some of your specific comments...

I don't know any sources that consider RDF whitespace and comments to be data.

There are no "RDF whitespace and comments". There are syntactically valid, and presentationally and contextually valuable, whitespace and comments in Turtle documents. There is syntactically valid, presentationally and contextually valuable, content (a/k/a data) in RDFa documents.

My understanding of the matter is that a content type determines what transformations are allowed with preservation of semantics. This is not unique to RDF at all. Minified JSON carries the same meaning as the unminified version.

I, and I am quite sure others, would not be happy to have some of their data discarded without warning because you, and possibly others specifying and coding Solid and related tools, decided that their Turtle- or RDFa-contained data was not really data.

To my knowledge, there is nothing about the IANA-maintained media type lists which says "this part of this media type's content is real data and SHOULD/MUST be retained, but that part of this media type's content is not 'real' data and MAY be discarded."

Minified and unminified JSON (and JSON-LD) are trivially and losslessly transformable into each other. There are no comments to be retained or lost, because JSON doesn't support comments (as of today, that is; I believe an upcoming JSON version, already in progress, will add support for inline comments). Inline newlines are only supported as \n\r and the like; there is no human-friendly indentation available except that which is likewise machine-friendly. These are in marked contrast to Turtle and RDFa which do support human-friendly content that is distinct from, yet intermingled with, their machine-friendly content.
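
To make the JSON comparison concrete, here is a minimal sketch using only Python's standard `json` module. The sample object is invented for illustration and does not come from the thread. It checks that a pretty-printed and a minified text decode to the same value, and that a strict parser rejects a text containing a comment.

```python
import json

# Hypothetical sample data, not taken from the thread.
pretty = """{
    "name": "Alice",
    "tags": ["a", "b"]
}"""

# Re-serialize without optional whitespace ("minify").
minified = json.dumps(json.loads(pretty), separators=(",", ":"))

# Both texts decode to the same Python value.
same_value = json.loads(pretty) == json.loads(minified)

# The two texts still differ byte-for-byte.
same_text = pretty == minified

# A comment makes the text invalid for this strict JSON parser.
try:
    json.loads('{"name": "Alice"} // a comment')
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False
```

Turtle differs in that a comment is syntactically valid, so a parser accepts it and a graph-only store then has nowhere to keep it.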

Again, I have no RFC, REC, nor other "standard" or "specification" citation to offer, in significant part because the standard-setting bodies working on relevant specs in which I've participated agreed with my assessment above -- that non-RDF content of RDF-encoding media types is just as important as the RDF content, and all should be preserved -- and acted and discussed other things with the preceding as foundational and self-evident. Had there been any disagreement, I would have made sure that explicit statements to that effect were included!

kjetilk commented 2 years ago

I understand the desire, but I do not understand that there has been a clear historical path here.

LDP defines:

Linked Data Platform RDF Source (LDP-RS) as "An LDPR whose state is fully represented in RDF, corresponding to an RDF graph." which in very clear terms does not include whitespace, comments, etc.

Then,

Linked Data Platform Non-RDF Source (LDP-NR) as "An LDPR whose state is not represented in RDF. For example, these can be binary or text documents that do not have useful RDF representations. "

The class of resources that you want is clearly not included in either of these classes. This is a class of resources that have a useful RDF representation, but where the representation also includes significant content beyond the RDF. If that understanding was present in the LDP WG, I don't think it was in any way adequately expressed in normative statements.

I believe that we can satisfy this by introducing such a class, and perhaps we should (I certainly can feel the pain around HTML+RDFa, and I'm sure @csarven feels it even more). I just don't think that can be a requirement right now, even if NSS could support that.

TallTed commented 2 years ago

For the umpteenth time: I was an active participant in the LDP WG.

I am reasonably certain that I understand what we meant, frustrated that we failed to communicate that, and more frustrated that my clear statements of our intended meaning (which I believe have yet to be disputed by anyone else who participated in the LDP WG, leaving me with no feeling of having misunderstood our intention) to clarify others' misunderstandings and misinterpretations are discounted and/or ignored.

Our simple error here lies in having left out one word from the LDP-NR definition: "fully".

We did not intend to have three types of LDPR, with two identified and defined explicitly as noted above and one identified and defined only implicitly by the gap between those two.

We intended to have two types of LDPR, as explicitly identified above, i.e., LDP-RS which are "fully represented in RDF", and LDP-NR which are "not [fully] represented in RDF". Nothing in the definition above says that an LDP-NR cannot include RDF as part of its payload, only that RDF is not all of its payload.

csarven commented 2 years ago

We've been through this material and discussion several times. I've been content to assume that Ted is sharing knowledge of LDP in good faith for some time now. Unless the editors or authors of the LDP spec or LDP WG members can clarify the language, I suggest we take that as is. Debating whether the language in LDP is meaningful or intuitive against Ted's explanation of what's intended is not particularly useful at this point. (I've been down that road. We don't all have to.)

Having said that, the Solid Protocol - or at least some of the original servers and clients - neither assumed nor required LDP compliance. (I know because I wrote a client that expected the servers to implement LDP but they didn't actually deliver despite the fuzzy claims.) NonRDFSource and RDFSource were not something adopted or deemed necessary for Solid. They seemed to create more concerns than practical benefit. From my perspective, this whole thing turned into moving the Solid Protocol forward without bringing the whole LDP baggage (YMMV). But we are not completely free since we do expect containers in Solid to behave like LDP(B)C. Perhaps we just need solid:Resource/Container (semi-serious suggestion).

TallTed commented 2 years ago

In early-ish days, Solid tied itself loosely to LDP, which was supposed to be a good thing because LDP servers were supposed to require minimal adjustment to also be Solid servers.

I am content to let Solid break from that loose binding IFF it doesn't lead to silently broken expectations of document preservation.

In my world, when I upload a document -- whether that be Turtle, HTML+RDFa, JPG, PNG, CERT, or otherwise; and whether or not the store to which I'm loading it knows how to extract information from that document -- to a document store of any kind, I expect that document to be retrievable in the same condition, with the same content, as when it was uploaded unless it was edited/changed by some process I explicitly either approved or initiated.

That includes when I write a Turtle or HTML+RDFa document to a Solid pod.
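
The expectation stated above can be written down as a round-trip check. The following is only a sketch: `VerbatimStore` is a hypothetical in-memory class, not part of any Solid server, and the Turtle payload is made up. It stores the bytes as received and verifies with a SHA-256 digest that the retrieved body is identical to the uploaded one.

```python
import hashlib


class VerbatimStore:
    """Hypothetical in-memory document store that never rewrites payloads."""

    def __init__(self):
        self._docs = {}

    def put(self, path: str, media_type: str, body: bytes) -> str:
        # Keep the bytes exactly as received, whatever the media type.
        self._docs[path] = (media_type, body)
        return hashlib.sha256(body).hexdigest()

    def get(self, path: str):
        # Return (media type, body) unchanged.
        return self._docs[path]


# Made-up Turtle with a comment and hand-aligned whitespace.
turtle = b"# a hand-written comment\n<#this>   a   <http://example.org/Thing> .\n"

store = VerbatimStore()
digest = store.put("/notes.ttl", "text/turtle", turtle)
media_type, body = store.get("/notes.ttl")

# True only if not a single byte changed between upload and retrieval.
round_trip_ok = hashlib.sha256(body).hexdigest() == digest
```

A server that parses the upload into a graph and re-serializes on GET would generally fail this check for any document containing comments or custom alignment.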

TallTed commented 2 years ago

"First, do no harm." Or more aptly, "First, lose no user data."

Also, there's the whole "distribution of effort" credo (which I can never bring to mind by the name others use for it) that says users should have to think the least, in order, after deployers, after programmers, after specifiers...

We (OpenLink) use a lot of Turtle in our line of business applications (like this document), among other things. If you take a look there, you'll find a fair amount of comment and whitespace content included to aid human comprehension and future text-editor-based revision.

# A rather excruciatingly detailed self-describing Turtle snippet
###
# First some PREFIX definitions, all right-aligned on the `:` ending the prefix
# The first one defines the most important prefix, that refers to this document, ":"
###
PREFIX      :   <#>
###
# Then some prefixes for commonly used ontologies, 
# all alpha-sorted ignoring whitespace, to avoid duplication 
# and to ease human parsing of this sample datafile
###
PREFIX  foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX  rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

###
# Then some RDF data
###
:this     a                  rdfs:Resource ;
          rdfs:description   """This text is to demonstrate the 
                                utility of inline whitespace
                                within Turtle data.""" ;
          rdfs:comment       "just a bit of text" .
###
# And then the document ends.
###

The more unfamiliar the realm of discourse, the more important such refinements are.

Now, users could describe one Turtle file with another, and annotate the first with line-number-based identifiers, instead of inline comments, but that puts a lot of work on the Solid user, which is not required by any other document repository ... but once your Solid server that's refusing to preserve Turtle documents in toto ingests the content of those two files, there is no guarantee that the line-number-based descriptions from the second will remain accurate for any Turtle file that is eventually output with the same RDF graph as the first, because there's no ORDER required by SPARQL, etc.
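
The line-number problem described above can be illustrated with a small sketch. Everything here is hypothetical: the two statements, the `example.org` predicate, and the `serialize` function, which stands in for whatever serializer a server might use. The point is only that a graph carries no line order, so an external annotation keyed on a line number can end up pointing at a different statement.

```python
# Hypothetical original document: one comment line plus two statements.
original_lines = [
    '# photo captions, edited by hand',
    '<#a> <http://example.org/caption> "Harbour" .',
    '<#b> <http://example.org/caption> "Sunset" .',
]

# The graph is an unordered set of triples; the comment is not part of it.
graph = {
    ('<#a>', '<http://example.org/caption>', '"Harbour"'),
    ('<#b>', '<http://example.org/caption>', '"Sunset"'),
}


def serialize(triples):
    """One possible serializer; sorting is its own choice, not a requirement."""
    return [f"{s} {p} {o} ." for s, p, o in sorted(triples)]


regenerated_lines = serialize(graph)

# An external annotation that referred to line 2 of the original...
annotated = original_lines[2 - 1]
# ...now lands on a different statement in the regenerated document.
now_at_line_2 = regenerated_lines[2 - 1]
```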

csarven commented 2 years ago

I'd suspect the WG captured some of the background in the mailing list, meeting minutes or the W3C issue tracker. I suppose we weren't compelled enough to dig into that - but that shouldn't stop anyone from doing that now.

LDP's constraints re "fully" are (unnecessarily complex) plumbing either way.

If we need a class to distinguish RDF stuff from non-RDF stuff, we can use solid:RDFDocument or solid:RDFSource (based off rdf11-concepts) :) The rest is just media types, multiple representations, graph comparisons.


Sure, that "loss of quality" is not about RDF and the request is specifically for a representation that's a lower quality (i.e., usually in one of the concrete RDF syntaxes other than RDFa). Shrug.

If we are not talking about RDF graph comparison, "loss of quality" can be expected when going from any concrete RDF syntax to another.

kjetilk commented 2 years ago

@TallTed I certainly acknowledge your contribution to LDP and Solid. It is just that expectations aren't much what implementors base their code on.

The difference, from an implementation standpoint given those assumptions, is large. For RDF sources, it is clearly legitimate to persist only the RDF graph, so you use technologies available for that, like a quad store, and serialize when people make requests. For non-RDF content, it is clearly legitimate to not persist the RDF graph, and so it creates a divide. For a third class, you'd have to do both, and so the specification needs to be abundantly clear about it.
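
The three persistence strategies contrasted above might be sketched as follows. This is an illustration under assumptions, not how any Solid server is implemented: the `kind` labels, the `Stored` record, and the injected `parse` callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Triple = tuple[str, str, str]


@dataclass
class Stored:
    graph: Optional[frozenset[Triple]]  # what a quad store would hold
    raw: Optional[bytes]                # what a file store would hold


def persist(body: bytes, kind: str,
            parse: Callable[[bytes], frozenset[Triple]]) -> Stored:
    """Persist a payload according to its (hypothetical) resource kind."""
    if kind == "rdf-source":
        # Graph only: comments and layout are gone after this point.
        return Stored(graph=parse(body), raw=None)
    if kind == "non-rdf-source":
        # Bytes only: nothing is extracted for querying.
        return Stored(graph=None, raw=body)
    if kind == "both":
        # The proposed third class: keep the document and the graph.
        return Stored(graph=parse(body), raw=body)
    raise ValueError(f"unknown kind: {kind}")
```

The third branch is the one that costs implementers extra work, since the stored bytes and the stored graph can drift apart once either is modified.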

I am already quite convinced that we need it, I'm not as hard to convince as @RubenVerborgh , but still, there is a piece of work that needs to be done.

TallTed commented 2 years ago

@RubenVerborgh @kjetilk -- Am I not, as a Solid user, able to store arbitrary documents in my pod, as well as application-generated data files? I know, everybody's first thoughts are about pictures and movies, but textual documents are also among the commonly shared sorts of thing, and there's no reason why I, as an early Solid adopter, might not hand-edit some Turtle about those pictures and movies, or about some stuff that's not now (and might soon or might never be) in my pod.... What you're telling me here is that I must not be a tinkerer, I must not hand-edit my Turtle to please me, because only the app authors are allowed to do this.

kjetilk commented 2 years ago

@TallTed you most certainly are able to store anything. As for tinkering with files, you can do that too with a Solid server that supports that, like NSS and CSS with certain backends. But you'd also have to be aware that if you're not editing through means provided by the protocol, some things get hard: you need to make sure that containers are updated if you create a file on the file system, that ETags and last-modified times are updated, that you protect the data from unauthorized access, etc. It doesn't exclude those kinds of assumptions, but it allows servers to assume that they persist the RDF graph.

kidehen commented 2 years ago

@RubenVerborgh @kjetilk -- Am I not, as a Solid user, able to store arbitrary documents in my pod, as well as application-generated data files?

In text/plain.

When did that become the case? You've always been able to store "text/turtle" by hand to a Solid Pod.

A Solid Pod provides a filesystem-like experience to Agents (users or machines).

I don't understand your response here, please clarify.

/cc @timbl @TallTed

TallTed commented 2 years ago

@RubenVerborgh -- I would be delighted to not need to keep harping on this issue, but where comments by you and others in other issues demonstrate the validity of the assertions I've made here, which have yet to be accepted here as valid by you and others, I think it is valid to say so there.

Your own sample JSON data incorporated comments which made that sample invalid as JSON, and when I flagged that, you noted that you could have used Turtle where your inline comments would have been valid. Are you now asserting that those comments were not important enough to retain? If so, I have to wonder why they were important enough to include in the first place ... and further, how other readers of your JSON snippet were meant to fully comprehend it, as the content of those comments was not replicated outside of the JSON snippet.

TallTed commented 2 years ago

You know very well that everyone on that page knows JSON does not do comments, and Turtle does.

You are making presumptions about potential readers with which I disagree. There is no certainty that "everyone on that page" knows any particular thing, including the disparate (lack of) support for comments in JSON and Turtle.

It is my feeling that you want to exile any comments about full preservation of documents which you don't think need to be fully preserved, and gloss over even your own comments elsewhere which support my position and undercut yours.

I would have thought it clear that I am not trying to sneak anything in anywhere. Rather, I am speaking in full view, where it seems relevant to me to do so.

I will also thank you not to deride my comments as mere "vent[ing] about #342", which suggests that there is no merit whatsoever to my position, which you have justified simply by the fact that my position is not yours.

If Solid is going to go forward without promising to preserve the full content of documents which Solid has decided are not worthy of full preservation, then I believe Solid will die a deservedly painful death, and has the potential to ruin many users' days along the way, as they discover the loss of their data. I would prefer that neither of these fates come to pass, and rather that Solid preserve the content of my, and their, documents.

TallTed commented 2 years ago

The most straightforward step is to warn the user when document content -- whether it's HTML within RDFa, or comments and/or whitespace within Turtle, or something else -- is going to be lost through that PATCH.

I believe there's been general agreement that HTML+RDFa documents should be preserved, and so they might not be PATCHable. I think the same logic should be applied to any other document formats a/k/a media types that (may) include both "in-band" RDF and "out-of-band" non-RDF, including but not limited to Turtle.

The requirement to update resources via PATCH could be adjusted to require the ability to update only those resources which do not lose other content (which may or may not be "near" the intended update) when so updated. In other words, PATCH might be required only for pure RDF content, such as JSON-LD documents, which cannot contain out-of-band, non-RDF data. Alternatively, PATCH might be required only for RDF which is stored in a triple/quad store, whether it arrived there by extraction from some impure, partially-RDF document or by ingestion of some pure, wholly-RDF document; the originating document could be retained, unchanged, while the RDF in the triple/quad store gets PATCHed.

My proposed requirement is not to "preserve comments" but to "preserve all document content other than that which is explicitly and intentionally and informedly altered or deleted, with user consent, by user action" which might take the form of a confirmation dialog regarding a Turtle document, e.g., "Your PATCH of these statements within the document will cause the following to be lost: {possibly line numbered, blah blah blah, etc.} Proceed? Y/N"
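
A confirmation step of the kind proposed above could look roughly like this. It is a deliberately naive sketch: `content_at_risk` and `confirm_patch` are hypothetical helper names, and the comment detection only looks at whole-line comments, so it would misfire inside multi-line string literals and would miss trailing comments and alignment whitespace.

```python
def content_at_risk(turtle_text: str):
    """Return (line number, text) for lines a graph-only rewrite would drop.

    Naive on purpose: any line whose first non-blank character is '#'
    is treated as a comment.
    """
    at_risk = []
    for number, line in enumerate(turtle_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            at_risk.append((number, line))
    return at_risk


def confirm_patch(turtle_text: str, user_agrees) -> bool:
    """Ask for consent only when something would be lost."""
    lost = content_at_risk(turtle_text)
    if not lost:
        return True
    listing = "\n".join(f"  line {n}: {text}" for n, text in lost)
    return bool(user_agrees(
        "Your PATCH will cause the following to be lost:\n"
        f"{listing}\nProceed? Y/N"
    ))


# Made-up document and a stand-in for the user's answer ("no").
doc = "# why this file exists\n<#this> a <http://example.org/Thing> .\n"
prompts = []
proceed = confirm_patch(doc, lambda msg: prompts.append(msg) or False)
```

Here `user_agrees` is any callable that shows the message and returns a truthy value for "yes", which keeps the sketch independent of a particular client UI.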

TallTed commented 2 years ago

In other words, PATCH might be required only for pure RDF content, such as JSON-LD documents, which cannot contain out-of-band, non-RDF data.

JSON-LD does have structure that is non-RDF data (framing/structure, whitespace, properties not mapped to RDF).

I believe the structure to which you refer is inherent to JSON[-LD], and that that structure may be transformed in any direction multiple times without loss of any information, such that any structure may be retrieved/regenerated from any other structure. I don't think JSON[-LD] permits arbitrary whitespace similar to that permitted by Turtle. If it does, then that JSON[-LD] whitespace should also be preserved.

I know that JSON includes JSON-LD (i.e., JSON-LD documents comprise a subclass of JSON documents); I do not think the reverse is so. In other words, if there are properties in a JSON document that do not map to RDF, that document is not properly treated as nor considered to be JSON-LD, even if it is named .jsonld and/or media typed as application/ld+json; it is JSON, application/json.

any other document formats a/k/a media types that (may) include both "in-band" RDF and "out-of-band" non-RDF

I think literally all of them do. How do we distinguish cases where preservation is required?

When in doubt, preserve.

"Your PATCH of these statements within the document will cause the following to be lost: {possibly line numbered, blah blah blah, etc.} Proceed? Y/N"

That is not an "end user"-level statement though; what can end users do with this?

End users receiving that alert may decide to use a different client tool and/or server to edit the document, such as one that does not use PATCH, perhaps by GETting the entire Turtle document for explicit text editing followed by a PUT.

They might also choose to sacrifice their comments, whitespace, etc.

Alternatively, PATCH might be required only for RDF which is stored in a triple/quad store

I don't think that makes sense in the context of the protocol, which does not know about the backend. Furthermore, apps like Databrowser make extensive use of PATCH; i.e., the use cases for automated modifications of RDF resources at the moment by far outnumber those for manual modifications.

Applications which are maintaining their own documents would not be restricted in such maintenance; those documents would almost certainly not have human-friendly whitespace, etc.

My main concern is with documents the user has chosen to store in their Solid Pod, where they should have no expectation of other tools messing with their documents' content.

That pertains to my use case question above; I'm looking for a statement on the level of "an end user wants to allow Alice to access a set of photos" (as opposed to "the end user should be able to edit an ACL document to insert a triple").

"A power-user wants to store a Turtle document alongside a set of photos, with manually edited Turtle descriptions of those photos. They want to preserve indentation and other semantically invisible but syntactically visible whitespace, such as column-aligned predicates and objects, to ease future edits of this Turtle document. They also want the content of this document to be ingested (but not maintained) by the gallery tool (which may not exist as of this writing), such that rdfs:comment values are displayed (but not editable) as photo captions. The gallery tool might ingest and then change the stored values; the user might choose to over-write the gallery tool's changes from their edited or unedited Turtle file at some point in the future, or they might choose to maintain the gallery's RDF distinct from the Turtle file."

TallTed commented 2 years ago

You're correct on JSON-LD with un-URI'd terms; my bad. That said, "ignore" is not the same as "delete" or "drop" or "blackhole". Which is, it seems to me, another argument in favor of what I'm arguing for: i.e., PRESERVE THE CONTENT YOU DON'T UNDERSTAND OR RECOGNIZE. (Remember that this is also how HTML-based web browsers "fail elegantly" with HTML tags they don't recognize -- not by deleting nor by not rendering the content within such unrecognized tags, just by treating those tags as if they weren't present.)

text/plain is not Turtle, nor JSON, nor JSON-LD. Some tools can be told to ignore the advertised media type and/or the filename extension, but many tools cannot. I do not think text/plain is an appropriate solution here.

I have long been arguing against allowing PATCH against HTML+RDFa documents unless the document around the triples/quads being patched is preserved -- which could result in disagreements between the human-facing HTML and the machine-facing RDF, which I'd be OK with accepting as a side effect of such action -- but the general consensus has seemed to be to treat HTML+RDFa as "out of bounds" for Solid, at least as far as PATCH is concerned.

If there is no such thing as a "document which is entirely represent[ed/able] in RDF", then there is no such thing as an ldp:RDFSource. However, I do believe that ldp:RDFSource documents do exist, and that those are acceptable as targets for Solid PATCH. These might include documents produced by Solid-based apps, and, again, I have no argument with these being PATCH targets.

All that said... I do not find it acceptable for Solid (nor any other service or server) to destroy data in my documents of any format or media type without at bare minimum telling me that's about to happen (e.g., "This document is about to go through a lossy transformation, from HTML+RDFa to N3, retaining only the RDF triples found in the original document"), and waiting for my approval to do so, before doing it. That's not going to change.

elf-pavlik commented 2 years ago

@TallTed would a dedicated auxiliary resource preserving the verbatim representation satisfy your requirements? I think one would just need to anticipate that the RDF in it can get stale, but at least that would provide some way to retain the formatted version that was provided at some point. There would still be a lot of problems to answer, but I thought we could try brainstorming a bit.

damooo commented 2 years ago

@elf-pavlik auxiliary resources can only be LDP-RSs at present, which have the same issues as non-auxiliary RDF resources in preserving comments, whitespace, etc. If they are to be saved as text/plain as mentioned by @RubenVerborgh, then we have to address #329 (Allow non-RDF resources as auxiliary resources).

rubensworks commented 2 years ago

As I understand it, most Solid end-users will likely never hand-edit RDF files, so I do think it makes sense to not preserve comments and whitespaces by default.

However, instead of using text/plain for marking that an RDF document should preserve such non-RDF data, what about introducing a profile? (e.g. http://www.w3.org/ns/solid/terms#preserveNonRdf)

This would allow applications to still see RDF files that need such non-RDF data preserved with their proper media type, but still be able to take into account the fact that such non-RDF data must be preserved, so that operations such as PATCH may have to be applied differently (or are impossible).


For example:

Content type: text/turtle

# This comment may not be preserved
:this a rdfs:Resource.

Content type: text/turtle;profile=http://www.w3.org/ns/solid/terms#preserveNonRdf

# This comment will be preserved
:this a rdfs:Resource.
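
On the server side, honouring such a profile might come down to a check like the one below. This is a sketch under assumptions: the profile IRI is only the one suggested in this comment, nothing in the thread establishes that `text/turtle` accepts a `profile` parameter, and the header parser is simplified (it assumes parameter values contain no `;`).

```python
# Profile IRI proposed in the thread; a suggestion, not a specified value.
PRESERVE_PROFILE = "http://www.w3.org/ns/solid/terms#preserveNonRdf"


def parse_content_type(header: str):
    """Split a Content-Type value into (media type, parameters)."""
    media_type, *raw_params = [part.strip() for part in header.split(";")]
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        # Parameter names are case-insensitive; values may be quoted.
        params[name.strip().lower()] = value.strip().strip('"')
    return media_type.lower(), params


def must_preserve_verbatim(header: str) -> bool:
    """True if the request asks for the Turtle document to be kept as-is."""
    media_type, params = parse_content_type(header)
    return media_type == "text/turtle" and params.get("profile") == PRESERVE_PROFILE
```

A server could then route such requests to byte-level storage and refuse or adapt graph-level PATCH for those resources.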

damooo commented 2 years ago

There may be reasons to preserve comments, even if users don't hand-edit them. Suppose we want to store some Turtle config files in a Solid pod and retrieve them back, as one can do in Google Drive, etc. We will lose important information.

Thus we can't use solid to just store and retrieve arbitrary folder of source code with turtle configuration files, and edit them as code, with valuable comments. There are times when files are used not just as backend to store information, but as files themselves.

Using a profile seems a better way forward.

damooo commented 2 years ago

Thus we can't use solid to just store and retrieve arbitrary folder of source code with turtle configuration files, and edit them as code, with valuable comments. There are times when files are used not just as backend to store information, but as files themselves.

And in cases like these, where we use ttl files as part of source code or config files and persist them in Solid with the same dc:format, we may have to think about whether these files should contribute to the resultant knowledge graph of a pod.

Thus we may(?) have to distinguish between those RDF resources which contribute to the knowledge graph(-store), and resources which may have a MIME type of turtle|json-ld|etc. but are just opaque documents that don't contribute to the knowledge graph(-store). Source-code files are examples of the second kind. The intention of storing them in Solid is just to host them, not to add to the knowledge graph of the pod.

As mentioned by @csarven, usage of two different rdf:types, solid:RDFDocument and solid:RDFSource, may distinguish these two cases.

damooo commented 2 years ago

Just to think aloud: if we distinguish resources based on their essence and on whether or not they contribute to the resultant shared knowledge graph of a pod, irrespective of their MIME type, it may resolve the ambiguities.

The first kind can be RDFSources (near to LDP-RS), which contribute to the knowledge graph. Their essence is just knowledge, and the representations used in transit for them are just manifestations of that knowledge. These resources are RDF-patchable with whatever patch format the Solid spec recommends.

The second kind are Non-RDF-Sources (near to LDP-NR), which can now have any MIME type, including ttl, but they don't contribute to the resultant knowledge graph. Their essence is their bytes themselves. Lossy conneg can be allowed. They are not patchable through knowledge-patching mechanisms; they may be patchable by custom standards of file-diff/json-diff/etc. if required.

elf-pavlik commented 2 years ago

However, instead of using text/plain for marking that an RDF document should preserve such non-RDF data, what about introducing a profile? (e.g. http://www.w3.org/ns/solid/terms#preserveNonRdf)

This direction, seems to me, has great potential to explore further. Few questions that come to my mind:

rubensworks commented 2 years ago

I like the profile direction in general; perhaps we should make it more generic and have something like http://www.w3.org/ns/solid/terms#preserveSyntax.

Based on @elf-pavlik's earlier comment, http://www.w3.org/ns/solid/terms#verbatim might also be an option. But perhaps a bit too generic for a profile.

TallTed commented 2 years ago

I think requiring users to set a profile on a media type, or to force them to set text/plain rather than text/turtle, imposes a substantial burden on them, especially if they only have to do this if/when they store a manually edited Turtle file, and if they forget to set this special profile, they'll lose their content -- and now it's deemed to be their fault, because of course they should know all the esoteric internal workings of the Solid filestore vs every other filestore they work with.

As to lossiness...

A lossy GET (such as where the target resource is a RAW image and conneg is for JPG, which is always a lossy transformation, commonly handled by a webserver helper like ImageMagick) is perfectly fine -- because the requester is typically also provided with headers that can lead them to re-GET the original untransformed media type, and in any case the original (e.g., high-res RAW) is preserved for any future activities.

A lossy PUT is not typically acceptable nor permitted.

A lossy POST may be acceptable and is generally permissible, IFF the user is fully informed that their original (e.g., high-res RAW) will not be stored at the target location; rather, a lossy transformation result (e.g., JPG) will -- so they'll need to preserve the original elsewhere, if retaining it is desired.
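
The three rules above can be summarized as a small decision function. This is a sketch, not a proposal for normative text: the `Decision` enum and `lossy_policy` are hypothetical names, and two branches are assumptions that go beyond what the comment states (rejecting a lossy GET when the original is not kept, and rejecting lossy handling for any other method).

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REJECT = "reject"
    ASK_USER = "ask user"


def lossy_policy(method: str, original_kept: bool, user_informed: bool) -> Decision:
    """Decide whether a lossy transformation is acceptable for a request."""
    method = method.upper()
    if method == "GET":
        # Fine while the untransformed original stays available.
        # Assumption: otherwise reject.
        return Decision.ALLOW if original_kept else Decision.REJECT
    if method == "PUT":
        # A lossy PUT is not acceptable.
        return Decision.REJECT
    if method == "POST":
        # Acceptable only with informed consent; otherwise ask first.
        return Decision.ALLOW if user_informed else Decision.ASK_USER
    # Assumption: when in doubt, preserve.
    return Decision.REJECT
```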

To a fair extent, these issues flow from Solid's declared usability as both an application substrate and a human-accessible filestore. Drop the latter, and many things become permitted that I believe are not acceptable and therefore should not be permitted if the latter role is to be maintained.

elf-pavlik commented 2 years ago

I think requiring users to set a profile on a media type, or to force them to set text/plain rather than text/turtle, imposes a substantial burden on them, especially if they only have to do this if/when they store a manually edited Turtle file, and if they forget to set this special profile, they'll lose their content -- and now it's deemed to be their fault, because of course they should know all the esoteric internal workings of the Solid filestore vs every other filestore they work with.

I think if someone uses some low-level client directly we can expect them to deal with profiles. If someone uses a higher-level application, it's up to the developers of that application to implement support for the discussed profile or not. Any user can always choose an application that provides the features they need.

"Your PATCH of these statements within the document will cause the following to be lost: {possibly line numbered, blah blah blah, etc.} Proceed? Y/N"

That is not an "end user"-level statement though; what can end users do with this?

End users receiving that alert may decide to use a different client tool and/or server to edit the document, such as one that does not use PATCH, perhaps by GETting the entire Turtle document for explicit text editing followed by a PUT.

I find it ok if some application fails the user once - where it just loses some formatting/comments. When that happens users can look for an application that preserves them. From the spec perspective it seems reasonable to provide a straightforward way for applications to offer such a feature.

Every user with appropriate access to update the resource can choose their preferred application. It may be worth dedicating some time to think about supporting version control, so that a given storage can preserve the full history of every verbatim representation; anything else will probably always lead to losing some information.

TallTed commented 2 years ago

The current status is, and always has been: RDF documents can be freely patched.

The Solid definition of "RDF documents" is vital here, and apparently must be revisited in all related discussions, because it continues to be a moving target, to my eyes.

The ldp:RDFSource definition of "LDPR whose state is fully represented in RDF, corresponding to an RDF graph" is clear to me. I believe the LDP WG which produced it, in which I was an active participant, shared my understanding of its meaning. It does not include Turtle with comments and/or whitespace formatting, nor HTML+RDFa, nor any other document no matter how much RDF is encoded therein which is not fully represented in that RDF; these are all ldp:NonRDFSource even though their content includes RDF.

The current Solid definition of "RDF documents" as used here is becoming clear to me as "any document that includes RDF, where the Solid team considers the non-RDF content to be unimportant". That will lead me to minimize my use of Solid servers, no matter how much I agree with the base premises that the Web should be Writeable as well as Readable, and that my data is mine and should be under my control.

Solid ... only promises to be a Linked Data store.

That is not the message that was communicated to me by any interview with @Timbl nor any early conversations about Solid's promises, regardless of what was actually implemented in early Solid servers (of which I have understood Node Solid Server to be the primary "reference implementation" focus of 0.9 spec development, the result of which I have understood to be not a forward-looking prescriptive spec, but a backward-looking descriptive spec, with some vague hopes of shifting to a forward-looking prescriptive spec for 1.0 if not later).

So.

Optimizing for the 80% of use cases, as you characterize them, is not unreasonable IFF the effect of such optimization is clearly communicated, and that includes warning users who try to store something whose content is not going to be wholly preserved, such as Turtle with comments or with long sequences of whitespace (the sniffing out of which I would think more complicated than simply retaining the original document, but what do I know?).
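As a rough illustration of what that sniffing involves, here is a sketch of a scanner that finds `#` comments in Turtle text. It is an assumption-laden simplification, not a full Turtle parser: `find_turtle_comments` is a hypothetical helper, it only skips IRI references, quoted literals, and backslash escapes so that a `#` inside them is not misread, and it does not look at whitespace runs at all.

```python
from __future__ import annotations


def find_turtle_comments(text: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) for each '#' comment found."""
    comments: list[tuple[int, str]] = []
    i, line, n = 0, 1, len(text)
    while i < n:
        ch = text[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch == "#":
            # A comment runs to the end of the line.
            end = text.find("\n", i)
            end = n if end == -1 else end
            comments.append((line, text[i:end]))
            i = end
        elif ch == "<":
            # IRI reference: a '#' inside <...> is a fragment, not a comment.
            end = text.find(">", i)
            i = n if end == -1 else end + 1
        elif ch in "\"'":
            # Short or long (triple-quoted) string literal.
            quote = ch * 3 if text.startswith(ch * 3, i) else ch
            i += len(quote)
            while i < n and not text.startswith(quote, i):
                if text[i] == "\\":
                    i += 1  # skip the escaped character
                elif text[i] == "\n":
                    line += 1
                i += 1
            i += len(quote)
        elif ch == "\\":
            i += 2  # escaped character in a prefixed name, e.g. ex:a\#b
        else:
            i += 1
    return comments


doc = (
    "@prefix ex: <http://example.org/ns#> .\n"
    "# who says what\n"
    'ex:a ex:says "not # a comment" . # trailing note\n'
)
found = find_turtle_comments(doc)
```

Even this reduced version has to track three lexical contexts, which supports the point that retaining the original bytes is the simpler option.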

Now, you've characterized "retaining the entirety of Turtle [and HTML-RDFa, and other ldp:NonRDFSource] documents" as a burden on 80% of use cases (or did you mean users?), with which I do not agree. You're also comparing the burden on some percentage of Solid server operators (who will, I admit, need to allocate some additional storage because of this requirement) and/or developers to the burden on some percentage of Solid end-users (or perhaps Solid app developers). Rule of thumb is that the burden on those Server and/or App (and/or spec) developers should be allowed to increase if that increase lowers the burden on the end-users of those tools.

TallTed commented 2 years ago

The media type of text/turtle includes comments, whitespace, etc., as part of the media. You continue to argue as if these syntactically acceptable and humanistically important elements are valueless to those humans.

Do you not comment your code, where the language allows for such? Do you expect those comments to be retained or dropped, when you store the code somewhere?

I keep raising HTML+RDFa because it is a parallel use case. Yes, HTML+RDFa starts with HTML and adds RDF, and Turtle starts with RDF and adds comments/formatting -- but the result in both cases is a document with both human-targeted and machine-targeted content.

It's a different situation if the RDF-containing media type does not support comments or machine-invisible, human-visible formatting (a/k/a whitespace).

TallTed commented 2 years ago

Comments (and indents) in C are not compiled, but they are retained in the C language documents! Just as I want my comments and indents in Turtle to be retained in my Turtle documents!

How can this be so hard to understand?

But look, we'll need to agree to disagree on this one. I'll argue that removing comments from Turtle is the same as removing whitespace from JSON, because I write for machines, and you will argue the opposite, because you write for power users. And for lack of written evidence, neither of us are going to change our position. So any attempts here are futile.

Your decision loses data.

This is, or should be, an absolutely unacceptable non-starter for any system that presents itself as usable as a document store, regardless of the flavor of the documents it stores.

@Timbl -- What say you? Do you not want and expect your inline comments to be preserved in your Turtle documents, unless you do explicitly consent to the serialized, materialized Turtle being transformed into abstract RDF and/or loaded into an RDF store, thereby dropping your comments and any whitespace formatting into the bitbucket?

If this is just "the default" way of NSS (and whatever other Solid Server implementations), then there should be a switch somewhere -- and I don't much care where, except that it should be adjustable by the user, not only by the SS admin who is just as likely to say "this is the default and we've always done it this way and we're not changing anything for you", and it should be trivially accessible, minimally brought to the user's attention upon their attempt to store a document from which data will be lost if it is stored with the settings in that position. The default on a new instance should be to preserve data -- i.e., to preserve comments, indents, etc. -- and I think that if this is too difficult to make the default on upgrade instances (i.e., making the setting for existing users to not preserve data), then those existing users will just have to get used to the new behavior -- because discarding data without explicit user consent is not and should not be acceptable.
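A sketch of the kind of switch meant here, under stated assumptions: `StorageSettings`, `store_document`, and `DataLossNotConfirmed` are hypothetical names, and `normalize` is a caller-supplied stand-in for whatever parse-and-reserialize step a server applies. The only points being illustrated are that the default preserves the document and that any lossy path requires explicit consent.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


class DataLossNotConfirmed(Exception):
    """Raised when storing would drop content and the user has not consented."""


@dataclass
class StorageSettings:
    # Default on a new instance: keep the document exactly as submitted.
    preserve_verbatim: bool = True


def store_document(
    body: str,
    settings: StorageSettings,
    normalize: Callable[[str], str],
    user_consents: bool = False,
) -> str:
    """Return the text that would actually be stored."""
    if settings.preserve_verbatim:
        return body
    normalized = normalize(body)
    if normalized != body and not user_consents:
        # Bring the loss to the user's attention instead of discarding silently.
        raise DataLossNotConfirmed("stored form would differ from the submitted document")
    return normalized


def drop_comment_lines(text: str) -> str:
    # Naive stand-in for a lossy normalization: removes whole-line comments only.
    return "".join(
        ln for ln in text.splitlines(keepends=True) if not ln.lstrip().startswith("#")
    )


doc = "# my note\nex:a ex:b ex:c .\n"
kept = store_document(doc, StorageSettings(), drop_comment_lines)
```

With the default settings the comment survives; flipping `preserve_verbatim` to `False` makes the same call fail unless `user_consents=True` is passed.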