URI construction for DMLex fragments

oasis-tcs / lexidma

OASIS Lexicographic Infrastructure Data Model and API (LEXIDMA) TC: A repository designed for use in development of TC chartered work products and test suites. https://github.com/oasis-tcs/lexidma


URI construction for DMLex fragments #111

Closed vojtech-kovar closed 3 months ago

jmccrae commented 5 months ago

Is this related to #97?

vojtech-kovar commented 5 months ago

Is this related to #97?

Yes -- sorry for not mentioning that before, and thanks for volunteering to review :) We had a discussion about that at the meeting today after you left, and there will be some changes -- so maybe wait with the review until I have implemented the changes (tomorrow or Monday, I hope).

vojtech-kovar commented 5 months ago

notes from today's meeting:

we want to do IRIs, not URIs
change the structure so that "Optional roots" is 3.1, Fragment identification is 3.2, Fragment URIs is 3.2.1, lexicographicResource is 3.3
reformulate/hedge the authoritativeness of the instructions, say something like it is "recommended for dictionaries living on-line" and "recommended method for inter-operability"
point to here from the linking module (again, not in any authoritative way, rather as a recommendation and for reader's convenience)

feel free to add if I forgot anything

jmccrae commented 5 months ago

I have some doubts about this scheme:

Some elements can be assigned ambiguous empty IDs: collocateMarker and etymology both have one optional unique property, that may be missing, so in this case their identity translates to an empty string.
Some IDs will be very long: definition has only text as its unique property, this may lead to a very long identifier as definitions can be quite substantial in a dictionary
listingOrder is not used as a property, but would be the obvious choice for many elements. e.g. currently we would have something like http://www.example.com/lexicon/entry/cat~1~noun/sense/small+furry+animal and this could be simplified to http://www.example.com/lexicon/entry/cat~1~noun/sense/1
The order of elements with multiple properties is not clear, e.g., should it be cat~1~noun or cat~noun~1
It should be noted that the second case (single unique property, arity=1, value is an object) actually does not occur in the spec
We should give the result of applying this schema along with each element definition
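A minimal sketch of the content-based scheme under discussion, to make the ordering question concrete. The property order (`headword`, `homographNumber`, `partOfSpeech`), the `~` separator and the helper names are assumptions for illustration; at this point in the thread the order had not been fixed.

```python
from urllib.parse import quote

# Hypothetical fixed order of UNIQUE properties per object type (an assumption).
UNIQUE_PROPERTY_ORDER = {
    "entry": ["headword", "homographNumber", "partOfSpeech"],
    "sense": ["definition"],
}

def fragment_id(object_type, properties):
    """Join the percent-encoded UNIQUE property values in the declared order."""
    order = UNIQUE_PROPERTY_ORDER[object_type]
    parts = [quote(str(properties[name]), safe="") for name in order if name in properties]
    return "~".join(parts)

def object_path(*steps):
    """Build a path such as entry/<id>/sense/<id> from (type, properties) pairs."""
    return "/".join(f"{obj_type}/{fragment_id(obj_type, props)}" for obj_type, props in steps)

# The dict order below is deliberately different from the declared order.
entry = {"partOfSpeech": "noun", "headword": "cat", "homographNumber": 1}
sense = {"definition": "small furry animal"}
print(object_path(("entry", entry), ("sense", sense)))
```

Without a declared order, the same entry could equally yield `cat~noun~1`, which is the ambiguity raised above.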

vojtech-kovar commented 5 months ago

Thanks for the notes, let me add my thoughts:

Some elements can be assigned ambiguous empty IDs: collocateMarker and etymology both have one optional unique property, that may be missing, so in this case their identity translates to an empty string.

Not sure if I understand correctly: Do you mean e.g. two different etymology objects under one entry, both with missing description? According to my understanding of UNIQUEness, this should not be allowed -- because once two objects at the same level lack a UNIQUE identifier, it is no longer UNIQUE, and the objects cannot be distinguished by this property. (NB there is the same situation with sense, both UNIQUE properties are also OPTIONAL.) If a property is marked both UNIQUE and OPTIONAL, I understood it's because we want to allow a single etymology (or sense) without description (or definition) under each entry, not multiple. Am I reading it wrong?

It could anyway be stated more explicitly in the description of UNIQUEness.

Some IDs will be very long: definition has only text as its unique property, this may lead to a very long identifier as definitions can be quite substantial in a dictionary

Yes, that's right -- I asked about it and we discussed this at the meeting after you left, and even considered the option of some hashing, but we agreed we prefer readability and transparency to compression.

listingOrder is not used as a property, but would be the obvious choice for many elements. e.g. currently we would have something like http://www.example.com/lexicon/entry/cat~1~noun/sense/small+furry+animal and this could be simplified to http://www.example.com/lexicon/entry/cat~1~noun/sense/1

I am against using listingOrder -- you are right it would be easy to use (and short), but if you use it as a link and then the listing order changes without changing the link (which can happen anytime if the resource is not frozen), the link will still work (i.e. nobody will notice anything, everything will be valid etc.) but it will point to a wrong object. I think we want to avoid that.
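The failure mode described here can be sketched as follows. The data, the stored link values and the two resolver functions are hypothetical; the point is only that a position-based link keeps resolving after a reorder, while pointing at a different object.

```python
# Two senses of one entry (illustrative data, not from the spec).
senses = [
    {"listingOrder": 1, "definition": "small furry animal"},
    {"listingOrder": 2, "definition": "jazz enthusiast"},
]

def resolve_by_order(senses, n):
    """Resolve a link that stores only the listing position."""
    return next((s for s in senses if s["listingOrder"] == n), None)

def resolve_by_definition(senses, text):
    """Resolve a link that stores the UNIQUE definition text."""
    return next((s for s in senses if s["definition"] == text), None)

# Links created before the edit, both meant to point at the animal sense.
link_by_order = 1
link_by_text = "small furry animal"

# An editor swaps the two senses; the stored links are not updated.
senses[0]["listingOrder"], senses[1]["listingOrder"] = 2, 1

wrong_target = resolve_by_order(senses, link_by_order)
right_target = resolve_by_definition(senses, link_by_text)
print(wrong_target["definition"], "|", right_target["definition"])
```

Both lookups succeed and nothing fails validation, but only the content-based link still reaches the intended sense.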

The order of elements with multiple properties is not clear, e.g., should it be cat~1~noun or cat~noun~1

That's right, thanks for spotting -- I will state that explicitly.

It should be noted that the second case (single unique property, arity=1, value is an object) actually does not occur in the spec

We should give the result of applying this schema along with each element definition

I can do that, too, I just didn't want this feature to be over-presented (maybe it's not that important :) ) -- what do others think?

vojtech-kovar commented 5 months ago

I have now implemented the changes we agreed on, please review if you can :)

jmccrae commented 5 months ago

I understood it's because we want to allow a single etymology (or sense) without description (or definition) under each entry, not multiple. Am I reading it wrong?

In fact, it is possible to have multiple etymologies without description under the same entry, this is the problem.

jmccrae commented 5 months ago

Another issue is that the fields are not identified, so in some cases the identifier may be ambiguous

<entry>
  <headword>foo</headword>
  <sense>
    <indicator>x</indicator>
  </sense>
  <sense>
    <definition>x</definition>
  </sense>
</entry>

Both resolve to http://www.example.com/lexicographicResource/entry/foo/sense/x
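A small sketch of why the two senses collide, assuming the identifier is formed by joining whichever UNIQUE values happen to be present. The function names are hypothetical, and the positional variant is only one possible way to disambiguate, not necessarily what the PR adopted.

```python
from urllib.parse import quote

SENSE_UNIQUE_PROPERTIES = ("indicator", "definition")

def naive_sense_id(sense):
    """Join only the values that are present; the field names are lost."""
    values = [sense[name] for name in SENSE_UNIQUE_PROPERTIES if name in sense]
    return "~".join(quote(v, safe="") for v in values)

def positional_sense_id(sense):
    """Keep an empty slot for each missing property so positions stay meaningful."""
    return "~".join(quote(sense.get(name, ""), safe="") for name in SENSE_UNIQUE_PROPERTIES)

sense_a = {"indicator": "x"}
sense_b = {"definition": "x"}

print(naive_sense_id(sense_a), naive_sense_id(sense_b))            # identical identifiers
print(positional_sense_id(sense_a), positional_sense_id(sense_b))  # distinct identifiers
```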

jmccrae commented 5 months ago

Comment on empty specifiers should be added before acceptance of this PR

jmccrae commented 5 months ago

A couple more potentially ambiguous results.

<entry homographNumber="0">
  <headword>test</headword>
</entry>
<entry>
  <headword>test</headword>
  <definition>0</definition>
</entry>

<pronunciation soundFile="x"/>
<pronunciation>
  <transcription>x</transcription>
</pronunciation>

I checked the others :)

jmccrae commented 5 months ago

One further comment, not even sure if this is a bug, but it is not possible to construct a fragment identifier for member as there are no unique properties for relation

vojtech-kovar commented 4 months ago

One further comment, not even sure if this is a bug, but it is not possible to construct a fragment identifier for member as there are no unique properties for relation

Yes, that's correct -- the procedure cannot live without the UNIQUE identifiers. I tried to say it by the following sentence:

DMLex does not define the structure of IRIs for object types without UNIQUE properties.

should I add anything to it?

jmccrae commented 4 months ago

The uniqueness issues seem to be fixed. Although we still need a resolution to #123 for senses.
I find the choice of which elements can be addressed to be rather arbitrary and I cannot see how this fits with any use cases (e.g., why not relation?)
I am against using listingOrder -- you are right it would be easy to use (and short), but if you use it as a link and then the listing order changes without changing the link (which can happen anytime if the resource is not frozen), the link will still work (i.e. nobody will notice anything, everything will be valid etc.) but it will point to a wrong object. I think we want to avoid that.

I see this problem with listingOrder, but currently we also change the URI every time a unique element (e.g., definition) changes, and this is tricky to implement in a dynamic web application use case. We could allow listingOrder to be used in an XPath-like syntax so we could refer to a sense as

EITHER
http://www.example.com/lexicon/entry/abandon~0~verb/sense/0~/to%20suddenly%20leave%20a%20place%20or%20a%20person/
OR
http://www.example.com/lexicon/entry[1]/sense[1]

Are we concerned about the URL maximum length (2048 bytes)? It seems very easy to reach with this very verbose URL scheme
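A rough way to check the length concern. The 2048 figure is the one quoted above and is treated here as an assumption, since actual limits vary between clients and servers; the base URL and the helper name are illustrative.

```python
from urllib.parse import quote

MAX_URL_LENGTH = 2048  # figure quoted in the comment above; real limits vary
BASE = "http://www.example.com/lexicon"

def encoded_length(base, *segments):
    """Length in bytes of the URL after percent-encoding each path segment."""
    path = "/".join(quote(segment, safe="~") for segment in segments)
    url = base.rstrip("/") + "/" + path
    return len(url.encode("utf-8"))

short = encoded_length(BASE, "entry", "cat~1~noun", "sense", "small furry animal")

# Every non-ASCII character expands when percent-encoded ("é" becomes "%C3%A9"),
# so a 400-character definition can already exceed the quoted limit.
long_definition = "é" * 400
long = encoded_length(BASE, "entry", "abandon~0~verb", "sense", long_definition)

print(short <= MAX_URL_LENGTH, long <= MAX_URL_LENGTH)
```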

jmccrae commented 4 months ago

I have thought about this over the weekend and I see four key issues with the proposal as it stands

1. It doesn't satisfy some use cases: There are probably three applicable use cases here. The first is to support editing environments using a dynamic server, where each server should implement the URL scheme as defined here. However, most existing such interfaces I can find (e.g., https://en-word.net/, https://en.wiktionary.org) seem to use a mix of fragments and paths to identify content. I cannot find a single editor that provides a unique page for elements such as senses and definitions, which is implied by the @vojtech-kovar and @mjakubicek proposal. However, my fix does not solve this either, as forcing the lexicon to be edited on a single page is clearly not viable. The proposal of @vojtech-kovar and @mjakubicek also seems incompatible with the use cases of static hosting and exchange, and with the use of conversion tools.
2. It does not improve interoperability: The goal of this PR is to provide a "method for addressing DMLex objects present on-line" in order to improve "general interoperability". The overall goal of this standardisation is to help producers and consumers to work through a standard model. This PR requires data producers to adopt a particular IRI scheme; however, there is no clear idea of what these IRIs should resolve to under HTTP. As such, the usefulness for consumers is not clear. In other words, we are building an addressing system without knowing what is at these addresses! This puts a burden on producers without providing instructions that are helpful to consumers.
3. The identifiers are unstable: As discussed with @vojtech-kovar in the argument against using listingOrder, changes to the data change the identifier, and so the identifiers are unstable. This is a general problem with this scheme. For example, a minor change to a definition would require updating the parent element's ID (sense), siblings' IDs (example) and all incoming links. I think this is technically very challenging (in conflict with @michmech's vision of the model) and can probably only be implemented by search/replace/hope or by using another internal identifier scheme (in which case what is the point of this?).
4. Identifiers are long and involve ugly tradeoffs: The identifiers this scheme proposes are very long and this may lead to technical issues. We also have to make some ugly tradeoffs to avoid ambiguity, for example including a homograph number in every identifier (word~0~noun) even when not needed and adding 0~ in many places. Aesthetics are not a showstopper, but they will certainly limit the adoption of this model.
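The instability point can be illustrated with a short sketch. The path layout and helper name are assumptions based on the examples in this thread; a one-character edit to a definition changes the identifier of the sense and of everything addressed beneath it.

```python
from urllib.parse import quote

def example_path(entry_id, sense, example_text):
    """Content-based path to an example nested under a sense (illustrative layout)."""
    sense_id = quote(sense["definition"], safe="")
    example_id = quote(example_text, safe="")
    return f"entry/{entry_id}/sense/{sense_id}/example/{example_id}"

sense = {"definition": "small furry animal", "examples": ["The cat sat."]}

before = example_path("cat~1~noun", sense, sense["examples"][0])

# A minor editorial change: one comma is added to the definition.
sense["definition"] = "small, furry animal"
after = example_path("cat~1~noun", sense, sense["examples"][0])

print(before)
print(after)
```

Any link stored as `before` no longer matches after the edit, so incoming links have to be found and rewritten, or they break.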

As solutions, I see the following approaches

1. Ignore the problems: This scheme is marked as not required so we could just accept and move on and let the implementers handle any problems.
2. Try to fix the issues: Issue 1 seems quite thorny as the position of the # symbol in a URI is important technically and I am not sure how we define enough modelling to allow implementers to put it anywhere. An alternative would be to not define full IRIs, but make this something like XPath for DMLex, but not tied to the XML serialization. However, even if we solve Issue 1, Issue 2 seems fairly intractable and I don't think we avoid Issues 3 and 4 without a radical redesign.
3. Don't reinvent the wheel: We could instead just say that all identifiers are user-specified and documented in the data model. The advantages of this are:
- It is already mostly implemented for sense, entry and collocateMarker
- It solves the issues above
- It conforms well with the xml:id in XML serialization
- It conforms even better with the RDF serialization and avoids blank nodes
- It is simple for us and implementers

mjakubicek commented 4 months ago

This discussion gets repetitive so let me just summarize why most of the objections are either false or largely missing the point of this PR.

First of all it needs to be emphasized that the specification is very clear about the fact that it describes an addressing mechanism on the model level and then there are serialization-specific addressing mechanisms which anyone is free to use (this would be e.g. XPath/XQuery for XML).

This answers Objection number 1, because if we are talking about static hosting of data files, those files are necessarily serialized in some format, and then a serialization-specific addressing mechanism should be used. It is therefore false that this use case is not supported; on the contrary, there are a number of options to choose from, all depending on a particular serialization format. It should be emphasized too that coercing model-level descriptions towards a particular serialization is a malpractice to be avoided.

The Objection number 2 says "PR requires data producers to adopt a particular IRI scheme" which is not true (it is optional), and generally completely ignores the primary motivation behind a model-level addressing mechanism, i.e. being able to address without the restrictions of any particular serialization method. This objection for reasons not explained instead keeps talking about a request-response processing mechanism, which again, is not the primary motivation behind the addressing, and can be easily done using any serialization-specific addressing mechanisms. Again, the primary motivation of the model-level addressing is to point to a particular DMLex object in serialization unspecific way; not defining a request-response round-trip.

The issues described in Objection number 3 were also discussed multiple times and they are not very relevant to this PR. All this is intentional and in line with best practices in lexicography as well as data maintenance, to prevent unintentional data degradation. The principles of DMLex are to remove processing complexity where it is not necessary, not where we would arbitrarily wish to do so. The fact that many tools currently do not exercise these integrity checks suggests that it is all the more important to promote them in the standard.

Objection number 4 is true but it is important to realize that the links are not meant to be human-processed, or human-presented in the full form. They would be machine-processed and visualized in implementation-specific ways that will suit the user/device/situation context. So yes, the links could sometimes be long and ugly, but also in many cases rather short and easy to interpret.

To sum up, I find all the objections completely invalid and do not understand the motivation behind bringing them again and again without any reasonable justification.

jmccrae commented 4 months ago

You are asking why this is important, so I will try to reiterate this:

I know that identifiers are defined at the model-level, which is an abstract level. Abstract models need to be instantiated, and my argument is that this proposal seems impossible to instantiate on any real-world, serialized data (all data is serialized somehow in the real world). You claim that serializations should use different mechanisms such as XPath, are you implying that this is a proposal with no real-world (i.e., serialized) applicability?
This proposal defines IRIs, which identify resources. Resources can be webpages or XML documents and usually are, but they can also be abstract concepts. For representing references to abstract concepts, the Resource Description Framework was invented. As an RDF expert, I have concerns about this proposal. In particular, addressing objects in a serialization-unspecific way usually requires methods like content negotiation to be implementable.
I have implemented this proposal and it took nearly 800 lines of code. I had to change the proposal in a way you find unacceptable to make this work (in order to obtain valid relative URIs in RDF serialization). It is not simple and the implementation discovered several other bugs (#116, #122, #123 and several documented on this PR). There is also a very simple alternative proposal (Solution 3).

vojtech-kovar commented 4 months ago

In the beginning of all this we wanted recommended addresses for all DMLex objects, based on the data (and namely the values of the UNIQUE properties), not arbitrary IDs, nor a particular serialization. It was all about (and only about) suggesting unique identifiers, not prescribing how they should behave if used in HTTP requests or in any other particular scenario. I get it now that @jmccrae does not like this very principle (to put it mildly), on the other hand we agreed we will do it in a meeting with all of us present, so I took it as agreed.

It was crystal clear from the very beginning that it is not possible to devise a method of addressing that will guarantee that all the possible use cases will work out of the box. I am pretty sure that we cannot even predict any substantial part of the possible use case scenarios, we can just bring some arbitrary examples.

But now we are (John is) bringing one arbitrary use case after another and arguing it does not work out of the box for them. Well, it doesn't. It is not possible to satisfy everyone. (And I don't like trying to satisfy all the use cases we can think of, especially by complicating the DMLex model itself, like we did at the last meeting with the new property deciding if '/' or '#' is used. None of the use cases, nor the whole addressing itself, is so important that it would be worth making the model more complex.)

So, instead of fiddling with arbitrary use cases, I think we should answer the main question: "Do we want a model-level mechanism as described in the first paragraph, even though it does not satisfy all the use-cases perfectly?" Do we?

I think the model-level addressing brings a choice: either use this, even if it requires some extra effort with particular formats/setups, or use a serialization-specific addressing and/or their own IDs if it's more convenient. The advantage of the former option would be universality (independence from a particular resource, its serialization format and arbitrary IDs -- if you are e.g. a dictionary aggregator, this could make you happy) and readability (even if the address leads nowhere, a human is able to decode/fix it, unlike an address with arbitrary IDs). Of course, we can as well decide to drop all this (John's option 3, and also the current status) which leaves only the latter option.

@michmech @DavidFatDavidF please comment

jmccrae commented 4 months ago

I think that this is getting a bit out of hand for what is a small part of this overall great project. When summarising the issues discussed in this long thread I have been accused of "bringing them again and again without any reasonable justification" and by defining three use cases I am accused of "bringing one arbitrary use case after another". Can we chill it please?

As I have made clear, I am open to compromise (Option 2) although as is clear, my personal opinion is that user-defined identifiers (Option 3) would be superior to content-based ones.

These concerns are based on blocking technical issues that have become clear to me from implementing this system and I have outlined them clearly above.

To implement the compromise option (Option 2) I would propose the following text:

<para>Every top-level model object may be assigned one or more identifiers 
that uniquely determine the path in the DMLex tree structure. These can be used to construct IRIs by 
appending them to the IRI of the root object. The IRI of the root element is the value of its attribute <literal>lexicographicResource.uri</literal>, converted to an IRI according to the algorithm specified in
 [<link linkend="bib_rfc3987">RFC 3987</link>]. IRIs can be constructed using schemes such as the 
following:</para>

<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
<para><literal>lexicographicResource.uri#objectTypeName/objectID</literal></para>

<para>Other schemas may be adopted by applications. This standard does not mandate the adoption of any 
IRI schema or describe what kind of resources are located by IRIs constructed in this way.</para>

etc...

Then all examples are changed so that they do not include the HTTP URI (e.g., entry/cat~1~noun/sense/0~small%20furry%20animal instead of http://www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal). We continue to define objectIDs but do not define IRIs based on them. Our identifiers no longer start with http and thus don't depend on a serialization.
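The two construction patterns in the proposed text can be sketched as below. The function name and the `separator` parameter are hypothetical, and the RFC 3987 URI-to-IRI conversion mentioned in the text is not implemented; the resource URI is passed through as given.

```python
def build_identifier(resource_uri, object_path, separator="/"):
    """Append a scheme-free object path to the resource URI using '/' or '#'."""
    if separator not in ("/", "#"):
        raise ValueError("separator must be '/' or '#'")
    # Drop a trailing separator so the result never contains '//' or '/#'.
    return resource_uri.rstrip("/#") + separator + object_path

resource_uri = "http://www.example.com/lexicon"   # value of lexicographicResource.uri
object_path = "entry/cat~1~noun/sense/0~small%20furry%20animal"

as_path = build_identifier(resource_uri, object_path)
as_fragment = build_identifier(resource_uri, object_path, separator="#")
print(as_path)
print(as_fragment)
```

Under the compromise, only `object_path` would be defined by the standard; how or whether it is attached to a root IRI is left to the application.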

This satisfies Problem 1, as it is much more vague and does not mandate a URI schema so more use cases can be satisfied. Problem 2 is mostly side-stepped as this proposal now doesn't require anything of producers or consumers of data. I also think it is closer to what @mjakubicek has in mind, as he doesn't want a "request-response" mechanism based on serialization, while an HTTP URI requires that you can make an HTTP request and receive a serialized response.

I would also reiterate the proposal to also allow object IDs by listingOrder

EITHER
entry/abandon~0~verb/sense/0~to%20suddenly%20leave%20a%20place%20or%20a%20person
OR
entry_1/sense_1

The adoption of listing order as an alternative mechanism would solve Problem 4, and Problem 3 would be reduced as implementers can choose the option that is more stable for their application.

I am happy to turn this into a PR if others are happy with this.

mjakubicek commented 4 months ago

You are asking why this is important, so I will try to reiterate this:

I know that identifiers are defined at the model-level, which is an abstract level. Abstract models need to be instantiated, and my argument is that this proposal seems impossible to instantiate on any real-world, serialized data (all data is serialized somehow in the real world). You claim that serializations should use different mechanisms such as XPath, are you implying that this is a proposal with no real-world (i.e., serialized) applicability?

This is utter nonsense, the fragment ID is just a string. That's it John, a string. You do whatever you like with it.

This proposal defines IRIs, which identify resources. Resources can be webpages or XML documents and usually are, but they can also be abstract concepts. For representing references to abstract concepts, the Resource Description Framework was invented. As an RDF expert, I have concerns about this proposal. In particular, addressing objects in a serialization-unspecific way usually requires methods like content negotiation to be implementable.

You see John, this is the problem. You're forcing in your world here, that we are not necessarily interested in. Making an IRI does not bring in RDF, nor does it bring in content negotiation. You have to live with the fact that others do not see things that way. An IRI is just a string. Nothing else.

To quote from https://www.ietf.org/rfc/rfc3987.txt:

"An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646)"

The standard also makes it absolutely clear, in multiple places, that IRIs are not bound to a protocol in this regard, e.g.

"Applications using IRIs as identity tokens with no relationship to a protocol MUST use the Simple String Comparison"

This is exactly our case, it's a string, it compares as a string, and it serves as identification of some DMLex entry part for us. We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.
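Treating the identifiers as plain strings, as described here, amounts to the following. This is a sketch of character-by-character comparison only; the function name is hypothetical and no normalization of any kind is applied.

```python
def same_identifier(a: str, b: str) -> bool:
    """Simple string comparison: equal only if the character sequences are identical."""
    return a == b

# Identical character sequences compare equal.
print(same_identifier("entry/cat~1~noun", "entry/cat~1~noun"))

# No normalization happens, so these pairs are different identifiers:
print(same_identifier("entry/caf%C3%A9", "entry/caf%c3%a9"))  # hex digit case differs
print(same_identifier("entry/café", "entry/caf%C3%A9"))       # raw vs percent-encoded
```

One consequence is that producers would need to agree on a single encoding of each value, since strings that differ only in percent-encoding would not match.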

I have implemented this proposal and it took nearly 800 lines of code. I had to change the proposal in a way you find unacceptable to make this work (in order to obtain valid relative URIs in RDF serialization). It is not simple and the implementation discovered several other bugs (etymon should be unique on etymology #116, Should collocateMarker have a uniqueness constraints? #122, Example A1.11 fails uniqueness validation #123 and several documented on this PR). There is also a very simple alternative proposal (Solution 3).

Yes, all those are valid integrity checks that need to be performed, thank you for that. We all know we need to do more of them, to find all the forgotten small bugs in the spec here and there. None of that presents any substantial challenge.

In any case, this discussion leads nowhere. I find all the issues raised by John as void and none of the proposals by John are acceptable for me, particularly not the variant number 3, which is absolutely disastrous as discussed many times.

For the next meeting, I propose voting on this PR as is; and if it is not approved, we simply remove fragment identification from the specs completely and move on.

mjakubicek commented 4 months ago

This satisfies Problem 1, as it is much more vague and does not mandate a URI schema so more use cases can be satisfied. Problem 2 is mostly side-stepped as this proposal now doesn't require anything of producers or consumers of data. I also think it is closer to what @mjakubicek has in mind, as he doesn't want a "request-response" mechanism based on serialization, while an HTTP URI requires that you can make an HTTP request and receive a serialized response.

Lastly: it does NOT. "an HTTP URI requires that you can make an HTTP request". There is no "HTTP URI". Just "URI", and a URI (or IRI, in our case), unlike a URL, does not mandate that you be able to locate the resource. The name of the protocol does not affect this.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

jmccrae commented 4 months ago

@mjakubicek, you continue to make highly uncivil comments on a public forum.

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

There is no "HTTP URI"

HTTP URI is an established term. It is pretty clear it means URIs that use the http scheme.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

I would support this, however I note that it requires a registration process with IANA as described in RFC 8141

mjakubicek commented 4 months ago

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

There is no "HTTP URI"

HTTP URI is an established term. It is pretty clear it means URIs that use the http scheme.

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

I would support this, however I note that it requires a registration process with IANA as described in RFC 8141

Only if we wanted to make our own namespace, which we do not need to do; there are other options (e.g. the tag namespace, maybe others too) which require no central registration.

jmccrae commented 4 months ago

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] "The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738]

My facts are pretty clear.

mjakubicek commented 4 months ago

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] "The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738]

My facts are pretty clear.

Facts are clear in that you now for the first time talk about a URL (i.e. a Uniform Resource Locator, not URI which is Uniform Resource Identifier), which was never discussed and never considered and never mentioned before. What you were saying before was that "an HTTP URI requires that you can make an HTTP request" -- and this is simply not true, and thus all your seemingly necessary implications you were making thereof are not true as well.

jmccrae commented 4 months ago

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Of course, it needs to be implemented and #123 needs a resolution before this PR can be merged.

I also would like us to consider the use of listingOrder as an alternative mechanism, but I can make this a comment on the next CSD.

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts, John. Facts that you present here are simply not true, and you continue presenting them despite being refuted multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] "The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738] My facts are pretty clear.

Facts are clear in that you are now, for the first time, talking about a URL (i.e. a Uniform Resource Locator, not a URI, which is a Uniform Resource Identifier), which was never discussed, considered, or mentioned before. What you were saying before was that "an HTTP URI requires that you can make an HTTP request" -- and this is simply not true, and thus all the seemingly necessary implications you drew from it are not true either.

We have already discussed URLs in fact:

I do not think we ever discussed that we would want the IRIs to be usable as URLs, so this is a far-reaching implicit assumption that is false at this moment. - @mjakubicek

URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs so the assumption is pretty clear. - @jmccrae

That URLs designate such resources means that you only refer to resources that meet these requirements. Being accessible by HTTP means you can access them by making an HTTP request. Hence "an HTTP URL requires that you can make an HTTP request".

mjakubicek commented 4 months ago

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Because this is not what your initial proposal was (this morning), as everyone can read in the thread. I do not want the "#" to be part of "DMLex fragment identification strings", which is what your proposal starts with before continuing with other things, this rename among them.

And that's why I'm double-checking that we both understand that the only change performed would be a wording change, solvable by a simple sed (i.e. find-and-replace) command:

sed 's/IRI/DMLex fragment identification strings/g'

That's it.

We have already discussed URLs in fact:

I do not think we ever discussed that we would want the IRIs to be usable as URLs, so this is a far-reaching implicit assumption that is false at this moment. - @mjakubicek

OK, you got me, we have already ruled them out once ;-)

jmccrae commented 4 months ago

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Because this is not what your initial proposal was (this morning), as everyone can read in the thread. I do not want the "#" to be part of "DMLex fragment identification strings", which is what your proposal starts with before continuing with other things, this rename among them.

And that's why I'm double-checking that we both understand that the only change performed would be a wording change, solvable by a simple sed (i.e. find-and-replace) command:

sed 's/IRI/DMLex fragment identification strings/g'

That's it.

In principle that's right, although a quick look at the text shows that a little more care than a text replacement is needed!
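
To illustrate why a plain text replacement needs care, here is a minimal Python sketch. The sample sentence and the function names are made up for illustration and are not taken from the spec. A literal replacement (the equivalent of the sed command above) turns the plural "IRIs" into "...stringss", while a word-aware pattern keeps singular and plural apart and leaves longer tokens that merely contain "IRI" alone.

```python
import re

TERM = "DMLex fragment identification string"


def naive_rename(text: str) -> str:
    # Equivalent of: sed 's/IRI/DMLex fragment identification strings/g'
    return text.replace("IRI", TERM + "s")


def careful_rename(text: str) -> str:
    # \b restricts the match to the whole word "IRI" or "IRIs";
    # the optional plural "s" is captured and carried over.
    return re.sub(r"\bIRI(s?)\b", lambda m: TERM + m.group(1), text)


# Hypothetical sentence, only for demonstration.
sample = "Fragment IRIs are built from the IRI of the parent."
print(naive_rename(sample))
print(careful_rename(sample))
```

Even the word-aware version leaves things such as articles ("an IRI") to be fixed by hand.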

The other part is removing the http:// prefix. I have a few suggestions here:

# Don't include lexicographicResource.uri at all (do we need it?)
entry/cat~1~noun

# Drop the http://
www.example.com/lexicon/entry/cat~1~noun

# Put the lexicographicResource.uri in brackets (one of the following)
[http://www.example.com/lexicon]entry/cat~1~noun
(http://www.example.com/lexicon)entry/cat~1~noun
<http://www.example.com/lexicon>entry/cat~1~noun

# Put the lexicographicResource.uri after the objectId
entry/cat~1~noun@http://www.example.com/lexicon

All seem good and avoid creating identifiers that are accidentally non-functioning URLs.
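
For comparison, the variants listed above could be generated side by side with something like the following Python sketch. The function name and dictionary keys are hypothetical and nothing here is prescribed by the spec; it only reproduces the example strings from this comment.

```python
def identification_string_variants(resource_uri: str, object_id: str) -> dict[str, str]:
    """Build each proposed form from lexicographicResource.uri and an object ID."""
    # Strip the scheme ("http://") if there is one.
    without_scheme = resource_uri.split("://", 1)[-1]
    return {
        "object_id_only": object_id,
        "no_scheme": f"{without_scheme}/{object_id}",
        "square_brackets": f"[{resource_uri}]{object_id}",
        "round_brackets": f"({resource_uri}){object_id}",
        "angle_brackets": f"<{resource_uri}>{object_id}",
        "uri_after_object_id": f"{object_id}@{resource_uri}",
    }


variants = identification_string_variants(
    "http://www.example.com/lexicon", "entry/cat~1~noun"
)
for name, value in variants.items():
    print(f"{name}: {value}")
```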

vojtech-kovar commented 3 months ago

I don't feel like adding more disagreement to this discussion, and nobody else wrote anything, so I did what you proposed (i.e., renamed IRIs to "DMLex fragment identification strings" and removed the http:// prefix).

Just FTR: Though it is acceptable, I don't agree with it -- I think one of the reasons why we first said URIs and then IRIs is that they can be used as HTTP(S) URLs, which is an advantage, and we are now losing this option (kind of, as adding http:// is in fact not that complex an operation). At the same time, I am not bothered by having many IRIs that don't work as HTTP URLs, or lead nowhere (I think I still don't fully understand John's reasons, but never mind).

I have also addressed the problem with #123, using listingOrder in cases where all the UNIQUE attributes are empty and several objects would otherwise have duplicate IDs. (And the exact semantics of UNIQUEness still needs to be specified more precisely in the text somewhere around 1.3.5, I believe.)
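
A rough Python sketch of that fallback, only to make the rule concrete. The input shape, the function name, and joining the values with "~" (borrowed from the cat~1~noun example) are assumptions for illustration, not the wording of the spec.

```python
from collections import Counter


def fragment_ids(siblings):
    """siblings: list of (unique_values, listing_order) tuples of one parent.

    unique_values holds the values of the UNIQUE properties (possibly empty
    or None); listing_order is the object's listingOrder.
    """
    # Assumed ID construction: join the non-empty UNIQUE values with "~".
    base = ["~".join(v for v in values if v) for values, _ in siblings]
    counts = Counter(base)
    ids = []
    for ident, (_, order) in zip(base, siblings):
        # Fall back to listingOrder only when every UNIQUE value is empty
        # and the empty ID would be shared by more than one sibling.
        if ident == "" and counts[ident] > 1:
            ident = str(order)
        ids.append(ident)
    return ids


# Two objects without any UNIQUE value, and one with a value.
print(fragment_ids([([None], 1), ([""], 2), (["Latin"], 3)]))
```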

jmccrae commented 3 months ago

Okay, sounds like a good fix.

My objection is, I think, not that hard to understand: HTTP URLs that lead nowhere are called broken links, and they cause many problems, not just for the user experience but also for the SEO of websites. Implementing only working HTTP URLs ensures the global uniqueness of these identifiers and prevents malicious attacks.