ietf-tools / relaton-data-ids

Bibliographic data information for Internet-Drafts in Relaton format

Draft versions need to relate to each other #6

Closed strogonoff closed 1 year ago

strogonoff commented 2 years ago
ronaldtse commented 2 years ago

@strogonoff in IETF Internet-Drafts, the last two numeric digits provide a sequential "draft number", as described here: https://github.com/ietf-ribose/bibxml-service/issues/13

i.e.

Pattern 2: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.draft-{example-name}-{draft-number}.xml

The draft number is a strictly increasing number validated by the Datatracker service when first uploaded. From the filename, it can be immediately inferred that draft-tsou-softwire-6rd-multicast-02 supersedes draft-tsou-softwire-6rd-multicast-01.

However, some drafts start with a draft number 00. If we know that a 00 document exists, we know 01 supersedes it. If we only have 01, we do not know if 00 exists.
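As a rough illustration of the inference described above, the sketch below splits a "Pattern 2" file name into a draft name and a draft number and compares two versions. The regular expression, the `DraftRef` type, and the function names are assumptions made for this example, not part of any existing tool, and the draft number is assumed to be exactly two digits.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of "Pattern 2" file names; the regex is illustrative only.
VERSIONED = re.compile(r"^reference\.I-D\.(draft-.+)-(\d{2})\.xml$")


class DraftRef(NamedTuple):
    name: str     # e.g. "draft-tsou-softwire-6rd-multicast"
    version: int  # the trailing two-digit draft number, e.g. 2


def parse_versioned(filename: str) -> Optional[DraftRef]:
    """Split a versioned file name into (name, draft number), or return None."""
    match = VERSIONED.match(filename)
    if match is None:
        return None
    return DraftRef(match.group(1), int(match.group(2)))


def supersedes(newer: DraftRef, older: DraftRef) -> bool:
    """True if both refs belong to the same draft and `newer` has a higher number."""
    return newer.name == older.name and newer.version > older.version
```

Note that this only tells us that -02 supersedes -01 by name; it says nothing about whether the -00 or -01 files are actually present in the data.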

strogonoff commented 2 years ago

@ronaldtse Why was this closed? There are documents that supersede each other. Relaton provides a field for that, but it’s not being used.

This shouldn’t be difficult at GHA stage. While parsing 01 check if version 00 exists, and if so then create a relation. Different approaches are possible (a single pass with drafts sorted by ID in advance, or a double pass that fills in relations after initial metadata is formed).

strogonoff commented 2 years ago

In fact, no, we should not even check if versions exist. If we know at build time that draft 01 supersedes 00, it is semantically correct to link to 00. Maybe it doesn’t exist yet, but that’s not the problem of Relaton.
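A minimal sketch of the single-pass idea, assuming drafts are already available as (name, draft number) pairs. The relation type strings and the dictionary layout are hypothetical placeholders, not the actual Relaton schema. The backward link is created unconditionally, as argued above, while the forward link is only attached to items we actually hold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Relation = Dict[str, str]


def build_relations(drafts: Iterable[Tuple[str, int]]) -> Dict[str, List[Relation]]:
    """Map each "name-NN" ID to its relations, in one pass over sorted input."""
    relations: Dict[str, List[Relation]] = defaultdict(list)
    known = set()
    for name, version in sorted(drafts):
        docid = f"{name}-{version:02d}"
        known.add(docid)
        if version > 0:
            previous = f"{name}-{version - 1:02d}"
            # Link backwards unconditionally: NN supersedes NN-1 even if
            # NN-1 is missing from our data.
            relations[docid].append({"type": "supersedes", "docid": previous})
            # The forward link can only be stored on an item we hold.
            if previous in known:
                relations[previous].append({"type": "superseded-by", "docid": docid})
    return dict(relations)
```

Because the input is sorted, a previous version (if present) has always been seen before its successor, so no second pass is needed.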

strogonoff commented 2 years ago

This is important from a UX standpoint. BibXML service provides a way to go back/forward to superseded/superseding versions, and missing relations mean readers don’t get this opportunity with Internet Drafts.

ronaldtse commented 2 years ago

@strogonoff can you explain what we need to do here? You want Relaton to analyse the supersession relationships? If so please help file a ticket at relaton-ietf.

strogonoff commented 2 years ago

@strogonoff can you explain what we need to do here? You want Relaton to analyse the supersession relationships? If so please help file a ticket at relaton-ietf.

I think they should, because they supersede each other, but I have filed this as a question for a reason.

andrew2net commented 2 years ago

I don't understand what the question is. If we need to implement the relations for Internet-Drafts, then ok, it's possible to do.

TonyLHansen commented 2 years ago

11 is absolutely correct: there are two patterns of the names

Legacy pattern(s) to implement:

Pattern 1: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.{example-name}.xml
Pattern 2: https://{hostname}/public/rfc/bibxml-ids/reference.I-D.draft-{example-name}-{draft-number}.xml

(The draft number will sometimes be referred to as the sequence number or generation number.) (Note: The "draft-" prefix (after "reference.I-D.") is an important part of the differentiator for the patterns to indicate that there IS a draft number at the end.)

I did a check of the IDs collected on tools.ietf.org. There are 36016 drafts with -00, and 22662 with -01, sequence numbers. Out of the -01 drafts, there are only 268 where a -00 is not also saved. So about 1.2% of the drafts with a -01 saved there did not have a -00 preceding them. So with roughly 98.8% certainty, I can claim that series almost always start with a draft number of -00.

However, the IDs collected on tools.ietf.org are not complete. There are cases where, for example, a -07 is stored, but the data tracker has evidence of -00 through -06 existing. On the flip side, there are a number of drafts that tools.ietf.org has that the datatracker doesn't.

We can say with confidence that there is a relationship between -00 and -01, between -01 and -02, and so on.

In the rfc-index, there are some documents that were assigned numbers, but were never issued. They are still catalogued, but the data for them says "Not issued".

I think the best path forward is to assume that the relationship exists, but in some strange cases, a given sequence number might not have been issued or is missing from the various databases.

strogonoff commented 2 years ago

@andrew2net: I filed this to at least clarify how things should work, even if nothing is to be done. I work on a service that provides access to the data, but I am not deeply familiar with what the data should look like or with the organizational specifics.

To me it looked like Internet Draft versions are documents that supersede each other. If so, they should probably be related this way, and if e.g. no data for a previous version exists (in cases like the ones @TonyLHansen pointed out), the relation could be empty.

Or maybe Internet Draft versions are actually a single document, just with version history exposed as separate bibliographic items. That would imply that in our data a single document does not mean a single bibliographic item, and this should be clarified. (The service already kind of allows this, by putting multiple bibliographic items with the same identifier, which I-D versions have, on the same page, but whether it’s a good design decision or a workaround for the lack of relationships is unclear.) Then adding relations could be conceptually wrong? I don’t know. It’s a subtle distinction…

TonyLHansen commented 2 years ago

The drafts do form a series of documents, with each version superseding the previous one in the series. There is a definite relationship.

Each version can also be individually referenced, allowing us to reference something that was said specifically in (say) version -03 of the draft, and something else that was said specifically in (say) version -17.

I don't know if it affects the work here, but there are also relationships between series when drafts become adopted by working groups, or divorced from working groups. The datatracker knows most of this data. These relationships probably do NOT need to be stored in this database.

ronaldtse commented 2 years ago

Thanks @TonyLHansen , agree that we should have a main "I-D" that groups the versions together whenever possible.

That said, there is a potential issue with recognizing the name of the document. There are 200 documents whose names end with '-\d+'.

e.g.

If we limit the pattern to end with \-\d\d, we still have 49:

reference.I-D.draft-farmer-6man-exceptions-64-00.xml
reference.I-D.draft-farmer-6man-exceptions-64-01.xml
reference.I-D.draft-farmer-6man-exceptions-64-02.xml
reference.I-D.draft-farmer-6man-exceptions-64-03.xml
reference.I-D.draft-farmer-6man-exceptions-64-04.xml
reference.I-D.draft-farmer-6man-exceptions-64-05.xml
reference.I-D.draft-farmer-6man-exceptions-64-06.xml
reference.I-D.draft-farmer-6man-exceptions-64-07.xml
reference.I-D.draft-farmer-6man-exceptions-64-08.xml
reference.I-D.draft-farmer-6man-exceptions-64-09.xml
reference.I-D.draft-farmer-6man-routing-64-00.xml
reference.I-D.draft-farmer-6man-routing-64-01.xml
reference.I-D.draft-farmer-6man-routing-64-02.xml
reference.I-D.draft-ietf-16ng-ip-over-ethernet-over-802-dot-16-12.xml
reference.I-D.draft-ietf-ipsec-ah-hmac-md5-96-00.xml
reference.I-D.draft-ietf-ipsec-ah-hmac-sha-1-96-00.xml
reference.I-D.draft-ietf-ipsec-auth-hmac-md5-96-02.xml
reference.I-D.draft-ietf-ipsec-auth-hmac-ripemd-160-96-03.xml
reference.I-D.draft-ietf-nfsv4-03-00.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-00.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-01.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-02.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-03.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-04.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-05.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-06.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-07.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-08.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-09.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-10.xml
reference.I-D.draft-ietf-tsvwg-ieee-802-11-11.xml
reference.I-D.draft-mahy-sipping-16-04.xml
reference.I-D.draft-songlee-aes-cmac-96-04.xml
reference.I-D.draft-spaghetti-idr-deprecate-8-9-10-00.xml
reference.I-D.draft-srinivasan-fr-over-mpls-with-frf-16-00.xml
reference.I-D.draft-szigeti-tsvwg-ieee-802-11-00.xml
reference.I-D.draft-szigeti-tsvwg-ieee-802-11-01.xml
reference.I-D.draft-szigeti-tsvwg-ieee-802-11-02.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-00.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-01.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-02.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-03.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-04.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-05.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-06.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-07.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-08.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-09.xml
reference.I-D.draft-zhang-mif-api-extension-802-21-10.xml

The confusion is this: if we are given a document reference reference.I-D.draft-szigeti-tsvwg-ieee-802-11.xml, the service does not necessarily know whether this is a versioned I-D or an unversioned I-D.

This is possibly not a big deal for existing documents, since we can index this pattern. However, for new documents, the IETF may want to enforce that an I-D name cannot end in two digits.

I don't know if it affects the work here, but there are also relationships between series when drafts become adopted by working groups, or divorced from working groups. The datatracker knows most of this data. These relationships probably do NOT need to be stored in this database.

Probably not. This is interesting information but probably not necessary for citation purposes.

TonyLHansen commented 2 years ago

This statement is false:

The confusion is this: if we are given a document reference reference.I-D.draft-szigeti-tsvwg-ieee-802-11.xml, the service does not necessarily know whether this is a versioned I-D or an unversioned I-D.

That is where the "draft-" at the beginning comes into play. This is ONLY present with the versioned documents. For the UN-versioned names, "draft-" must NOT be at the beginning of the name. (After "reference.I-D.", of course.)

reference.I-D.draft-zhang-mif-api-extension-802-21-00.xml # versioned
reference.I-D.zhang-mif-api-extension-802-21.xml # unversioned
reference.I-D.zhang-mif-api-extension-802-21-00.xml # NOT ALLOWED
reference.I-D.draft-zhang-mif-api-extension-802-21.xml # NOT ALLOWED

There is NO ambiguity in the reference names.
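One way to read the rule above, sketched under assumptions: the "draft-" prefix alone decides how a file name is interpreted, and whether the result is valid then depends on an index of known drafts. The `known` mapping (draft name to the set of draft numbers on record), the function names, and the bare status codes are all hypothetical and used only for illustration.

```python
import re
from typing import Dict, Optional, Set, Tuple

FILE_NAME = re.compile(r"^reference\.I-D\.(?P<stem>.+)\.xml$")
VERSIONED_STEM = re.compile(r"(draft-.+)-(\d{2})")


def interpret(filename: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (draft name, draft number or None), or None if unparseable."""
    match = FILE_NAME.match(filename)
    if match is None:
        return None
    stem = match.group("stem")
    if stem.startswith("draft-"):
        # "draft-" present: the trailing two digits are always the draft number.
        versioned = VERSIONED_STEM.fullmatch(stem)
        if versioned is None:
            return None
        return versioned.group(1), int(versioned.group(2))
    # No "draft-": the whole stem is the name, even if it ends in digits.
    return "draft-" + stem, None


def status_for(filename: str, known: Dict[str, Set[int]]) -> int:
    """Return 200 if the file name resolves to a known draft, else 404."""
    parsed = interpret(filename)
    if parsed is None:
        return 404
    name, version = parsed
    if name not in known:
        return 404
    if version is not None and version not in known[name]:
        return 404
    return 200
```

With an index that holds only draft-zhang-mif-api-extension-802-21 with numbers 00 through 10, the first two names above resolve and the two "NOT ALLOWED" names fall through to 404: one is read as a draft named "...-802-21-00", the other as number 21 of a draft named "...-802", and neither is on record.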

ronaldtse commented 2 years ago

@TonyLHansen thank you for the clarification! I have been confused about this all along.

ronaldtse commented 2 years ago

@TonyLHansen I guess the real confusion is this: in the current implementation, all of these paths work.

Identical data:

Identical data:

Should we reject the invalid paths?

TonyLHansen commented 2 years ago

Traditionally, the invalid paths did not work.

When the TCL scripts broke, I "temporarily" switched to a redirect script that fails to block the invalid paths.

But please do reject the invalid paths in the new implementation.

andrew2net commented 2 years ago

That said, there is a potential issue with recognizing the name of the document. There are 200 documents whose names end with '-\d+'.

@ronaldtse I noticed the Internet-Draft documents have an anchor without a version. For example, the document draft-weis-gdoi-iec62351-9-00 has an anchor I-D.weis-gdoi-iec62351-9. It allows recognizing the version in the name of the document.
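A small sketch of how the anchor could be used for this, assuming the anchor is always "I-D." followed by the draft name without its "draft-" prefix and without the draft number, as in the example above. The function name is made up for illustration.

```python
from typing import Optional


def version_from_anchor(docname: str, anchor: str) -> Optional[str]:
    """Return the draft number of `docname`, using the unversioned anchor."""
    if not anchor.startswith("I-D."):
        return None
    # Rebuild the unversioned draft name from the anchor.
    base = "draft-" + anchor[len("I-D."):]
    if not docname.startswith(base + "-"):
        return None
    # Whatever follows the base name must be the draft number.
    version = docname[len(base) + 1:]
    return version if version.isdigit() else None
```

For the example above, `version_from_anchor("draft-weis-gdoi-iec62351-9-00", "I-D.weis-gdoi-iec62351-9")` yields "00", so the trailing "-9" is correctly treated as part of the name rather than as a draft number.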

ronaldtse commented 2 years ago

Thank you @TonyLHansen for the determination, this will greatly help @strogonoff refine the correct behavior for the BibXML service!

strogonoff commented 2 years ago

@ronaldtse Regarding this:

I guess the real confusion is this: in the current implementation, all of these paths work.

Actually, this issue does not really relate to returned XML data, and actually should not affect xml2rfc paths or XML output. The relationships are made use of in GUI only, when users search for and explore documents.

ronaldtse commented 2 years ago

this issue does not really relate to returned XML data, and actually should not affect xml2rfc paths or XML output. The relationships are made use of in GUI only, when users search for and explore documents.

@strogonoff maybe there is some misunderstanding in my comment https://github.com/ietf-ribose/relaton-data-ids/issues/6#issuecomment-1047371606. It does affect the URL path patterns.

These paths work, and are correct:

These paths work right now, but are incorrect and should not work (return a 404 instead):

What I meant is that the BibXML service should reject the last two path patterns, which is what @TonyLHansen requested.

strogonoff commented 2 years ago

@ronaldtse

These paths work right now, but are incorrect and should not work (return a 404 instead):

This may be a fine distinction, but I believe the requirement was that preexisting paths should return correct data, while behavior for not-exactly-matching paths was not specified (so if not-exactly-correct path returns the same data, it does not violate that requirement).

If it is a requirement that non-matching paths must return 404, then some logic in the xml2rfc path compatibility layer needs to be adjusted ASAP.

Based on Tony’s comment requesting

do reject the invalid paths in the new implementation.

I take it that we need this. Should be done within the upcoming week…

strogonoff commented 2 years ago

Correction, I think the issue is less global than I initially thought. I’ll just make it so that versioned URLs for I-Ds are rejected, while other xml2rfc-style paths maintain their existing behavior.

(NOTE: this means non-exactly-matching xml2rfc paths may return bibliographic data and not 404. If this is definitely undesirable let me know. It may be tricky to implement since we need to deal with new bibliographic data being available under xml2rfc paths.)

Should be done by Monday (https://github.com/ietf-ribose/bibxml-service/issues/157)

ronaldtse commented 2 years ago

Not sure why this is so complicated?

It just means:

In the following cases, the path should return 404:

This change ONLY applies to I-Ds.

(NOTE: this means non-exactly-matching xml2rfc paths may return bibliographic data and not 404. If this is definitely undesirable let me know. It may be tricky to implement since we need to deal with new bibliographic data being available under xml2rfc paths.)

This should not happen because by definition, the name of a draft never starts with draft-xxx.

strogonoff commented 2 years ago

This change ONLY applies to I-Ds.

Yes, once I understood that, it is a simple change. At first I thought this was a request for all xml2rfc paths. Due to fuzzy matching, it is by design that they may return valid data for more than one path, so inexact paths do not guarantee 404.

I-D behavior is a special case of the above behavior, and a specific provision for I-Ds can be made to return 404 for versioned paths.

strogonoff commented 2 years ago

Since https://github.com/ietf-ribose/relaton-data-ids/issues/15 is stalled for now (Nick is against using primary ID and I can’t switch to docnumber), let’s add these superseded/supersedes relations between I-D versions? GUI needs to give the user a way to navigate to the latest draft, at least by clicking through relations. cc @ronaldtse

ronaldtse commented 2 years ago

Sorry I think this thread got sidetracked possibly by my comment a while ago.

What @strogonoff needs here is this:

These relationships are important for the BibXML service to be able to show an I-D in a series of versions.

@andrew2net can you help implement this in relaton-ietf?

andrew2net commented 2 years ago

These relationships are important for the BibXML service to be able to show an I-D in a series of versions.

@andrew2net can you help implement this in relaton-ietf?

@ronaldtse sure, I can, as soon as I finish relaton-w3c and relaton-bipm.

ronaldtse commented 2 years ago

@strogonoff this issue belongs in relaton-ietf. I'm creating an issue there.

ronaldtse commented 2 years ago

Since #15 is stalled for now (Nick is against using primary ID and I can’t switch to docnumber), let’s add these superseded/supersedes relations between I-D versions? GUI needs to give the user a way to navigate to the latest draft, at least by clicking through relations. cc @ronaldtse

I've created the corresponding issues that deal with this. I believe this is sufficient for the current use case, let me know if not.

ronaldtse commented 2 years ago

@strogonoff is this completed? If so please help close it. Thanks.

rjsparks commented 1 year ago

@strogonoff, @ronaldtse - I'm closing this - if there's anything left to do, please reopen, or better yet - create other issues for what remains.