Implement legacy path pattern: IETF Internet-Drafts (`bibxml3`, `bibxml-id`)

ronaldtse commented 3 years ago

IETF Internet-Drafts (bibxml3, bibxml-id)

(previous location) http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/

Legacy pattern(s) to implement:

Pattern 1: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.example-name.xml
Pattern 2: http://xml2rfc.ietf.org/public/rfc/bibxml-ids/reference.I-D.draft-example-name-99.xml

We need to parse the pattern to return the appropriate BibXML content.
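
As a rough illustration of that parsing step, the sketch below splits the two legacy filename shapes into a draft name and an optional revision. The function name `parse_legacy_id_filename`, the optional `draft-` prefix handling, and the two-digit revision rule are assumptions made for illustration; this is not the service's actual implementation.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: "reference.I-D." prefix, an optional "draft-" prefix,
# a lazily matched name, an optional two-digit revision suffix, then ".xml".
LEGACY_ID_FILENAME = re.compile(
    r"^reference\.I-D\.(?:draft-)?(?P<name>[A-Za-z0-9._+-]+?)"
    r"(?:-(?P<rev>\d{2}))?\.xml$"
)


def parse_legacy_id_filename(filename: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (name, rev) for a legacy I-D filename, or None if it does not match.

    rev is None for version-less filenames (Pattern 1).
    """
    match = LEGACY_ID_FILENAME.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("rev")
```

One caveat with this approach: a draft whose name itself ends in two digits cannot be told apart from a versioned request by the filename alone, so a real resolver would probably need to check candidates against the dataset.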

strogonoff commented 3 years ago

@ronaldtse where is the legacy/XML data repository for bibxml3 in https://github.com/ietf-ribose?

ronaldtse commented 3 years ago

@strogonoff the data source is not available yet for bibxml3/bibxml-ids.

ronaldtse commented 3 years ago

Legacy path specification described here: https://github.com/ietf-ribose/bibxml-data-ids/issues/1

The bibxml3 endpoint matches URLs this way:
    url(r'^bibxml3/%(name)s(?:-%(rev)s)?.xml$' % settings.URL_REGEXPS, views_doc.document_bibxml),
(though an upcoming release will likely escape the . before xml.)

so, yes, you want to be looking at names that look like https://datatracker.ietf.org/doc/bibxml3/draft-ietf-stir-passport-rcd-09.xml

Once you've generated something with a version, it won't change, but you will also need to be able to respond to version-less requests, such as:
https://datatracker.ietf.org/doc/bibxml3/draft-ietf-stir-passport-rcd.xml
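
To make the matching behaviour concrete, here is a minimal sketch that reproduces the quoted URL regexp with plain `re`. The `URL_REGEXPS` values below are assumed stand-ins (the real ones live in the Datatracker settings and may differ), and `parse` is a hypothetical helper. It shows that both versioned and version-less paths match, and why escaping the `.` before `xml` matters.

```python
import re

# Assumed stand-ins for settings.URL_REGEXPS; the real values may differ.
URL_REGEXPS = {
    "name": r"(?P<name>[A-Za-z0-9._+-]+?)",
    "rev": r"(?P<rev>[0-9]{2})",
}

# The pattern as quoted above (unescaped "." before "xml") ...
UNESCAPED = re.compile(r'^bibxml3/%(name)s(?:-%(rev)s)?.xml$' % URL_REGEXPS)
# ... and the variant with the "." escaped.
ESCAPED = re.compile(r'^bibxml3/%(name)s(?:-%(rev)s)?\.xml$' % URL_REGEXPS)


def parse(pattern, path):
    """Return (name, rev) if path matches pattern, else None."""
    match = pattern.match(path)
    return (match.group("name"), match.group("rev")) if match else None


versioned = parse(ESCAPED, "bibxml3/draft-ietf-stir-passport-rcd-09.xml")
versionless = parse(ESCAPED, "bibxml3/draft-ietf-stir-passport-rcd.xml")

# With the unescaped ".", any character is accepted before "xml".
loose = parse(UNESCAPED, "bibxml3/draft-ietf-stir-passport-rcd-09Zxml")
strict = parse(ESCAPED, "bibxml3/draft-ietf-stir-passport-rcd-09Zxml")
```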

strogonoff commented 3 years ago

Note: we don’t parse name or rev from data; the only formatting variable currently available in a legacy path pattern is {ref}, which represents our canonical reference obtained from the filename.

Support for more formatting variables will be filed separately.
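
For illustration only, a single-variable template such as the one described above could be turned into a matcher roughly like this. The template string, the function names, and the example ref value are all hypothetical; only the `{ref}` variable and the shared `reference.` prefix come from this thread.

```python
import re
from typing import Optional

# Hypothetical template; {ref} is the only supported formatting variable.
LEGACY_PATH_TEMPLATE = "reference.{ref}.xml"


def template_to_regex(template: str):
    """Compile a template with exactly one {ref} placeholder into a regex."""
    prefix, placeholder, suffix = template.partition("{ref}")
    if not placeholder:
        raise ValueError("template must contain {ref}")
    # Everything around {ref} is matched literally.
    return re.compile(
        "^" + re.escape(prefix) + r"(?P<ref>.+)" + re.escape(suffix) + "$"
    )


def extract_ref(template: str, filename: str) -> Optional[str]:
    """Return the {ref} part of filename, or None if it does not fit the template."""
    match = template_to_regex(template).match(filename)
    return match.group("ref") if match else None
```

The same template works in the other direction with `LEGACY_PATH_TEMPLATE.format(ref=...)`, which is why a pattern limited to `{ref}` cannot express a separate name and rev.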

strogonoff commented 3 years ago

@ronaldtse

The first pattern in ticket description does not match the second pattern in your last comment.
If we use the second pattern,
1. I need to know which Relaton fields correspond to “rev” and “name” in this pattern.
  
  We don’t have Relaton data for bibxml-id, but we can use NIST for example: http://34.229.41.119:8000/api/v1/ref/nist/NISTIR_4790/
  
  What is “rev” there?
  
  Note: if “rev” can be missing for some citations, those citations may be inaccessible by their legacy paths.
2. The reference. prefix is shared for all legacy paths. If it shouldn’t be shared for bibxml-id, let me know.

ronaldtse commented 3 years ago

> The first pattern in ticket description does not match the second pattern in your last comment.

Let me clarify:

The patterns in the ticket description are the legacy paths for the BibXML service, currently defined here: https://svn.ietf.org/svn/tools/xml2rfc/website/rfcs/bibxml/bibxml-ids/gen-bibxml-ids

This is code from the Datatracker service given by @rjsparks:

url(r'^bibxml3/%(name)s(?:-%(rev)s)?.xml$' % settings.URL_REGEXPS, views_doc.document_bibxml),

The source is: https://github.com/ietf-svn-conversion/ietfdb-final/blob/c6fc13a38ef66d2c2b6d4931627ffd1cbdb4aa98/ietf/doc/urls.py#L89-L90

The Datatracker service is the "authoritative" endpoint for I-D documents.

strogonoff commented 3 years ago

This doesn’t answer which pattern we should match. Is Datatracker another external system we need to support? Do we need to support multiple patterns for different legacy systems? Or is Datatracker of interest to the GHA that prepares authoritative data for indexing, and not to this public/legacy API service?
See also note 2ii, which I edited in, about the reference. prefix.

ronaldtse commented 3 years ago

We should implement the legacy pattern in the original post. The Datatracker system is for indexing and updating purposes.
Yes, the "reference." prefix is used for all legacy paths.

strogonoff commented 3 years ago

Ah, great… I think that means #28 would be unnecessary so far.

strogonoff commented 3 years ago

Although, if filenames in our future bibxml-data-ids dataset don’t contain the “draft” prefix or “draft-number” suffix, the extra flexibility might still be required to support specified path patterns.

ronaldtse commented 2 years ago

> 2. If we use the second pattern, I need to know which Relaton fields correspond to “rev” and “name” in this pattern. We don’t have Relaton data for bibxml-id, but we can use NIST for example: http://34.229.41.119:8000/api/v1/ref/nist/NISTIR_4790/

The Relaton models for IETF ID and NIST differ a lot. So let's not make that comparison.

strogonoff commented 2 years ago

Here is a report for a random subset of 128 paths (out of 90k+ total) under bibxml3: bibxml3-random-subset.zip

Most paths seem to fall back to the original xml2rfc data; the others resolve automatically to the correct new bibitems in relaton-data-ids, in which case the XML differs and diffs are available in the report. The diffs seem to be manageable.

Testing all paths would take a while and incur many requests to Datatracker (part of path resolution logic) and xml2rfc tools (for reference comparison), but could be done.
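
A random subset like the one above can be drawn along these lines. This is a sketch with a hypothetical helper name, not the code of the actual test script; the default of 128 simply mirrors the subset size mentioned in this comment.

```python
import random
from typing import List, Optional, Sequence


def pick_random_subset(
    paths: Sequence[str], size: int = 128, seed: Optional[int] = None
) -> List[str]:
    """Pick up to `size` distinct paths at random.

    Passing a seed makes the selection reproducible between runs.
    """
    rng = random.Random(seed)
    return rng.sample(list(paths), min(size, len(paths)))
```

Using a fixed seed would let two runs (for example, before and after a fix) be compared on the same paths.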

rjsparks commented 2 years ago

If needed, we could build a self-contained test instance with all the needed components (dev instance of the datatracker, etc.) and do a walk of the entire dataset without affecting the production datatracker, and (I assume) not needing significant other external I/O.

strogonoff commented 2 years ago

> If needed, we could build a self-contained test instance with all the needed components (dev instance of the datatracker, etc.) and do a walk of the entire dataset without affecting the production datatracker, and (I assume) not needing significant other external I/O.

Absolutely, this could help.

Right now, using a URL other than “https://datatracker.ietf.org” as the Datatracker API root requires a change in the code (datatracker.request.BASE_DOMAIN), but it’s straightforward to edit the file before running docker-compose. (I could move this value to configuration or environment if warranted.)

Otherwise there should be no issues. The test script can be passed a local BibXML service instance’s URL:

mkdir -p reports && \
    python test_paths.py \
    http://localhost:8000/public/rfc \
    /path/to/local/bibxml-data-archive \
    --dirname bibxml3 --verbosity 2 --reports-dir reports --randomize
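
If the Datatracker API root were moved to the environment, as floated above, it might look roughly like the following. The variable name `DATATRACKER_BASE_DOMAIN` and the helper `get_base_domain` are assumptions for illustration; the only names taken from this thread are `BASE_DOMAIN` and the default URL.

```python
import os
from typing import Mapping

# Default taken from the comment above.
DEFAULT_BASE_DOMAIN = "https://datatracker.ietf.org"


def get_base_domain(environ: Mapping[str, str] = os.environ) -> str:
    """Return the Datatracker API root, preferring an environment override.

    DATATRACKER_BASE_DOMAIN is a hypothetical variable name.
    """
    value = environ.get("DATATRACKER_BASE_DOMAIN", "").strip()
    # Drop a trailing slash so paths can be appended uniformly.
    return value.rstrip("/") if value else DEFAULT_BASE_DOMAIN


BASE_DOMAIN = get_base_domain()
```

A local dev instance could then be targeted by setting that variable in the docker-compose environment instead of editing the file.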

rjsparks commented 2 years ago

For the datatracker, you can build a local dev copy quickly. Just clone the datatracker repo and run (cd docker; ./run). There's more at the GitHub project page.

ietf-tools / bibxml-service

Implement legacy path pattern: IETF Internet-Drafts (`bibxml3`, `bibxml-id`) #13