xml2rfc legacy path XML response format & IETF migration strategy

strogonoff commented 2 years ago

The optimal approach depends primarily on the following factors:

Whether most xml2rfc citations are expected to have counterparts in bibxml-data datasets
The strategy IETF takes when migrating from xml2rfc tools to the new IETF BibXML service

Per our initial concept (also expressed by Ronald in https://github.com/ietf-ribose/bibxml-service/issues/18#issuecomment-981132588), we have so far been following approach 1. However, we may be overlooking other approaches, and approach 3 in particular is much simpler to implement. Before spending more effort on legacy path support than we already have, I think it would be good to clarify the strategy IETF plans to take when migrating from xml2rfc tools to the new IETF BibXML service.

Approach 1

May be optimal if: BibXML service can return new (extended) citation data for given legacy path hit, and most xml2rfc citations have (or will have) counterparts in our bibxml-data datasets.

On BibXML service level:

Retrieval API/GUI: if RefNotFound is raised during current handling of a legacy path pattern (meaning we could not parse the legacy pattern or find a matching citation), attempt to request the file from xml2rfc before returning HTTP 404.
- If needed, we can crawl xml2rfc to avoid the extra network trip (but we already incur extra trips with DOI, for example)
- Extend current handling of legacy path patterns to support manual xml2rfc -> bibxml-data map, which can be stored in a Git repo (better from persistence standpoint) and indexed at service start, or in PostgreSQL. (Use this map before falling back to xml2rfc.)
Indexer management UI: add an xml2rfc migration helper. List files from xml2rfc and show which citations they resolve to in the new system. Provide a way to search for a matching citation in the new dataset and map the xml2rfc filename to it manually where needed.
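
The lookup order described for Approach 1 can be sketched as follows. This is a minimal illustration, not the service's actual code: `resolve_legacy_path`, the `manual_map` and `index` dictionaries, and the `fetch_from_xml2rfc` callable are hypothetical names. Only `RefNotFound` comes from the discussion above, and it is redefined here so the sketch is self-contained.

```python
from typing import Callable, Mapping, Optional


class RefNotFound(Exception):
    """Stand-in for the service's RefNotFound error (redefined for this sketch)."""


def resolve_legacy_path(
    legacy_path: str,
    manual_map: Mapping[str, str],
    index: Mapping[str, str],
    fetch_from_xml2rfc: Callable[[str], Optional[str]],
) -> str:
    """Return XML for a legacy xml2rfc path, trying sources in order."""
    # 1. The manually maintained xml2rfc -> bibxml-data map takes priority.
    #    Using the path itself as the key when unmapped is an assumption that
    #    stands in for the service's automatic legacy pattern parsing.
    ref_id = manual_map.get(legacy_path, legacy_path)

    # 2. Look the (possibly remapped) reference up in the indexed datasets.
    if ref_id in index:
        return index[ref_id]

    # 3. Last resort: request the original file from xml2rfc before giving up.
    xml = fetch_from_xml2rfc(legacy_path)
    if xml is not None:
        return xml

    # The caller would translate this into an HTTP 404.
    raise RefNotFound(legacy_path)
```

Passing the xml2rfc fetcher in as a callable keeps the extra network trip easy to replace with a crawled local copy, should that turn out to be preferable.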

Approach 2

May be optimal if: BibXML service can return new (extended) citation data for given legacy path hit, regardless of whether xml2rfc citations have preexisting counterparts in bibxml-data.

On GHA component level, treat xml2rfc tools as another standard metadata source; crawl/parse and include in final datasets accordingly.
- GHA can generate a static map xml2rfc filename -> new canonical reference, in cases where filenames change and if legacy filenames must be maintained.

Approach 3

May be optimal if: All xml2rfc citations have bibxml-data counterparts; xml2rfc consumers are sensitive to citation data structure; xml2rfc consumers will be rewritten to new BibXML service API URLs as part of migration away from xml2rfc.

Have the BibXML indexer just crawl the entirety of xml2rfc, and serve it as static files under old filenames, without the fancy mappings.

ronaldtse commented 2 years ago

As discussed via voice, we have to do something like Approach 1.

Static data sets:

Updates: purely handled by GitHub Actions
Storage: stored at GitHub bibxml/relaton-data-*
Indexing: BibXML indexer pulls updates from Git whenever suitable
Mechanism
1. Upon request, find information from index (bulk storage)

Dynamic data set (only DOI):

Updates: Every request gets sent to authoritative source
Storage: No backend storage
Caching: Caches data for a fixed period of time, then flushes
Mechanism
1. Upon request, respond with a cache hit
2. On cache miss, proxy to authoritative source, then store in cache
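
A minimal sketch of the cache-then-proxy mechanism for the dynamic data set, assuming a fixed per-entry time-to-live. The class name `TTLProxyCache`, its parameters, and the injectable `clock` are illustrative choices, not part of the service.

```python
import time
from typing import Callable, Dict, Tuple


class TTLProxyCache:
    """Cache-then-proxy lookup with a fixed time-to-live per entry."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # proxies to the authoritative source
        self._ttl = ttl_seconds      # how long an entry stays fresh
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        # 1. Cache hit: answer locally while the entry is still fresh.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # 2. Cache miss or expired entry: proxy to the source, then store.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

Expired entries are replaced lazily on the next request rather than flushed by a background job, which is one of several reasonable ways to implement "caches for a fixed period, then flushes".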

Hybrid data set (Internet-Drafts):

Updates: Handled by GitHub Actions in bulk. Incremental updates proxied to Datatracker, then cached.
Storage: Bulk storage at GitHub bibxml/relaton-data-*
Indexing: BibXML indexer pulls updates from Git whenever suitable
Caching: Caches data for a fixed period of time, then flushes
Mechanism
1. Upon request, respond with a cache hit
2. On cache miss, check index (bulk storage)
3. If not in bulk storage, fetch from datatracker, then store in cache
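
The three-step hybrid mechanism could look roughly like this. The function name `lookup_draft` and its arguments are hypothetical, and cache expiry is omitted to keep the sketch short.

```python
from typing import Callable, Dict, Mapping, Optional


def lookup_draft(
    ref: str,
    cache: Dict[str, str],
    bulk_index: Mapping[str, str],
    fetch_from_datatracker: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Resolve an Internet-Draft reference: cache, then bulk index, then Datatracker."""
    # 1. Upon request, respond with a cache hit.
    if ref in cache:
        return cache[ref]
    # 2. On cache miss, check the index built from bulk storage.
    if ref in bulk_index:
        return bulk_index[ref]
    # 3. If not in bulk storage, fetch from Datatracker, then store in cache.
    xml = fetch_from_datatracker(ref)
    if xml is not None:
        cache[ref] = xml
    return xml
```

Only Datatracker results are written back to the cache here, since bulk-indexed entries are already available locally.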

strogonoff commented 2 years ago

I thought datatracker pings us, not the other way around?

strogonoff commented 2 years ago

After our 1:1, I believe the approach will have to change. The GUI for static mapping is scrapped, and instead of having three types of datasets we might assign each dataset various types of sources. This would make it easier to switch between sources and to change how an individual source is processed, which seems inevitable given the evolving requirements.

ronaldtse commented 2 years ago

I thought datatracker pings us, not the other way around?

Datatracker will ping us when a new draft is submitted:

If datatracker pings us with the particular draft name, we can selectively update the cache with that entry.
If datatracker pings us without any data, we can load up draft data from some archive, or just flush the cache between the "bulk load" and the current time.

The behavior here is not yet determined; we are waiting for @rjsparks on the details.

strogonoff commented 2 years ago

@ronaldtse

OK, so Datatracker pings us in the end.

Currently, we have API endpoints that allow reindexing particular refs in a dataset, and also an endpoint that allows flushing index for a dataset.

The problem arises if we use GHA-generated bibxml-data dataset as the source. If Datatracker pings us to flush index, we can fall back to proxy requests to Datatracker and return newest data. However, any dataset can be rebuilt from bibxml-data at any moment by another request, and if that happens and bibxml-data hadn’t yet been regenerated using GHA then BibXML service will regress to returning stale data.

I believe to avoid this race condition we should either 1) make Datatracker ping cause an immediate rebuild of bibxml-data-ids source from GHA (there will still be a race, but less probable), or 2) remove the dependency on GHA and index the ids data directly from the source (wherever that would be—I presume somewhere in Datatracker).

I think option (2) is a bit more elegant and reliable. I’m finalizing a configurable source design to deal with more complex source situation. It should allow specifying different chains of sources for datasets (which provides a generalized foundation for required fallback behavior flexibility), and also force a nice encapsulation of different source indexing & retrieval logic (which facilitates option (2)).
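
To illustrate what "chains of sources" per dataset might mean, here is a small sketch. It is not the design being finalized above: `Source`, `make_chain`, and the example dataset names and data are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# A source takes a reference ID and returns XML, or None if it has no answer.
Source = Callable[[str], Optional[str]]


def make_chain(sources: List[Source]) -> Source:
    """Combine sources so that the first one with an answer wins."""

    def retrieve(ref: str) -> Optional[str]:
        for source in sources:
            result = source(ref)
            if result is not None:
                return result
        return None

    return retrieve


# Hypothetical configuration: each dataset gets its own ordered chain.
live_ids = {"draft-example-foo-01": "<live/>"}
snapshot_ids = {"draft-example-foo-00": "<snapshot/>"}
DATASET_CHAINS: Dict[str, Source] = {
    "ids": make_chain([live_ids.get, snapshot_ids.get]),
}
```

Because every source has the same signature, swapping the order of a chain or replacing one source does not affect the others, which is the encapsulation benefit mentioned above.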

rjsparks commented 2 years ago

I was expecting that the datatracker would tell you when a new version of a draft is available - it can either give you the bibxml for that draft during that push, or you can pull it from the standard bibxml3 endpoint at the datatracker.

I was not expecting there to be an API that would flush parts of the cache by time range, nor to be able to say "rebuild time range X-Y" through the API. If you think that's needed, please provide an argument for it.

An administrative ability to provide a bulk (re)-load would be nice to have.

strogonoff commented 2 years ago

API doesn’t allow partial cache reset per se, but one could think of “reindex metadata for citation X in dataset Y” as doing something akin to clearing part of citation index + updating the source.

ronaldtse commented 2 years ago

@strogonoff

However, any dataset can be rebuilt from bibxml-data at any moment by another request, and if that happens and bibxml-data hadn’t yet been regenerated using GHA then BibXML service will regress to returning stale data

This does not actually happen:

In the Internet-Drafts dataset, there are two types of identifiers
- (a) a pointer to the latest draft version: draft-xxx
- (b) a pointer to a specific draft version: draft-xxx-NN where NN is a sequential number.
When Datatracker tells the BibXML service exactly what is new, there can only be one type of "new thing":
- a new draft "draft-xxx-{N+1}" has been published
- This means that the BibXML service needs to do only this:
  - if it has a cache of draft-xxx, drop it, and for this path proxy from Datatracker on the next request.
When someone (Datatracker, admin) says "flush from time X", then:
- we only need to flush all the "draft-xxx" entries, so that the next request to "draft-xxx" will be proxied to Datatracker to find out "what was updated"
- if someone accesses "draft-xxx-{N+1}" (assuming that this is the new item), we will need to proxy it to Datatracker regardless of whether someone has flushed the cache.

i.e.

Data matching the draft-xxx-NN pattern is persistent; it does not change. We only need to fetch from Datatracker when asked for draft-xxx-MM, where MM is larger than the highest version NN in our static data. Once we have the MM version, it no longer changes either, so we can cache draft-xxx-MM for as long as we wish.
The draft-xxx pattern is not persistent and is solely controlled by Datatracker. We might not even want to cache the draft-xxx pattern for long, maybe 10 mins or less. (imagine someone uploading successive drafts quickly (I've done that)).
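
The distinction between the two identifier types can be sketched as below. The helper names and the regular expression are illustrative assumptions. The 600-second value reflects the "10 mins or less" suggestion above. Note the ambiguity that a draft name ending in a two-digit segment would be read as a versioned reference.

```python
import re
from typing import Dict, Optional, Tuple

# Name is matched lazily so that a trailing "-NN" is taken as the version.
_DRAFT_RE = re.compile(r"(?P<name>draft-[a-z0-9-]+?)(?:-(?P<rev>\d{2}))?")

ALIAS_TTL_SECONDS = 600  # unversioned alias: short-lived ("10 mins or less")


def split_draft_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split 'draft-xxx-NN' into (name, 'NN'); rev is None for 'draft-xxx'."""
    match = _DRAFT_RE.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a draft reference: {ref!r}")
    return match.group("name"), match.group("rev")


def cache_ttl(ref: str) -> Optional[int]:
    """Return the cache lifetime in seconds; None means 'never expires'."""
    _, rev = split_draft_ref(ref)
    # Versioned entries are immutable; only the alias needs to expire.
    return None if rev is not None else ALIAS_TTL_SECONDS


def on_new_version(cache: Dict[str, str], ref: str) -> None:
    """Handle 'draft-xxx-{N+1} has been published': drop only the alias."""
    name, _ = split_draft_ref(ref)
    cache.pop(name, None)
```

With this split, a notification about a new version never touches cached versioned entries; it only forces the next request for the alias to go back to Datatracker.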

@rjsparks with regards to:

I was expecting that the datatracker would tell you when a new version of a draft is available - it can either give you the bibxml for that draft during that push, or you can pull it from the standard bibxml3 endpoint at the datatracker.

This is excellent and we are aligned. The part where we are not yet aligned is whether we have an incremental backfill that represents complete data.

If we rely solely on event notifications (push or pull) without a backfill, we will run into a potential synchronisation issue whenever the BibXML service somehow misses a notification from Datatracker that a new version of a particular draft has been added.

i.e. I can have a "data hole" that is not filled in the continuous timeline of Datatracker updates. This data hole presents an issue in search, e.g. how come I can find draft 03 and 05 but not 04?

This is the reason we wish to have a consistent back fill that is always complete, so that from t=0 to t=now-X we have complete data, and from t=now-X to t=now we have cached data.
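
A "data hole" of the kind described (03 and 05 present, 04 missing) is straightforward to detect if versions are assumed to start at 00 and increase by one. The function below is a hypothetical check, not an existing service feature.

```python
from typing import Iterable, List


def missing_revisions(known: Iterable[int]) -> List[int]:
    """Return revision numbers absent between 0 and the highest known one."""
    present = set(known)
    if not present:
        return []
    # Assumes revisions are sequential integers starting at 0 (the -00 draft).
    return [rev for rev in range(max(present) + 1) if rev not in present]
```

For example, `missing_revisions([0, 1, 2, 3, 5])` reports revision 4 as missing. A reported gap is only a hint that a notification may have been lost; it would still need to be confirmed against Datatracker.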

An administrative ability to provide a bulk (re)-load would be nice to have.

Yes, this is as answered by @strogonoff -- that functionality is already provided now (except for DOI, and the I-Ds issue under discussion).

strogonoff commented 2 years ago

@ronaldtse Do I understand this correctly?

1. I-Ds are immutable, so we don’t have a case of “updated citation”
2. There is a requirement to support non-numbered “alias” path, which would duplicate the response for “draft with the latest number”
3. BibXML responses are time-sensitive, and need to be near real-time

It seems like (2) is a special case for I-Ds, and I’m trying to find out how to deal with this generically.

Question on that: does citation metadata vary across draft versions, or only the draft number?

If nothing changes (same author, same keywords, same status, etc.) except for draft number, maybe we don’t need to treat each draft as a separate full-blown standard reference for citation purposes?
And if metadata changes, can’t consumers find out the specific draft version they want to reference (e.g., via Datatracker) and take that to BibXML? Presumably, if changes between draft versions are substantial, one doesn’t want to reference a draft without a particular version.

For (3), can’t say I anticipated in the architecture that BibXML service should have such short reaction times, I recall confirming originally that standard metadata can appear with hours of delay. If this is a fundamental requirement, I want to take it into account and rethink our design where necessary, or we’ll end up with contrived mechanisms and special cases compensating for what would be architectural shortcoming.

ronaldtse commented 2 years ago

I-Ds are immutable, so we don’t have a case of “updated citation”

Yes. Individual drafts are versioned and therefore immutable. The only way to change a draft (and its bibliographic information) is to upload a new version of the draft.

There is a requirement to support non-numbered “alias” path, which would duplicate the response for “draft with the latest number”

Correct.

BibXML responses are time-sensitive, and need to be near real-time

BibXML responses FOR Internet-Drafts are time-sensitive because users may need to cite a newly published draft within a short period of time, and I believe IETF wishes to have the Datatracker data available on the BibXML service with minimal delay since the services are operated by the same entity.

It seems like (2) is a special case for I-Ds, and I’m trying to find out how to deal with this generically.

Yes, great.

Question on that: does citation metadata vary across draft versions, or only the draft number?

As mentioned, the only way to change citation metadata is through a new version, so yes. The draft number is a sequential integer so it varies.

And if metadata changes, can’t consumers find out the specific draft version they want to reference (e.g., via Datatracker) and take that to BibXML? Presumably, if changes between draft versions are substantial, one doesn’t want to reference a draft without a particular version.

This is similar to ISO, which distinguishes dated and undated references. A dated reference points to a particular version. An undated reference always refers to the "tip" version.

Different authors like to do different things:

some prefer to cite a particular version and only "upgrade" the citation after checking diffs
some prefer to always cite the latest copy

For (3), can’t say I anticipated in the architecture that BibXML service should have such short reaction times, I recall confirming originally that standard metadata can appear with hours of delay. If this is a fundamental requirement, I want to take it into account and rethink our design where necessary, or we’ll end up with contrived mechanisms and special cases compensating for what would be architectural shortcoming.

The requirement for Internet-Drafts has not changed, and this is what we knew and signed up for in the beginning.

I still don't see why there is any problem with I-Ds being updated and made available quickly: if the BibXML service gets pinged (with the updated information), we just insert an additional record into the database/cache and the search index. This is a common type of operation.

rjsparks commented 2 years ago

Some observations and some fine details. Hopefully these reduce confusion rather than increase it. I'll speak to the "normal" expectations and then explore a few edge cases at the end.

Given that we're shifting to a model where you are going to have a complete copy of the bibxml3 dataset (rather than the original concept where you built it over time by querying the datatracker for anything you did not have), when asked for a draft without a version number you can simply serve the highest version number you have for the draft.
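
Serving the highest known version for an unversioned request can be sketched as follows, assuming a hypothetical in-memory `store` keyed by draft name and then by integer revision.

```python
from typing import Mapping, Optional


def latest_revision_xml(
    name: str, store: Mapping[str, Mapping[int, str]]
) -> Optional[str]:
    """Return the bibxml of the highest revision held for a draft name."""
    revisions = store.get(name)
    if not revisions:
        return None
    # max() over the mapping iterates its integer revision keys.
    return revisions[max(revisions)]
```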

As already noted, old versions of drafts are not updated, only new versions are issued. When a new version is issued, the datatracker will notify this service. We need to work out whether it provides the details to you directly or if you fetch on notification. The expectation is that the service will have the details to serve about a new version of a draft in some small number of seconds, not minutes.

When I say "a new version of a draft" above, I'm also including the initial (-00) version of a draft when it is created.

The datatracker is the source of truth for the bibxml3 content. It is the place where new versions of drafts are submitted. If it is down, there simply are not new versions of drafts.

Edges:

When asked for version n+1 (or even n+m) of a draft when you only have data for up to version n, you could poll the datatracker to see if it has an answer for that version. Yes, this could be abused to create nuisance traffic, but we haven't seen that happen, and until we do, the benefit of just handing an eager contributor the information for the thing they posted seconds ago is worth the exposure. There are APIs at the datatracker you can use to ask what the most recent version is, but for this case, it would be less computation to just ask for the bibxml.
We do occasionally need to update the bibxml for older drafts, including versions that are not the most recent version of that older draft. This is not because the draft changed, but because there was an error in the original parsing of the draft. For drafts that are submitted only as ascii files (most of the drafts in history), authors are extracted using heuristics. Those heuristics occasionally fail, and later, sometimes years later, we get a request to fix up the extracted author set for a version of a draft. When this happens we will use the API you provide to tell you to update your copy of the bibxml for that particular version of that draft.
We have a very rare condition where a version of a draft is taken out of the archive. We will use the API to tell you to remove your copy of the bibxml for that draft version.
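
The edge cases above suggest a small set of operations. The sketch below is one possible shape and assumes a hypothetical store keyed by `(draft name, revision)`. The action names `"update"` and `"remove"` and both function names are invented for illustration, and no actual Datatracker or BibXML service API is implied.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, int]  # (draft name, revision number)


def apply_notification(
    store: Dict[Key, str], action: str, key: Key, xml: Optional[str] = None
) -> None:
    """Apply a Datatracker-initiated change to one draft version."""
    if action == "update":
        # Covers both a brand-new version and a corrected older one.
        if xml is None:
            raise ValueError("update requires the bibxml payload")
        store[key] = xml
    elif action == "remove":
        # Rare case: the version was taken out of the archive.
        store.pop(key, None)
    else:
        raise ValueError(f"unknown action: {action!r}")


def get_revision(
    store: Dict[Key, str],
    key: Key,
    poll_datatracker: Callable[[Key], Optional[str]],
) -> Optional[str]:
    """Serve a known version, or poll Datatracker for one not yet held."""
    if key in store:
        return store[key]
    xml = poll_datatracker(key)
    if xml is not None:
        store[key] = xml
    return xml
```

Treating a correction to an old version as the same "update" action as a new version keeps the interface small; whether the payload is pushed or fetched on notification is left open, as in the discussion.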

rjsparks commented 2 years ago

Question on that: does citation metadata vary across draft versions, or only the draft number?

As mentioned, the only way to change citation metadata is through a new version, so yes. The draft number is a sequential integer so it varies.

Specifically - almost all of the metadata can change between revisions. The draft's title can change. Authors are added or removed. The abstract can change. The stream can change. Pretty much the only thing that is guaranteed to be invariant is the draft name.

rjsparks commented 2 years ago

One other edge case: we may occasionally decide to change the content of a large subset (potentially all) of the bibxml3 database. For instance, at the moment, the datatracker's generated bibxml3 speaks only to one format of the document known to the datatracker - see https://datatracker.ietf.org/doc/bibxml3/draft-sparks-sipcore-multiple-reasons.xml and note that it only provides a link to the txt format of the draft. The datatracker knows about the xml format, and we may alter the bibxml3 generation to include a format tag for it, and all other drafts for which we have the xml. At that point we would need to use the administrative interface you've previously mentioned to trigger a bulk reload.

TonyLHansen commented 2 years ago

I read through Robert's comments above and agree with what he said.

One note though: I think there are a number of historic I-Ds that datatracker has no record of, but for which there exists a bibxml-id file, a copy of the I-D itself on the tools servers, or both. I can't quantify that number without doing a serious dig, but two out of the three older I-Ds that I just picked at random were missing from the datatracker.

There are 35k -00 I-Ds on dechaunac, and only about 20k bibxml -00 reference files. I'm curious what the number actually stored in the datatracker is.

An item on my longer term goals for bibxml-id was to mine the /www/tools.ietf.org/id directory to find and generate those missing references.

rjsparks commented 2 years ago

Those fixes are things to be done in the datatracker. They don't block completion of Ribose's current task.

TonyLHansen commented 2 years ago

At one level I agree, but the above definitely affects knowing when we're done with this stated goal:

The strategy IETF takes when migrating from xml2rfc tools to the new IETF BibXML service

What I take from the above is that Ribose by themselves cannot totally satisfy this goal.

rjsparks commented 2 years ago

I'm taking this conversation to email - it's not part of the project being tracked here.

TonyLHansen commented 2 years ago

sure

strogonoff commented 2 years ago

The strategy IETF takes when migrating from xml2rfc tools to the new IETF BibXML service

What I take from the above is that Ribose by themselves cannot totally satisfy this goal.

(For the sake of clarity, I was referring to API consumers that have been using the xml2rfc tools webserver. Those consumers would have to be updated, at least to point to the new domain name, likely to handle the newer citation format, and hopefully later to migrate to a new API that relies on authoritative document identifiers rather than legacy filenames. If Ribose does not directly update those consumers, then I wanted to highlight the possible need to coordinate and take the organization’s plans regarding those consumers into account.

If upgrading preexisting API consumers was in fact within the scope of Ribose’s task, then I did not realize that and my statement was in error.)

rjsparks commented 2 years ago

@strogonoff Tony's point was different - it was about some things that the existing bibxml3 knows about that the datatracker does not yet. There's no new requirement for Ribose's efforts there.

TonyLHansen commented 2 years ago

Agreed. There's no new Ribose requirement for my comment above.
