Duplicate package identifiers (packageId) in json paged results

uwieske commented 4 years ago

Just to let you know that there are duplicates in the set of CHRG packages. I found the package with packageId='CHRG-115hhrg27368' more than once in my set, which is the union of the paged results over all fetched pages. I presume this one package is present in multiple paged results.

I assume this is not correct behaviour, since packageId is considered a unique property of the model (resource).

I solved this in my code by removing all duplicates, but decided to inform you because it concerns the data integrity of your repository.

uwieske commented 4 years ago

An example of one of my requests: Fetching https://api.govinfo.gov/collections/CHRG/2019-01-01T00:00:00Z?pageSize=100&offset=0&api_key=

I traverse over all pages.
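
A minimal sketch of that traversal with the de-duplication workaround, assuming each page is a JSON object with a `packages` list and a `nextPage` URL. `fetch_page` is a hypothetical callable (for example, a thin wrapper around an HTTP client that adds the API key); it is not part of the govinfo API.

```python
from typing import Callable, Iterator, Optional


def iter_unique_packages(first_url: str,
                         fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Follow nextPage links and yield each packageId only once."""
    seen = set()
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)  # hypothetical: returns the decoded JSON page
        for package in page.get("packages", []):
            package_id = package["packageId"]
            if package_id in seen:
                continue  # duplicate that surfaced again on a later page
            seen.add(package_id)
            yield package
        url = page.get("nextPage")  # a missing or empty value ends the crawl
```

Note that this only hides the duplicates; it cannot tell whether another package was skipped in their place.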

jonquandt commented 4 years ago

Hi, @uwieske. Thanks for reporting this. You're right that packageId is a unique value within the system.

I'm trying to recreate now.

I found that package in the following call (modified from yours to include only packages from the 115th Congress): https://api.govinfo.gov/collections/CHRG/2019-01-01T00:00:00Z?pageSize=100&offset=1200&congress=115 - result 93 on this page (so 793 overall).

I continued to iterate through the request using the values supplied by nextPage and did not see another instance of this package appearing.

Are you using nextPage (or equivalent requests with iterating offset values) to go through the collections response?

In general, there is a possibility that if new packages are being added/updated, it could move a particular package within the query from one page to a subsequent page. There were 196 packages added or updated on 3/26.

https://api.govinfo.gov/collections/CHRG/2020-03-26T00:00:00Z/2020-03-27T00:00:00Z?pageSize=100&offset=0&api_key=DEMO_KEY

uwieske commented 4 years ago

@jonquandt I am using the nextPage URL provided in my response object. So I traverse the data set by following the nextPage of each response object I get, consecutively. I think that the error was near offset 5999 / 6000, based on the URL given in my initial issue text.

jonquandt commented 4 years ago

@uwieske - are you still seeing this issue? I have been unable to recreate

jonquandt commented 3 years ago

Closing as I have not been able to recreate

smsagan commented 3 years ago

@jonquandt I am experiencing the same problem @uwieske did but within the BILLSTATUS collection. I have created a minimal testcase that demonstrates the problem. I have no trouble reproducing this with your https://api.govinfo.gov/docs/ -- the duplication happens more than half the time for me.

There are ("count":) 129 packages in BILLSTATUS lastModified before or on 2021-01-15T12:46:50Z. I want to see them, so I make two requests: one to offset 0, and one to nextPage, which is offset 100. I attach the results of both:

attachment api_0.txt from https://api.govinfo.gov/collections/BILLSTATUS/2019-01-01T00:00:00Z/2021-01-15T12:46:50Z?offset=0&pageSize=100
attachment api_100.txt from https://api.govinfo.gov/collections/BILLSTATUS/2019-01-01T00:00:00Z/2021-01-15T12:46:50Z?offset=100&pageSize=100

There are 100 records in api_0.txt and 29 in api_100.txt. A perfect 129, but there is one duplicate: in this example BILLSTATUS-108hconres163 is duplicated in both results. What's the consequence of this? I can remove duplicates while iterating through your collection results page by page, BUT there is a deeper problem. There are 129 packages within the datetime range I requested, but the two pages only provide 128 unique packages. There is 1 missing!

I can recover the missing package by backtracking. If I pull from offset=90 (attachment api_90.txt), I receive 39 packages (as expected), and after I remove all the expected duplicates, I end up with the 29 packages the offset=100 pull should have returned. BILLSTATUS-108hconres171 was the missing package when using offset=0 and offset=100.
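
The backtracking idea can be sketched as two small helpers: one that generates overlapping offsets (0 and 90 for the 129-package example) and one that compares the reported `count` with the number of unique IDs collected. Both function names and the default overlap of 10 are assumptions for illustration; an overlap only helps when the shuffled packages fall inside the overlapping window.

```python
def overlapping_offsets(count, page_size=100, overlap=10):
    """Offsets whose page windows overlap by `overlap` records (hypothetical helper)."""
    if not 0 <= overlap < page_size:
        raise ValueError("overlap must be in [0, page_size)")
    offsets, offset = [], 0
    while True:
        offsets.append(offset)
        if offset + page_size >= count:
            return offsets  # this window already reaches the last record
        offset += page_size - overlap


def missing_count(reported_count, packages):
    """How many packages the crawl still lacks, judged by unique packageId."""
    unique_ids = {p["packageId"] for p in packages}
    return reported_count - len(unique_ids)


print(overlapping_offsets(129))  # [0, 90]
```

A non-zero `missing_count` after a full crawl is the signal that a package was skipped, even when duplicates have already been removed.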

api_0.txt api_90.txt api_100.txt

jonquandt commented 3 years ago

@smsagan thanks for bringing this to our attention and including some useful details. I am going to investigate further.

smsagan commented 3 years ago

Thank you @jonquandt - please let me know if I can provide anything further for your investigation.

One thing that I have observed: all duplicate and missing packages are from the exact same lastModified timestamp as the last package of the initial page and first package of the nextPage. In the example I sent, the timestamp of the last package of api_0.txt, first of api_100.txt, and both the duplicate and missing packages was 2021-01-15T12:46:27Z. Hope this helps!
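
A small sketch of how that observation could be checked on two saved pages. It assumes each page has been reduced to a list of dicts with `packageId` and `lastModified` keys; the function name is hypothetical.

```python
def duplicates_at_boundary(first_page, next_page):
    """Return (boundary_timestamp, [(packageId, shares_boundary_timestamp), ...])."""
    if not first_page:
        return None, []
    # lastModified of the last package on the first page
    boundary = first_page[-1]["lastModified"]
    first_ids = {p["packageId"] for p in first_page}
    report = [
        (p["packageId"], p["lastModified"] == boundary)
        for p in next_page
        if p["packageId"] in first_ids  # package already seen on the first page
    ]
    return boundary, report
```

If every reported duplicate carries `True`, the duplicates are confined to the timestamp that straddles the page boundary.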

jonquandt commented 3 years ago

@smsagan - thanks again for this. We have a potential solution in mind.

Currently the collections endpoint returns the results sorted by lastModified descending. As you know, lastModified time is stored only at the second level. For some collections, like USCOURTS and BILLSTATUS, we often publish many packages within a short timespan. This means that there is the opportunity for several BILLSTATUS packages to have the same lastModified time. In cases where this occurs, the results are returned in essentially a random order.

An alternative that we are looking at is switching the sort of results to be based on the packageId. Our packageId values are unique to only a single piece of content, so this should mean that duplicates don't appear within paginated calls for a given time.
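
To illustrate why a non-unique sort key breaks offset pagination, here is a toy simulation (not the govinfo implementation). Four made-up records are paged two at a time; the two requests see the tied records in a different underlying order, which stands in for the "essentially random" tie order described above.

```python
def get_page(records, offset, page_size, sort_key, descending=False):
    """Sort, then slice. Python's sort is stable, so ties keep their input order."""
    ordered = sorted(records, key=sort_key, reverse=descending)
    return [r["packageId"] for r in ordered[offset:offset + page_size]]


def by_time(record):
    return record["lastModified"]


def by_id(record):
    return record["packageId"]


a = {"packageId": "A", "lastModified": "2021-01-15T12:46:50Z"}
b = {"packageId": "B", "lastModified": "2021-01-15T12:46:27Z"}  # tied with C
c = {"packageId": "C", "lastModified": "2021-01-15T12:46:27Z"}
d = {"packageId": "D", "lastModified": "2021-01-15T12:46:00Z"}

# The two requests encounter the tied records (B, C) in different order.
request_1 = [a, b, c, d]
request_2 = [a, c, b, d]

time_pages = (get_page(request_1, 0, 2, by_time, descending=True)
              + get_page(request_2, 2, 2, by_time, descending=True))
id_pages = get_page(request_1, 0, 2, by_id) + get_page(request_2, 2, 2, by_id)

print(time_pages)  # ['A', 'B', 'B', 'D']: B is duplicated and C is missing
print(id_pages)    # ['A', 'B', 'C', 'D']
```

Sorting on a unique key gives every record a fixed position, so each offset always maps to the same record as long as the underlying set does not change between requests.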

The potential downside to this is that it may make it more difficult to narrow down a collections request in the case of a large republishing/reprocessing activity that generates a large number of results in the API. But that's something less important than ensuring that no duplicates appear OR that items aren't missed during crawls.

smsagan commented 3 years ago

@jonquandt I understand your explanation - thank you for that!

In your proposal, where results are returned in order of packageId rather than lastModified, how would we users of your BILLSTATUS collection maintain the latest edition of every package? I.e., if you update a package today, how will I know you did so, such that I know to pull the data from that package again?

jonquandt commented 3 years ago

@smsagan - great question. Here is a nominal flow:

1. At x time, end user script queries the collections endpoint for the BILLSTATUS collection (or any other collection) with a lastModifiedStartDate parameter of y (where y is some time before x), including a total count of results
2. API returns list of packages that have been added or updated since y, according to the lastModified value of each package
3. end user script retrieves content/metadata/summaries for BILLSTATUS packages within the collections response, iterating through all pages necessary to meet the full count
4. At z time, end user script queries the collections endpoint for the BILLSTATUS collection (or any other collection) with a lastModifiedStartDate parameter of x+1 second
5. API returns all packages that have been modified since x+1 second (which should be a list of all packages that have been modified after the original call in 1 above)
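
The flow above can be sketched roughly as follows. This is an illustration under assumptions, not an official client: `fetch_page` is a hypothetical callable that performs the HTTP request (including the API key) and returns the decoded JSON, and the `packages`, `count`, and `nextPage` field names are assumed from the responses discussed in this thread.

```python
from datetime import datetime, timedelta, timezone

BASE_URL = "https://api.govinfo.gov/collections"


def to_api_time(moment):
    """Format an aware datetime as the second-level UTC string used in the URLs."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def sync_collection(collection, start, query_time, fetch_page):
    """One sync pass; `start` and `query_time` are timezone-aware datetimes.

    Returns (packages_by_id, complete, next_start).
    """
    url = f"{BASE_URL}/{collection}/{to_api_time(start)}?offset=0&pageSize=100"
    packages, expected = {}, None
    while url:
        page = fetch_page(url)
        if expected is None:
            expected = page.get("count")  # total reported for this query
        for package in page.get("packages", []):
            packages[package["packageId"]] = package  # keyed by ID, so no duplicates
        url = page.get("nextPage")
    # If fewer unique packages than reported arrived, the crawl missed something.
    complete = expected is None or len(packages) >= expected
    # The next pass starts one second after this query time (x+1 second).
    next_start = query_time + timedelta(seconds=1)
    return packages, complete, next_start
```

A caller would persist `next_start` and pass it as `start` on the following run, retrying or backtracking whenever `complete` is `False`.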

If you were to do something like this, you could force limits by querying one 24-hour period at a time. For example, at 00:00:01Z on 3/2 you would check all of 2021-03-01: https://api.govinfo.gov/collections/BILLSTATUS/2021-03-01T00:00:00Z/2021-03-01T23:59:59Z?offset=0&pageSize=100&api_key=DEMO_KEY and then the next day, update the parameters to get the previous day: https://api.govinfo.gov/collections/BILLSTATUS/2021-03-02T00:00:00Z/2021-03-02T23:59:59Z?offset=0&pageSize=100&api_key=DEMO_KEY
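
A small helper for building these daily-window URLs might look like the following. The function name is hypothetical; the URL shape is copied from the two examples above.

```python
from datetime import date


def daily_window_url(collection, day, page_size=100, api_key="DEMO_KEY"):
    """Build a collections URL covering one full UTC day (hypothetical helper)."""
    start = f"{day.isoformat()}T00:00:00Z"
    end = f"{day.isoformat()}T23:59:59Z"
    return (
        f"https://api.govinfo.gov/collections/{collection}/{start}/{end}"
        f"?offset=0&pageSize={page_size}&api_key={api_key}"
    )


print(daily_window_url("BILLSTATUS", date(2021, 3, 1)))
# https://api.govinfo.gov/collections/BILLSTATUS/2021-03-01T00:00:00Z/2021-03-01T23:59:59Z?offset=0&pageSize=100&api_key=DEMO_KEY
```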

The collections service is intended to show only results that have been changed since the lastModifiedStartDate (or within a specific range, if you include the lastModifiedEndDate parameter as well).

The published endpoint would do something similar, but the published endpoint is keyed on the dateIssued values for a given package.

The difference:

- lastModified time: when something was added or updated on govinfo
- dateIssued: the actual date of issue, often within the document itself -- e.g. the date of a Federal Register issue, or a date for a USCOURTS opinion

Alternatively, you could theoretically store a list of all packages that you have ever retrieved and then do individual package summary requests against them to check the lastModified value within the summary, but this becomes less efficient as your set increases beyond a small number.
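
That alternative could be sketched as below. `fetch_summary` is a hypothetical callable that returns the decoded package summary for a given packageId, and `stored` is assumed to map each packageId to the lastModified string recorded at the last retrieval.

```python
def find_stale(stored, fetch_summary):
    """Return the packageIds whose summary lastModified differs from the stored one."""
    stale = []
    for package_id, seen_last_modified in stored.items():
        summary = fetch_summary(package_id)  # one request per stored package
        if summary["lastModified"] != seen_last_modified:
            stale.append(package_id)
    return stale
```

The one-request-per-package loop is what makes this approach scale poorly compared with a single collections query.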

smsagan commented 3 years ago

Thank you @jonquandt. Let me check my understanding because I've been a bit confused -- what I am missing in your thorough description is how you intend for us to use offsets and packageId within the y to x and x+1s to z queries. If these two queries return more than 100 packages (your page size limit), our scripts will need to iterate through multiple pages using offset.

With the packages within these two queries ordered by lastModified, we observe this duplicate+missing package issue.

If I understand correctly, you are going to change the ordering of packages from lastModified to packageId, which should eliminate the duplicate+missing issue, since packageId is unique to a single package whereas lastModified was often shared by multiple packages. Am I on the right track?

jonquandt commented 3 years ago

@smsagan - yes, that's correct. Because the packageId is completely unique, there shouldn't be a collision.

In request 1, there could be a thousand results -- you would need to iterate through all the pages needed to get the full count of packages as part of the collection of data in step three. I'm modifying the comment above to clarify that.

smsagan commented 3 years ago

@jonquandt Thank you for the clarity. This sounds great! Do you have an expectation for when you'll be able to implement the change to ordering results by packageId?

I appreciate your thoroughness and attention to this issue -- it's been a pleasure working with you, sir.

jonquandt commented 3 years ago

@smsagan This is currently slated for our March release, so by the end of the month. I will follow up when this is in production.

jonquandt commented 3 years ago

This is now in production.

usgpo / api

Duplicate package identifiers (packageId) in json paged results #61