Open gstuder-ona opened 3 years ago
Hey @gstuder-ona, a not-full batch usually means there is no more data left at that endpoint. An alternative I would suggest is the Link header returned with the response: depending on the current page, it gives you the first, last, next, etc. page of data the application has. You could have a flow where you follow the next link until it is no longer there, or something of the sort.
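For illustration, a minimal sketch of that flow in Python. It assumes the header uses the common `<url>; rel="name"` Link syntax. `fetch` is a hypothetical callable (not part of onadata or any client library) that returns the response body together with the raw Link header value, or None when the header is absent.

```python
import re
from typing import Callable, Dict, Iterator, Optional, Tuple

# Matches one `<url>; rel="name"` entry of a Link header.
_LINK_RE = re.compile(r'<([^>]*)>\s*;[^,]*?\brel="?([^";,]+)"?')


def parse_link_header(value: str) -> Dict[str, str]:
    """Map each rel name (for example "next") to its URL."""
    return {rel.strip(): url for url, rel in _LINK_RE.findall(value or "")}


def iter_pages(
    fetch: Callable[[str], Tuple[str, Optional[str]]],  # hypothetical fetcher
    first_url: str,
) -> Iterator[str]:
    """Yield each page body, following rel="next" until it is absent."""
    url: Optional[str] = first_url
    while url:
        body, link_header = fetch(url)
        yield body
        # No "next" entry (or no header at all) ends the iteration.
        url = parse_link_header(link_header or "").get("next")
```

With this shape the batch size never enters the stop condition; only a missing next link does.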
sometimes the full batch size is not returned even though more data exists.
Weird that the pagination doesn't return all the data by the end of the page in some scenarios. I'll need to look into this more to come up with a solution, since it doesn't seem easy to replicate.
@DavisRayM we can use the link header if that's the more reliable source - I'd be a little worried that something that interrupted the batch size might also break that header?
@DavisRayM - the xml endpoint doesn't seem to include the Link header? Is that possible to add to the paginated XML API?
We haven't seen this issue recur in a significant amount of new testing we've done so for now the bug is kind of in limbo - but it'd be nice to add the "Link" header and give ourselves an out when it does recur.
Thinking more, I'm not sure the query in the API is at a high enough level for the OnaData app to know if the results were truncated - but I'm not a Django expert, someone who is may be able to answer quickly.
Our fallback position is that any data returned indicates there could be more data; the only end-of-data marker we recognize is an empty batch. This is fine, but a little inefficient and complex to reason about, as it interacts badly with the OnaData timestamp versioning (we can't really trust the very last timestamps until we're sure the server has processed everything).
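A rough sketch of that fallback loop in Python, under stated assumptions: `fetch_batch` is a hypothetical callable that takes an opaque cursor and returns the records plus the cursor for the next request. How the cursor is derived (for example from `_date_modified`) is deliberately left out.

```python
from typing import Callable, Iterator, List, Tuple, TypeVar

R = TypeVar("R")  # record type
C = TypeVar("C")  # opaque cursor type


def drain(fetch_batch: Callable[[C], Tuple[List[R], C]], start: C) -> Iterator[R]:
    """Yield records until an empty batch arrives, ignoring short batches."""
    cursor = start
    while True:
        records, next_cursor = fetch_batch(cursor)
        if not records:
            return  # an empty batch is the only end-of-data marker
        yield from records
        cursor = next_cursor
```

The cost is one extra request at the end of every run, which is the inefficiency mentioned above.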
We haven't seen this happen again, so we can't really verify anything further. Keeping this in the icebox for now.
Closing this; please re-open if it occurs again.
This issue still recurs: the batch returned is truncated. For instance, when the pipeline tries to fetch the following batch:
2022-08-02 19:15:55,120 [WithTombstones-pool-43-thread-1] INFO com.onaio.beam.etl.WithTombstones - Fetching next batch: queryType=XformLastModifiedGt queryAt=2022-04-20T12:20:06.900Z batchSize=500
https://api.ona.io/api/v1/data/635635.xml?query=%7B%22_date_modified%22:%7B%22$gte%22:%222022-04-20T12:20:06.901%22%7D%7D&limit=500&sort=_date_modified
Status Code: 200
We get a 200 status code, but the batch returned is missing all the records. We simply get the following truncated response:
<?xml version="1.0" encoding="utf-8"?>
<submission-batch serverTime="2022-08-02T16:16:04.112805+00:00">
The initial assumption was that such a response is only missing the closing tag, and hence the batch is empty. But this is not the case. If the URL above is tested elsewhere, several submissions are returned.
Note: This happens once in a while. Most of the time, the pipeline receives complete and valid responses. Nonetheless, when it does occur the effect is severe: the pipeline deletes records in our canopy db because it assumes they have been deleted in onadata. We can update our pipeline to retry fetching the batch X times until it receives a complete batch, and possibly crash if the problem persists. But it would still be useful if this could be investigated/fixed on the onadata side.
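The retry idea could look roughly like the Python sketch below. It treats a response as complete only if it parses as well-formed XML, so a body that stops after the opening `<submission-batch ...>` tag is retried rather than read as an empty batch. `fetch`, `TruncatedBatchError`, `attempts` and `delay_s` are illustrative names, not part of our pipeline or of onadata.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


class TruncatedBatchError(Exception):
    """No well-formed batch was received within the allowed attempts."""


def fetch_complete_batch(
    fetch: Callable[[], bytes],  # hypothetical: performs one HTTP request
    attempts: int = 3,
    delay_s: float = 1.0,
) -> List[ET.Element]:
    """Return the children of the batch root, retrying on truncated XML."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # fromstring raises ParseError if the closing tag never arrives.
            return list(ET.fromstring(fetch()))
        except ET.ParseError as exc:
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay_s)
    # Fail loudly instead of letting the caller treat this as "no records".
    raise TruncatedBatchError(
        f"no well-formed batch after {attempts} attempts"
    ) from last_error
```

Raising instead of returning an empty list is the point of the sketch: the delete logic should never run on a batch that did not parse.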
Environmental Information
Problem description
When paginating XForm data from the new endpoint in an ETL, sometimes the full batch size is not returned even though more data exists. Right now, we're using a not-full batch as an indicator of end-of-data - it's not clear how we would detect end-of-data otherwise.
Expected behavior
Batches should be full unless no more data exists from paginated API.
Steps to reproduce the behavior
It's rare - but you can see the query here:
Additional Information
How is a client of the paginated XForm endpoint meant to detect end-of-data? Do we need to wait for a completely empty batch?