Incomplete batch returned before end-of-data when paginating xform submissions (stage-api.ona.io)

gstuder-ona commented 3 years ago

Environmental Information

Onadata version: stage-api.ona.io

Problem description

When paginating XForm data from the new endpoint in an ETL, sometimes the full batch size is not returned even though more data exists. Right now, we're using a not-full batch as an indicator of end-of-data - it's not clear how we would detect end-of-data otherwise.

Expected behavior

Batches should be full unless no more data exists from paginated API.

Steps to reproduce the behavior

It's rare - but you can see the query here:

[Scanning current data to see if anything was deleted]
2021-04-12 10:30:31,450 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Continuing stale change scan: maxReadVersion=1616699971116 maxChangeVersion=null numStale=0 numRestore
d=0
2021-04-12 10:30:31,461 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  - Fetching next batch: lastModifiedAt=2021-03-25T19:44:10.601Z isEob=true batchSizeHint=500
2021-04-12 10:30:31,461 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Scan reader not ready, yielding: maxReadVersion=1616701449790 maxChangeVersion=null numStale=0 numRestored=0 medianAsyncYieldDuration=PT5.433S
2021-04-12 10:30:31,461 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Waiting for next action: nextAction=continueStaleScan nextWakeAt=2021-04-12T10:30:36.883Z
2021-04-12 10:30:31,461 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  -   Fetching next batch: queryType=XformLastModifiedGt queryAt=2021-03-25T19:44:10.601Z batchSize=500
[Fetched a batch of size 500 xform submissions (truncated to 499 since we need to re-request the last timestamp)]
2021-04-12 10:30:35,496 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  -   Fetched batch: batchSize=500 batchStartId=92546 batchStartAt=2021-03-25T19:44:11.444Z batchEndId=93045 batchEndAt=2021-03-25T19:52:03.985Z
2021-04-12 10:30:35,497 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  - Fetched batch: batchSize=499 batchStartId=92546 batchStartAt=2021-03-25T19:44:11.444Z batchEndId=93044 batchEndAt=2021-03-25T19:52:02.976Z isEos=false
[Continuing...]
2021-04-12 10:30:36,951 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Continuing stale change scan: maxReadVersion=1616701449790 maxChangeVersion=null numStale=0 numRestored=0
2021-04-12 10:30:36,951 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Reader stats: lastModifiedAt=2021-03-25T19:44:11.444Z isEob=false isEos=false currentFormId=92546 numSubmissions=1498
2021-04-12 10:30:36,961 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Scan reader not ready, yielding: maxReadVersion=1616701922144 maxChangeVersion=null numStale=0 numRestored=0 medianAsyncYieldDuration=PT5.676S
2021-04-12 10:30:36,961 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  - Fetching next batch: lastModifiedAt=2021-03-25T19:52:02.976Z isEob=true batchSizeHint=500
2021-04-12 10:30:36,961 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Waiting for next action: nextAction=continueStaleScan nextWakeAt=2021-04-12T10:30:42.627Z
2021-04-12 10:30:36,961 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  -   Fetching next batch: queryType=XformLastModifiedGt queryAt=2021-03-25T19:52:02.976Z batchSize=500
[Got batch of size 75, assumed end-of-stream]
2021-04-12 10:30:38,497 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  -   Fetched batch: batchSize=75 batchStartId=93045 batchStartAt=2021-03-25T19:52:03.985Z batchEndId=93119 batchEndAt=2021-03-25T19:54:44.482Z
2021-04-12 10:30:38,497 [WithTombstones-pool-32-thread-1] INFO com.onaio.beam.etl.WithTombstones  - Fetched batch: batchSize=75 batchStartId=93045 batchStartAt=2021-03-25T19:52:03.985Z batchEndId=93119 batchEndAt=2021-03-25T19:54:44.482Z isEos=true
2021-04-12 10:30:42,785 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Continuing stale change scan: maxReadVersion=1616701922144 maxChangeVersion=null numStale=0 numRestored=0
2021-04-12 10:30:42,787 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - No more data in scan reader, marking final changes as stale
2021-04-12 10:30:42,792 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Finished stale change scan: maxReadVersion=1616702084482 maxChangeVersion=1616705874911 numStale=161 numRestored=0
2021-04-12 10:30:42,792 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Waiting for next action: nextAction=startTombstoneScan nextWakeAt=2021-04-12T10:30:42.785Z
2021-04-12 10:30:42,792 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  - Starting tombstone scan: maxChangeVersion=1616705874911 medianSyncReadSize=499
[Started deleting stuff since we assume no more data is coming]
2021-04-12 10:30:42,792 [Samza StreamProcessor Container Thread-0] INFO com.onaio.beam.etl.WithTombstones  -   Creating tombstone for stale value in change stream: id=96540 version=1616705742804 ts=2021-03-25T20:57:54.911Z
...

Additional Information

How is a client of the paginated XForm endpoint meant to detect end-of-data? Do we need to wait for a completely empty batch?

DavisRayM commented 3 years ago

Hey @gstuder-ona , a not-full batch list usually denotes that there may be no more data left from that endpoint; An alternative to it that I would suggest would be using the link header returned by the response.... that would give you the last/first/next e.t.c(depending on current page) page of data the application has(Could have a flow where you use the next link till it's no longer there or something of the sort)

sometimes the full batch size is not returned even though more data exists.

Weird that the pagination doesn't return all the data by the end of the page in some scenarios; I'll need to look more into this to come up with a solution since this doesn't seem to be easy to replicate

gstuder-ona commented 3 years ago

@DavisRayM we can use the link header if that's the more reliable source - I'd be a little worried that something that interrupted the batch size might also break that header?

gstuder-ona commented 3 years ago

@DavisRayM - the xml endpoint doesn't seem to include the Link header? Is that possible to add to the paginated XML API?

We haven't seen this issue recur in a significant amount of new testing we've done so for now the bug is kind of in limbo - but it'd be nice to add the "Link" header and give ourselves an out when it does recur.

gstuder-ona commented 3 years ago

Thinking more, I'm not sure the query in the API is at a high enough level for the OnaData app to know if the results were truncated - but I'm not a Django expert, someone who is may be able to answer quickly.

Our fallback position is that any data returned indicates there could be more data, the only end-of-data marker we recognize is an empty batch. This is fine, but a little inefficient and complex to reason about as it interacts badly with the OnaData timestamp versioning (we can't really trust the very last timestamps until we're sure the server has processed everything).

gstuder-ona commented 3 years ago

We haven't seen this happen again, so can't really verify anything further. Keeping in icebox for now

DavisRayM commented 3 years ago

Closing this please re-open if it occurs again

HusnaHariz commented 2 years ago

This issue still re-occurs where the batch returned is truncated. For instance when the pipeline tries to fetch the following batch

2022-08-02 19:15:55,120 [WithTombstones-pool-43-thread-1] INFO com.onaio.beam.etl.WithTombstones  -   Fetching next batch: queryType=XformLastModifiedGt queryAt=2022-04-20T12:20:06.900Z batchSize=500
https://api.ona.io/api/v1/data/635635.xml?query=%7B%22_date_modified%22:%7B%22$gte%22:%222022-04-20T12:20:06.901%22%7D%7D&limit=500&sort=_date_modified
Status Code: 200

We get a 200 status code but the batch returned is missing all the records. We simply get the following truncated response

<?xml version="1.0" encoding="utf-8"?>
<submission-batch serverTime="2022-08-02T16:16:04.112805+00:00">

The initial assumption was that such a response is only missing the closing tag, and hence the batch is empty. But this is not the case. If the URL above is tested elsewhere, several submissions are returned.

Note: This happens once in a while. Most of the times, the pipeline receives complete and valid responses. Nonetheless, when it does occur it has some severe effect in that it causes the pipeline to delete records in our canopy db as it assumes the records have been deleted in onadata. We can update our pipeline to retry fetching the batch X times then possibly crash if it still persists or until it receives a complete batch. But it will still be useful if it can be investigated/fixed on the onadata side.

onaio / onadata