sync-for-science / tracking

Issue tracking for S4S reference stack development
0 stars 0 forks source link

Handle large FHIR payloads better in Test Suite #132

Closed mtmcduffie closed 6 years ago

mtmcduffie commented 6 years ago

Good example, Explanation of benefit

jmandel commented 6 years ago

Note: this applies to the CMS BlueButton data set which has very large EOB payload

erikwiffin commented 6 years ago

Just talked to @mtmcduffie and he suggested that "failing politely" would be an improvement over the current situation, and I like that a lot better than trying to intelligently trim the response if it's too big.

The absolute easiest thing to do here looks like setting a read timeout when fetching resources in utils.py and assuming that if it takes more than one second to download the resource, it's too big. Obviously that's a pretty sloppy metric for what we're trying to accomplish.

It looks like we could also limit the size of the response by making a streaming request. That would let us more accurately specify a response size limit, at the cost of being more complicated and I don't know if response.json still works.

mtmcduffie commented 6 years ago

Does Pagination show up in the Conformance/Metadata statement? Maybe if we say If Timeout + No Pagination = Fail?

erikwiffin commented 6 years ago

I don't think it does. Certainly not something defined in the S4S project requirements.

jmandel commented 6 years ago

Anybody know how reliably the response content length header is available?

erikwiffin commented 6 years ago

I have yet to actually see the large EOB payload, so I can't say for sure.

I think reading that header is probably insufficient anyways, since unless we either set a timeout, or stream the request, we're going to end up downloading the entire response, which puts a 43MB json blob into memory.

If we're fine with that, we don't need the content length header, since we can just check the length of the response of the blob, and cut it off at an arbitrary point.

jmandel commented 6 years ago

Totally agree that actually performing a short read is a good way to tell what's what! I should also note that the CMS team is planning to implement paging, so this particular issue should be an interim thing -- but it'll be good to beef up our defensive practices.

erikwiffin commented 6 years ago

Just pushed up a PR for this. I haven't tested it against CMS yet, but by messing with the size limit (currently 1MB) I've established that it works against the SMART demo server.

Using iter_content screws up the response object, so I ended up using a combination of approaches. First I check the content-length header if it's present, then I complete the download. This should fail early for well configured servers, but still give us a useful error message for others.

Also, propagated skip_reason to features, and am displaying that in the test results now. This is previously a piece of information that was just getting lost unless you had access to the logs, so it might be helpful for future debugging.

image