Closed: jodue closed this issue 4 years ago
This sounds awesome, I'm totally supportive.
Ps- it sounds like you have a sensible design in mind... but feel free to reach out if you want to bounce ideas off.
Great! We will think about the implementation and get back to you if we have any questions!
What about approaching this in a manner similar to MTOM/XOP as used with SOAP?
- Client detects when a native content element (picture, video, etc.) is larger than a specified threshold
  - Client replaces that element with a "pointer" to…
  - Client includes the native content as a MIME attachment
- Server sees the "pointer" and uses the MIME attachment, somehow, to resolve the unmarshalled native content element.
This approach using MIME attachments has several advantages:

- The ~33% size expansion required by Base64 encoding is no longer incurred
- Since the attachment is a separate part of the HTTP stream, it is easy to treat it differently (stream it to output, send it to a media server, etc.) from the primary XML/JSON payload
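To make the idea concrete, here is a rough sketch of what such an MTOM/XOP-style exchange could look like on the wire. Note this is purely illustrative: the boundary, Content-IDs, paths, and the `cid:` pointer convention are assumptions borrowed from how XOP works with SOAP, not anything specified by FHIR or MHD.

```http
POST /fhir/Bundle HTTP/1.1
Content-Type: multipart/related; boundary=MIME_boundary;
              type="application/fhir+json"; start="<root@example.org>"

--MIME_boundary
Content-Type: application/fhir+json
Content-ID: <root@example.org>

{ "resourceType": "Bundle",
  "...": "large base64 element replaced by a pointer such as",
  "pointer": "cid:video1@example.org" }

--MIME_boundary
Content-Type: video/mp4
Content-Transfer-Encoding: binary
Content-ID: <video1@example.org>

...raw (non-Base64) video bytes...
--MIME_boundary--
```

The key property is that the video part is raw binary, so the server can route it straight to disk or a media server without ever touching the JSON parser.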
I don't know whether something like this could be done without getting the FHIR spec updated.
Thoughts?
We thought about that as well. Using a multipart POST to submit large binary bundles would definitely have several advantages. However, this is not specified in the FHIR/MHD standard as of now, so the only two standardized options are to submit the resource directly, Base64 encoded, or to submit a URL, which in turn has the described downside that the client must be able to serve the data, which is probably not desirable in some cases.
This seems to be a problem that definitely has to be solved in the standard in the future, as both possible solutions appear insufficient for handling large binary files properly. I would agree that the best solution would be a multipart stream, which would allow us to first parse the FHIR objects and then stream the binary data directly to disk.
For now I think we should try to handle this within the standard as best as possible. Maybe we should bring this to the attention of the MHD/FHIR committee so that it can be addressed in the near future?
Bill certainly does raise an important point. If your payload is potentially a gigabyte worth of video, putting it as-is in a resource is really not a great option, since no matter how efficient you are, it will be base64 encoded and therefore take up way more bandwidth than it needs to.
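The overhead is easy to quantify: Base64 maps every 3 input bytes to 4 output characters, a fixed ~33% expansion. A small JDK-only demonstration:

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        byte[] payload = new byte[3_000_000]; // stand-in for 3 MB of video
        String encoded = Base64.getEncoder().encodeToString(payload);
        // Every 3 bytes of input become 4 characters of output
        System.out.println(encoded.length());                         // prints 4000000
        System.out.println(encoded.length() * 100L / payload.length); // prints 133
    }
}
```

For a 1 GB video that means roughly 1.33 GB on the wire, before considering the memory cost of materializing the string.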
FHIR does of course already have the concept of the /Binary endpoint, which can be used to transfer binary media. One possible way of achieving what Bill is describing without needing to resort to multipart MIME payloads could be to use the Binary endpoint to store the payload, coupled with an Attachment datatype with a URL reference to that payload.
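Concretely, the DocumentReference's attachment could then carry only metadata plus a URL pointing at the separately-POSTed Binary. A minimal sketch (the server host and Binary id below are made up):

```json
{
  "resourceType": "DocumentReference",
  "content": [{
    "attachment": {
      "contentType": "video/mp4",
      "url": "https://example.org/fhir/Binary/surgery-video-123"
    }
  }]
}
```

The client would first POST the raw video to /Binary with `Content-Type: video/mp4`, then reference the resulting id here, so no Base64 encoding is involved at any point.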
HAPI's server doesn't currently have a super-efficient way of dealing with /Binary (it's currently parsed into a Binary resource), but that could definitely be improved, and wouldn't require spec changes.
This is definitely an interesting alternative, thanks for pointing this out!
We still have to be prepared to receive rather large binary resources via bundle (Base64-encoded data wastes about 1/3 extra space in the worst case), but you made a good point, as this is already covered by FHIR! I did not think about that because MHD does not discuss this issue at all!
Coming back to my original post, I think we should also improve the efficiency of the handling of binary resources: make it possible (much in the same way as with Base64-encoded binary resources within a bundle) to stream the content directly to disk and never hold the entire resource in memory, as it could sometimes get rather large!
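The decode-to-disk part needs nothing beyond the JDK once the parser can hand over the raw Base64 character stream: `java.util.Base64.Decoder.wrap` decodes incrementally, so memory use stays constant regardless of payload size. A sketch (the method and file names are illustrative):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class StreamingDecode {
    /** Decodes a Base64 stream chunk-by-chunk to a file; returns the decoded byte count. */
    static long decodeToFile(InputStream base64Source, Path target) throws Exception {
        try (InputStream decoded = Base64.getDecoder().wrap(base64Source);
             OutputStream out = Files.newOutputStream(target)) {
            return decoded.transferTo(out);
        }
    }
}
```

The hard part, as discussed below, is getting the parser to expose that stream in the first place rather than materializing the whole element.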
So, I guess there are two sides to this:
If you want to look at improving the /Binary endpoint in a server, probably the easiest way to do this would be with a custom interceptor that handles the request before the rest of the framework gets to it. I think this would be quite easy to write, actually.
If you want to look at enhancing the parser, I guess ultimately what you need is some way of registering a handler for binary content, so that the handler would receive it rather than the parsed object model. That is, unless you're envisioning some sort of streaming API like SAX, replacing the current parsed object model (which acts more like DOM).
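One possible shape for such a registration API, sketched with plain-JDK stand-ins (every name below is hypothetical, not an existing HAPI interface): on hitting a binary element, the parser calls out to the handler instead of materializing the bytes into the object model.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical callback: receives raw binary content instead of the object model. */
interface BinaryContentHandler {
    /** Consumes the content of one binary element, identified by its element path. */
    void onBinaryContent(String elementPath, InputStream content) throws IOException;
}

/** Toy stand-in for a parser that supports handler registration. */
class SketchParser {
    private BinaryContentHandler handler;

    void registerBinaryContentHandler(BinaryContentHandler h) { this.handler = h; }

    /** A real parser would call this while streaming past a base64 element. */
    void parseBinaryElement(String path, InputStream content) throws IOException {
        if (handler != null) {
            handler.onBinaryContent(path, content); // bypasses the object model
        }
        // otherwise: fall back to materializing the bytes into the parsed model
    }
}
```

The handler could then chain directly into a streaming Base64 decode to disk, keeping memory use constant.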
Either way, there is one big caveat: The XML parser internally works on a stream, so having an efficient streaming handler is easy. On the other hand, the JSON parser does not work on a stream, but rather parses the JSON object tree into memory entirely before translating it into FHIR. We do this because of a JSON limitation: JSON objects do not have ordered keys, so you need to be able to handle the properties of an object arriving in arbitrary order. This isn't an insurmountable problem, but it isn't trivial either. It means, for example, that you might get an extension on a base64 attachment before you get the attachment itself, or that you might get a bunch of elements in a resource body before you even know what type of resource you are parsing.
Maybe it would be possible to start with POSTing and GETting content (i.e. not the Binary resource, but the binary payload). Internally this is currently mapped to/from an IBaseBinary instance. I wonder whether we could simply add a getDataHandler()/setDataHandler() to the interface (and its implementations) that would allow streaming the content either directly from the request or from a (temporary) file. Cutting content out of a regular resource (or even out of a transaction bundle like in MHD ITI-65) sounds really complicated to me as a first step.
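A minimal sketch of what such a data-handler extension could look like. IBaseBinary itself is a real HAPI interface, but everything below is a hypothetical stand-in using only the JDK, to show the shape of the idea rather than the actual implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical supplier of binary content that defers materialization. */
interface BinaryDataHandler {
    InputStream openStream() throws IOException;
    long contentLength() throws IOException;
}

/** Toy stand-in for an IBaseBinary that can hold a handler instead of a byte[]. */
class BinarySketch {
    private BinaryDataHandler dataHandler;

    void setDataHandler(BinaryDataHandler h) { this.dataHandler = h; }
    BinaryDataHandler getDataHandler() { return dataHandler; }

    /** Backs the binary content with a (temporary) file instead of heap memory. */
    static BinaryDataHandler fileHandler(Path file) {
        return new BinaryDataHandler() {
            public InputStream openStream() throws IOException { return Files.newInputStream(file); }
            public long contentLength() throws IOException { return Files.size(file); }
        };
    }
}
```

On a POST, the server would spool the request body to a temporary file and attach a file-backed handler; on a GET, it would copy the handler's stream straight to the response.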
I think the best way would be, as stated by James before, to upload the binary content and then reference this content in the bundle. This has, however, the downside that we may have to store large binary data that may never be used/linked, so there would also have to be some cleanup logic. Anyway, I would expect to have clients in the field which will only be able to upload the "normal" way, directly in the bundle as Base64, so this should be supported in any case.
I had a close look at the architecture and at how I could integrate this, but failed to find a good way to tackle the discussed problems without ending up with all the content in memory all over again. Also, the XML vs. JSON (ordering) issue discussed above will be quite tricky.
I finally solved this issue outside of HAPI-FHIR by utilizing an "ExtractorStream", which cuts out all Base64 content before the bundle is parsed and streams this binary data directly to where it should end up...
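The actual ExtractorStream is not shown in this thread, but the idea can be sketched as a character-level filter that runs before the JSON parser. This simplified version assumes the payload lives in a `"data":"..."` JSON field with no escaped quotes inside the value (which valid Base64 satisfies); the `<extracted>` placeholder and class name are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Simplified sketch: copies JSON to 'main', diverting "data" field values to 'side'. */
class Base64Extractor {
    static void extract(InputStream json, OutputStream main, OutputStream side) throws IOException {
        final String marker = "\"data\":\"";
        int matched = 0;          // how many chars of the marker we have seen so far
        boolean inValue = false;  // true while inside the Base64 value
        int c;
        while ((c = json.read()) != -1) {
            if (inValue) {
                if (c == '"') {                 // closing quote ends the Base64 value
                    inValue = false;
                    main.write("<extracted>\"".getBytes());
                } else {
                    side.write(c);              // stream the payload out of band
                }
            } else {
                main.write(c);
                if (c == marker.charAt(matched)) {
                    if (++matched == marker.length()) { inValue = true; matched = 0; }
                } else {
                    matched = (c == marker.charAt(0)) ? 1 : 0;
                }
            }
        }
    }
}
```

The shrunken `main` stream (with placeholders instead of payloads) is then cheap to parse with the regular object model, while `side` goes straight to disk. A production version would additionally need to handle whitespace around the colon and XML input.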
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In MHD (Mobile access to Health Documents) [1], which builds on the FHIR standard, there is a need to be able to send large binary attachments (contained) in the DocumentReference resource, e.g. if a mobile client wants to store a multi-gigabyte video file of, say, an operation in a repository. Of course, using a reference to a Binary resource might be a solution for this in some cases, but it might not always be possible when using a mobile client because of network (i.e. NAT) restrictions when retrieving the binary resource. We also want to limit the complexity on the client side.
Currently HAPI-FHIR reads the entire data stream into memory when parsing the JSON/XML object. For very large resources this can be a problem on servers which have to handle many clients in parallel; to accomplish this in a resource-efficient manner, it would be good if the binary content could be streamed directly to disk. To accomplish this, we propose extending HAPI-FHIR to allow setting the XML and JSON parsers that FhirContext creates with newJsonParser and newXmlParser, much like it already allows setting the RestfulClientFactory.
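FhirContext, newJsonParser, newXmlParser, and RestfulClientFactory are real HAPI API; the factory hook itself is the proposed (hypothetical) extension. Sketched with plain-JDK stand-ins, the shape would be roughly:

```java
import java.util.function.Supplier;

/** Toy stand-in for a parser produced by the context. */
interface ParserSketch {
    String encode(Object resource);
}

/** Toy stand-in for FhirContext with a settable parser factory, as proposed. */
class ContextSketch {
    // default factory, analogous to HAPI's built-in JSON parser
    private Supplier<ParserSketch> jsonParserFactory = () -> r -> "{\"default\":true}";

    /** Proposed hook: let callers swap in e.g. a streaming parser implementation. */
    void setJsonParserFactory(Supplier<ParserSketch> factory) {
        this.jsonParserFactory = factory;
    }

    ParserSketch newJsonParser() {
        return jsonParserFactory.get();
    }
}
```

Existing callers of newJsonParser would be unaffected unless a custom factory is installed, mirroring how RestfulClientFactory replacement works today.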
If this is okay with you, we would like to implement this feature as a contribution to HAPI-FHIR!
[1] http://wiki.ihe.net/index.php/Mobile_access_to_Health_Documents_(MHD)