hapifhir / hapi-fhir

🔥 HAPI FHIR - Java API for HL7 FHIR Clients and Servers
http://hapifhir.io
Apache License 2.0
2.01k stars 1.32k forks source link

Serialization: DontEncodeElements use results in invalid resource #6304

Open qligier opened 2 weeks ago

qligier commented 2 weeks ago

Describe the bug When using the DontEncodeElements feature of the parser, the serialized resource is invalid if an element contains a single value, and that was ignored by DontEncodeElements. The element is added without content, which is not allowed by the FHIR specification.

To Reproduce The code

final var patient = new Patient();
patient.setId("test-parser");
patient.addName().setFamily("Doe");

final var parser = FhirContext.forR4Cached().newJsonParser();
parser.setDontEncodeElements(Set.of("*.name.family"));
parser.encodeResourceToString(patient);

results in:

{"resourceType":"Patient","id":"test-parser","name":[{}]}
<Patient xmlns="http://hl7.org/fhir"><id value="test-parser"/><name/></Patient>

Expected behavior The parser should ignore Patient.name, as the only value in name[0] is ignored. But the parser is only checking if the name list is empty, or if all values are empty (which it is not).

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Additional context It looks like it's missing a check here: https://github.com/hapifhir/hapi-fhir/blob/fb93c3d601272a146904e438c11af93f32e70f1e/hapi-fhir-base/src/main/java/ca/uhn/fhir/parser/JsonParser.java#L553-L563 The EncodeContext contains the information about elements to ignore, but it's not checked at this point. It will only be checked when name[0] is serialized. Same for the XML parser.

jamesagnew commented 2 weeks ago

Agree this is an issue - it would be hard to fix in a way that doesn't hurt performance though. Patches welcomed.

qligier commented 2 weeks ago

I can try to make a pull request for that, but my understanding of the parser is pretty low. I probably won't be able to come with an optimized patch.

About performance, what are the important use cases? I guess the performance of the parser without that option enabled is important because it most likely represents the huge majority of its usage. Is it important to optimize the parser with that option enabled?

jamesagnew commented 2 weeks ago

setDontEncodeElements(...) is used extensively within the JPA server internally, so it needs to be fast since it will be a part of the code path for every write to the JPA database.

The main thing I'm worried about is that we're not running huge emptiness checks in cases where they aren't needed. So, e.g. if you've called setDontEncodeElements("Patient.name.family") we'd need some optimized code path that could test all instances of Patient.name and determine if they had a value in family only, skipping outputting even the name key if that was the case. And presumably then we'd have to test again for each individual repetition in the case where one name has other values but the other name has only a family.

This code would get complicated fast, and would need to be aware of extensions too I suppose - E.g. if you exclude Patient.name.family we'd presumably still want to include a patient name if all it had was an extension on the root of the HumanName.

One alternate approach that might conceivably work would be to rework JacksonWriter so that it buffered the output events and only flushed the buffer for a given block once it encountered actual meaningful contents in that block. This too could get complex fast though and would need to be very comprehensively unit tested and performance benchmarked to ensure no regressions occurred.

qligier commented 1 week ago

Thank you for the insight, I had no idea. Yeah, a good solution could quickly become hard to implement. Another solution, a bit more hacky-ish, could be to check the output string for the presence of '{}' or '[]' (for the JSON parser). If so, it could be removed with a pattern replace. Checking for the presence of a specific substring is really fast, and removing it can be fast if the regex is carefully crafted. It would be less optimized for the XML parser because the detection of an empty content would only be doable with another regex.