Closed dbaumann2 closed 1 year ago
see Pull Request #584
@dbaumann2, could you please upload some test artifacts that demonstrate the race condition? Thanks
Unfortunately I have no automated test demonstrating the issue. Here is some logging output from our application. Response message with ID 21 is expected to have Message ID "urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf" and therefore cannot be correlated by our application. This led to us investigating the code of the toolkit. Hope this helps to understand the issue.
Message ID 21: Request, AdhocQueryRequest, MessageId urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf Message ID 20: Request, AdhocQueryRequest, MessageId urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45 Message ID 19: Response... Message ID 22: Response... Message ID 20: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45 Message ID 21: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45
I understand the issue but without a test procedure that can consistently reproduce the problem, I wouldn't be able to confirm the bug fix.
Is a simple unit test sufficient or do you need an integration test?
Is a simple unit test sufficient or do you need an integration test?
Any of those type of tests is fine as long as it shows the toolkit generated MessageID is incorrect.
I've written a SoapUI Performance Test that can detect the error:
I built XDS Toolkit from your pull request, ran the load tests but according to the results in the screenshot below, the error condition still exists.
The good news is that your test is able to confirm the issue you reported but I'm afraid the pull request code did not solve the issue.
I am adding @andrewmccaffreynist to help look into the root cause of this issue.
We did not take a deeper look at the code. It just seemed to be the relevant place, but we unfortunately did not have time to compile and deploy it yet. Thank you very much for further investigating this issue!
We did some more investigation and found the problematic place. It is the creation of event directories
, which are then used for saving the request, response and some other stuff.
When handling a request in gov/nist/toolkit/fhir/simulators/servlet/SimServlet.java
a new gov/nist/toolkit/simcommon/server/SimDb.groovy
instance is created every time. In the constructor, an event Directory
with the current Date
as name. This is already the problem, because if two SimDb
instances get created with the same Date
(which can happen, since the timestamp is made up of year.month.day.hours.minutes.seconds.milliseconds
), everything serialized into the event directory will be overwritten. Hence, explaining the mismatch of messageID
and relatesTo
.
We "fixed" this issue by adding a new UUID
to gov/nist/toolkit/installation/server/Installation.java#asFilenameBase
like so:
// Installation.java#asFilenameBase
return val + "-"+ UUID.randomUUID();
Using the provided SoapUI project, we can see no errors anymore. However, we don't know the side effects this will have on the entire Toolkit.
An additional improvement would be to not rely on I/O operations for the request/response handling. For example, ThreadLocal
s could be used for storing the data within the current thread.
After some more digging around, a more elegant solution would be to synchronize during the file event creation inside SimDb
like so:
public class SimDb {
// ...
private static final Object fileLock = new Object();
//...
SimDb(SimId simId, String actor, String transaction, boolean openToLastTransaction) {
//...
if (openToLastTransaction) {
openMostRecentEvent(actor, transaction)
} else {
// SimDb Line: 115
File eventDir;
synchronized (fileLock) {
eventDate = new Date();
eventDir = mkEventDir(eventDate);
eventDir.mkdirs();
}
Serialize.out(new File(eventDir, "date.ser"), eventDate);
}
}
// ...
}
@mratzenb Thanks for your investigation, I believe you have found the root cause of the reported issue.
Additional details based on the investigation above:
The problem has two parts: 1) Clobbered SimDB Event folder by the same current timestamp creation and 2) Unreliable response message relatesTo/messageId correlation.
For the first part, the contributed code seems to mitigate the issue. I have run the SoapUI tests against the updated code, and looking at the SimDB Event folders created, the race condition mitigation is evident when creating two separate event folders based on the same timestamp. Example: 2023_06_16_11_08_42_622 and 2023_06_16_11_08_42_622_1. In this case, the appended integer counter (i.e., "_1") to the folder name is made unique by the mkEventDir
method once it is given a chance. This was how the fix was verified in my local environment.
Second, as it was suggested as an "additional improvement", I agree that the request message validation can use an in-memory content reference rather than re-loading from it from the SimDB Event folder on disk.
In the IgActorSimulator and IgxActorSimulator an AdhocQueryResponseGenerator is set to a field called ahqrg. This can lead to mixed up responses when multiple requests are handled in parallel.