usnistgov / iheos-toolkit2

XDS Toolkit
https://ihexds.nist.gov
45 stars 25 forks source link

Racecondition: Different requests get same response #583

Closed dbaumann2 closed 1 year ago

dbaumann2 commented 1 year ago

In the IgActorSimulator and IgxActorSimulator an AdhocQueryResponseGenerator is set to a field called ahqrg. This can lead to mixed up responses when multiple requests are handled in parallel.

dbaumann2 commented 1 year ago

see Pull Request #584

skbhaskarla commented 1 year ago

@dbaumann2, could you please upload some test artifacts that demonstrate the race condition? Thanks

dbaumann2 commented 1 year ago

Unfortunately I have no automated test demonstrating the issue. Here is some logging output from our application. Response message with ID 21 is expected to have Message ID "urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf" and therefore cannot be correlated by our application. This led to us investigating the code of the toolkit. Hope this helps to understand the issue.

Message ID 21: Request, AdhocQueryRequest, MessageId urn:uuid:0e5b9323-7076-47ca-b0f6-9d34c17d6eaf Message ID 20: Request, AdhocQueryRequest, MessageId urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45 Message ID 19: Response... Message ID 22: Response... Message ID 20: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45 Message ID 21: Response, AdhocQueryResponse, RelatesTo urn:uuid:452d35f4-ed04-4f6c-8230-bd70083b3e45

skbhaskarla commented 1 year ago

I understand the issue but without a test procedure that can consistently reproduce the problem, I wouldn't be able to confirm the bug fix.

dbaumann2 commented 1 year ago

Is a simple unit test sufficient or do you need an integration test?

skbhaskarla commented 1 year ago

Is a simple unit test sufficient or do you need an integration test?

Any of those type of tests is fine as long as it shows the toolkit generated MessageID is incorrect.

dbaumann2 commented 1 year ago

I've written a SoapUI Performance Test that can detect the error: image

ToolKit-Test-soapui-project.xml.zip

skbhaskarla commented 1 year ago

I built XDS Toolkit from your pull request, ran the load tests but according to the results in the screenshot below, the error condition still exists.

image

image

The good news is that your test is able to confirm the issue you reported but I'm afraid the pull request code did not solve the issue.

I am adding @andrewmccaffreynist to help look into the root cause of this issue.

dbaumann2 commented 1 year ago

We did not take a deeper look at the code. It just seemed to be the relevant place, but we unfortunately did not have time to compile and deploy it yet. Thank you very much for further investigating this issue!

mratzenb commented 1 year ago

We did some more investigation and found the problematic place. It is the creation of event directories, which are then used for saving the request, response and some other stuff.

When handling a request in gov/nist/toolkit/fhir/simulators/servlet/SimServlet.java a new gov/nist/toolkit/simcommon/server/SimDb.groovy instance is created every time. In the constructor, an event Directory with the current Date as name. This is already the problem, because if two SimDb instances get created with the same Date (which can happen, since the timestamp is made up of year.month.day.hours.minutes.seconds.milliseconds), everything serialized into the event directory will be overwritten. Hence, explaining the mismatch of messageID and relatesTo.

We "fixed" this issue by adding a new UUID to gov/nist/toolkit/installation/server/Installation.java#asFilenameBase like so:

    // Installation.java#asFilenameBase
    return val + "-"+ UUID.randomUUID();

Using the provided SoapUI project, we can see no errors anymore. However, we don't know the side effects this will have on the entire Toolkit.

An additional improvement would be to not rely on I/O operations for the request/response handling. For example, ThreadLocals could be used for storing the data within the current thread.

mratzenb commented 1 year ago

After some more digging around, a more elegant solution would be to synchronize during the file event creation inside SimDb like so:

public class SimDb {
    // ...
    private static final Object fileLock = new Object();
    //...
    SimDb(SimId simId, String actor, String transaction, boolean openToLastTransaction) {
        //...
        if (openToLastTransaction) {
            openMostRecentEvent(actor, transaction)
        } else {
            // SimDb Line: 115
                        File eventDir;
            synchronized (fileLock) {
                eventDate = new Date();
                eventDir = mkEventDir(eventDate);
                eventDir.mkdirs();
            }
            Serialize.out(new File(eventDir, "date.ser"), eventDate);
        }
    }
    // ...
}
skbhaskarla commented 1 year ago

@mratzenb Thanks for your investigation, I believe you have found the root cause of the reported issue.

skbhaskarla commented 1 year ago

Additional details based on the investigation above: The problem has two parts: 1) Clobbered SimDB Event folder by the same current timestamp creation and 2) Unreliable response message relatesTo/messageId correlation. For the first part, the contributed code seems to mitigate the issue. I have run the SoapUI tests against the updated code, and looking at the SimDB Event folders created, the race condition mitigation is evident when creating two separate event folders based on the same timestamp. Example: 2023_06_16_11_08_42_622 and 2023_06_16_11_08_42_622_1. In this case, the appended integer counter (i.e., "_1") to the folder name is made unique by the mkEventDir method once it is given a chance. This was how the fix was verified in my local environment.

Second, as it was suggested as an "additional improvement", I agree that the request message validation can use an in-memory content reference rather than re-loading from it from the SimDB Event folder on disk.