microsoft / fhir-loader

Bulk FHIR Data Loader
MIT License

Issue processing .ndjson files added to a sub-folder in the ndjson container #38

Closed lapellaniz closed 11 months ago

lapellaniz commented 1 year ago

Describe the Issue

The FHIR bulk loader fails to process NDJSON files added to subfolders of the ndjson container. Subfolder support is required when writing from PySpark. The bug is in ImportNDJSONQueue.cs: the code assumes that blobs are always at the container root, even though the Event Grid subscription is configured to pick up anything under the ndjson container regardless of nesting.

Steps to reproduce:

  1. Create a subfolder under the ndjson container.
  2. Add an .ndjson file to the subfolder.
  3. Review the NDJSON Event Grid trigger and NDJSON Queue trigger logs and see that they both fired.
  4. Review the NDJSON queue and verify that a message was written.
  5. Review the NDJSON queue trigger logs and see a warning that the message was skipped because the file was not found at the root of the ndjson container.
  6. Review the code in FHIRBulkImport/ImportNDJSONQueue.cs on line 28 - https://github.com/microsoft/fhir-loader/blob/5bed28f399ba4099b7dab990d17a8fba77430eae/FHIRBulkImport/ImportNDJSONQueue.cs#LL28C25-L28C25. This code extracts the blob name from the URL and then looks for the file at the root of the ndjson container, which is incorrect because the file was dropped in a subfolder. The code should either use the incoming URL or take everything to the right of the container name in the URL when loading the file.
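
To illustrate the difference between the two lookups described in step 6, here is a minimal Python sketch. It is not the loader's actual C# code: the function names `blob_name_only` and `blob_path_in_container` and the `example` storage host are hypothetical, and the first function only approximates the behavior described above (keeping the last URL segment).

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def blob_name_only(url: str) -> str:
    """Approximates the reported behavior: keep only the last path segment."""
    return basename(unquote(urlparse(url).path))


def blob_path_in_container(url: str, container: str) -> str:
    """Return everything to the right of the container name in the blob URL."""
    path = unquote(urlparse(url).path)  # e.g. "/ndjson/test/part-00002.json"
    prefix = f"/{container}/"
    if not path.startswith(prefix):
        raise ValueError(f"URL path {path!r} is not inside container {container!r}")
    return path[len(prefix):]


# Hypothetical storage account host, used only for illustration.
url = "https://example.blob.core.windows.net/ndjson/test/part-00002.json"

print(blob_name_only(url))                    # part-00002.json
print(blob_path_in_container(url, "ndjson"))  # test/part-00002.json
```

The first result loses the `test/` prefix, so a lookup at the container root cannot find the blob. The second keeps the full path relative to the container.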

The following queue log and message show that a file in a subfolder triggered the function, but the function failed to find it at the root of the container.

NDJSON Queue log entry:

2023-05-25T15:28:07Z   [Information]   NDJSONConverter: Processing blob at https://{storage_account_name}.blob.core.windows.net/ndjson/test/part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json...
2023-05-25T15:28:07Z   [Warning]   ImportNDJSONQueue:The blob part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json in container ndjson does not exist or cannot be read.

NDJSON Queue message:

{
  "topic": "/subscriptions/{Subscription-ID}/resourceGroups/Non-Prod-Regional-RG-MS/providers/Microsoft.Storage/storageAccounts/nhshdreg01pocstore",
  "subject": "/blobServices/default/containers/ndjson/blobs/test/part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json",
  "eventType": "Microsoft.Storage.BlobCreated",
  "id": "aad4ce79-001e-003f-571d-8f97ae060b6f",
  "data": {
    "api": "PutBlob",
    "clientRequestId": "c7873ffa-9a57-47b9-7001-425e2238674a",
    "requestId": "aad4ce79-001e-003f-571d-8f97ae000000",
    "eTag": "0x8DB5D349E755E60",
    "contentType": "application/octet-stream",
    "contentLength": 2413261,
    "blobType": "BlockBlob",
    "url": "https://{storage_account_name}.blob.core.windows.net/ndjson/test/part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json",
    "sequencer": "0000000000000000000000000000BE82000000000431e7bb",
    "storageDiagnostics": {
      "batchId": "2b5c2fb4-6006-00af-001d-8fadc0000000"
    }
  },
  "dataVersion": "",
  "metadataVersion": "1",
  "eventTime": "2023-05-25T15:27:58.1231712Z",
  "$AzureWebJobsParentId": "6046068a-db9e-4b32-9fb4-98b0aba5e3d0"
}
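
The `subject` field of the message above also carries the full blob path, so it could serve as an alternative to parsing `data.url`. The Python sketch below assumes the subject always has the form `/blobServices/default/containers/<container>/blobs/<path>`, as in this message. The helper name `container_and_blob_from_subject` is hypothetical and not part of the loader.

```python
import json


def container_and_blob_from_subject(subject: str) -> tuple[str, str]:
    """Split an Event Grid blob subject into (container, blob path)."""
    _, found, rest = subject.partition("/containers/")
    if not found:
        raise ValueError(f"no container segment in subject {subject!r}")
    # Container names cannot contain "/", so the first "/blobs/" ends the name.
    container, found, blob_path = rest.partition("/blobs/")
    if not found or not container or not blob_path:
        raise ValueError(f"no blob path in subject {subject!r}")
    return container, blob_path


# Trimmed copy of the queue message shown above (only the field used here).
message = json.loads("""
{
  "subject": "/blobServices/default/containers/ndjson/blobs/test/part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json",
  "eventType": "Microsoft.Storage.BlobCreated"
}
""")

container, blob_path = container_and_blob_from_subject(message["subject"])
print(container)  # ndjson
print(blob_path)  # test/part-00002-72aa37bc-2d0d-431c-9e65-36ef92cea748-c000.json
```

Either source (`subject` or `data.url`) preserves the `test/` subfolder. Only the name-only extraction drops it.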

mikaelweave commented 1 year ago

@evachen96 @sordahl-ga @erikhoward FYI

evachen96 commented 11 months ago

@lapellaniz Closing this issue with the above-mentioned fix. Please reopen it and let us know if you run into any more issues!