google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0
153 stars 86 forks source link

Unable to create Individual columns of nested array of objects in a dataframe #760

Closed suriyan3 closed 4 weeks ago

suriyan3 commented 1 year ago

I am creating view here

views = r4.base_r4() qrs = views.view_of('QuestionnaireResponse')

QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":qrs.item } ) ) In this dataframe qrs_item column holds string of array of objects , I am not able to directly convert that to array of objects and assign it to qrs_item or assign individual objects to different columns like the below

runner.to_dataframe( qrs.select( { "qrs_item":qrs.item[0][0] "qrs_item_2.0":"qrs_item[0][1] } )

so I needed an helper function to convert that string to json

Function to convert JSON string to list

def convert_to_list(json_str): try: return json.loads(json_str) except json.JSONDecodeError: return [] QuestionnaireResource_df.zip

Convert the qrs_item column to list

QuestionnaireResponse_df ['qrs_item'] = QuestionnaireResponse_df ['qrs_item'].apply(convert_to_list)

And if I want to run an expression on this qrs_item I need to loop through all the items which is difficult if the item has many nested items. Is there any effective solution to solve this issue.

suriyan3 commented 1 year ago

I have attached the sample data and queries for reference.

bashir2 commented 1 year ago

A few points:

suriyan3 commented 1 year ago
  1. Actually this QR table will be created with the necessary columns and then it will be joined with Encounter table. On the final table we will be running expressions to get the values. Can I run FHIRPATH statements directly on the Parquet files? I have tried running FHIRPath statements on parquet files and it was not working.
  2. Yes I have tried writing direct SQL queries against the Parquet files using DBeaver but facing issues to create columns for nested items.
bashir2 commented 1 year ago

Re 1: Well a subset of FHIRPath functions are supported, that's the way that we define each "column" in a view. For example, take a look at examples here where the first() function of FHIRPath is used to choose one member of a collection. Are you saying that even simple examples like first() does not work on Parquet+Spark OR is it more advanced FHIRPaths that don't work? If it is the latter, can you provide some examples?

suriyan3 commented 1 year ago

Ok, So when I create a dataframe using

QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":qrs.item } ) )

when I do QuestionnaireResponse_df.iloc[0]['item'] it gives me this nested item as a string value

'[{"id":null,"linkId":"vitals","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"3.1","definition":null,"text":"ANC service provided at "},{"id":null,"linkId":"service-type-details","definition":null,"text":null},{"id":null,"linkId":"machine-available","definition":null,"text":"Is BP Machine Available? "},{"id":null,"linkId":"3.2","definition":null,"text":null},{"id":null,"linkId":"3.3","definition":null,"text":null},{"id":null,"linkId":"3.4","definition":null,"text":null},{"id":null,"linkId":"3.5","definition":null,"text":null},{"id":null,"linkId":"3.6","definition":null,"text":"Blood Group "},{"id":null,"linkId":"3.10","definition":null,"text":null},{"id":null,"linkId":"3.11","definition":null,"text":"Rh factor "},{"id":null,"linkId":"3.12","definition":null,"text":null}]},{"id":null,"linkId":"counselling","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"4.1","definition":null,"text":"Was counselling done for the following"},{"id":null,"linkId":"4.2","definition":null,"text":null},{"id":null,"linkId":"4.4","definition":null,"text":null},{"id":null,"linkId":"4.6","definition":null,"text":null}]},{"id":null,"linkId":"syphilis","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"5.1.0","definition":null,"text":null},{"id":null,"linkId":"5.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hepatitis-B","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"6.1.0","definition":null,"text":null},{"id":null,"linkId":"6.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hepatitis-C","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"7.1.0","definition":null,"text":null},{"id":null,"linkId":"7.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hiv","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"8.1.0","definition":null,"text":null}]},{"id":null,"linkId":"malaria","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"malaria-1.0","definition":null,"text":null},{"id":null,"linkId":"malaria-3.0","definition":null,"text":null}]},{"id":null,"linkId":"SickleCellTest","definition":null,"text":"Sickle Cell and HbsAg test","answer":null,"item":[{"id":null,"linkId":"9.1.0","definition":null,"text":null},{"id":null,"linkId":"9.3.0","definition":null,"text":null}]},{"id":null,"linkId":"hematology","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"10.1.0","definition":null,"text":null}]},{"id":null,"linkId":"gestational-diabetes","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"11.1","definition":null,"text":null}]},{"id":null,"linkId":"urinalysis","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"12.1.0","definition":null,"text":null}]},{"id":null,"linkId":"Foetal-Examination","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"13.1","definition":null,"text":"Foetal Movement"},{"id":null,"linkId":"13.3","definition":null,"text":null},{"id":null,"linkId":"13.5","definition":null,"text":null},{"id":null,"linkId":"13.6","definition":null,"text":null},{"id":null,"linkId":"13.7","definition":null,"text":"Foetal Position"},{"id":null,"linkId":"13.11","definition":null,"text":null}]},{"id":null,"linkId":"hemanitic-calcium-supplements","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"15.1.0","definition":null,"text":null},{"id":null,"linkId":"15.2.0","definition":null,"text":null}]},{"id":null,"linkId":"llin-and-ipt-test","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"16.1.0","definition":null,"text":null},{"id":null,"linkId":"16.2.0","definition":null,"text":"Was IPT given?"}]},{"id":null,"linkId":"tetanus-diptheria","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"17.1.0","definition":null,"text":"Was Td given?"}]},{"id":null,"linkId":"associated-problems","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"18.1.0","definition":null,"text":"Associated problems"}]},{"id":null,"linkId":"previous-medical-history","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"19.1.0","definition":null,"text":"Previous medical history"}]},{"id":null,"linkId":"outcome-of-visit","definition":null,"text":null,"answer":null,"item":[{"id":null,"linkId":"20.1","definition":null,"text":null}]}]'

while creating the dataframe itself when I change to list or json system restarts or system goes unresponsive QuestionnaireResponse_df = runner.to_dataframe( qrs.select( { "qrs_id": qrs.id, "qrs_tag": qrs.meta.tag.code, "qrs_patient_id":qrs.subject.reference, "qrs_encounter_id": qrs.encounter.reference, "qrs_questionnaire": qrs.questionnaire, "qrs_item":list(qrs.item) // json.loads(qrs.item) } ) )

bashir2 commented 1 year ago

Update: what I meant by FHIRPath constraints was, for example, like this:

qr.select(
        {
            "id": qr.id,
            "answer1_0": qr.item.where(qr.item.linkId=='1.0').answer.first().value.ofType('coding').code,
        }
    )

This works and the advantage is that it picks a specific answer and assigns it to a single "column" so you don't need to deal with json/arrays, etc. However, when I tried this with qr.item.item it failed. It seems to be a bug in the fhir-views code. I have informed the team and they are looking into it.

In the meantime doing the same with SQL is probably your best bet. It also avoids loading everything into a dataframe in memory.

bashir2 commented 1 year ago

Update: The mentioned bug with qr.item.item was fixed by this commit (thanks to @rbrush) but then the current Spark runner implementation failed at generating proper SQL for constraints on qr.item.item. This is another issue which we need to look into. However, I like to suggest to reconsider doing direct SQL in the meantime. Here are some examples that hopefully should help:

The basic idea is to EXPLODE repeated columns to get a flat view and then apply required constraints. In your case, because you have answer fields at different levels, (e.g., item.item.answer vs item.item.item.answer), this is a little bit trickier but still doable. First, if you try this query (please pay attention to whether you have a _ in your table name or not):

SELECT QR.id,
  (CASE item_2.linkId WHEN '1.1' THEN item_2.answer[0].value.coding.code ELSE NULL END) AS answer_1_1,
  (CASE item_2.linkId WHEN '3.1' THEN item_2.answer[0].value.coding.code ELSE NULL END) AS answer_3_1,
  (CASE item_3.linkId WHEN '1.2.5' THEN item_3.answer[0].value.coding.code ELSE NULL END) AS answer_1_2_5,
  (CASE item_4.linkId WHEN 'v-value' THEN item_4.answer[0].value.coding.code ELSE NULL END) AS answer_v_value
FROM questionnaire_response AS QR
  LATERAL VIEW OUTER EXPLODE(QR.item) AS item_1
  LATERAL VIEW OUTER EXPLODE(item_1.item) AS item_2
  LATERAL VIEW OUTER EXPLODE(item_2.item) AS item_3
  LATERAL VIEW OUTER EXPLODE(item_3.item) AS item_4

you will see that for each QR.id, you get many rows. In the four answer columns, there are a lot of NULL values but there are also the corresponding answers. So we just need to aggregate these as follows:

SELECT QR.id,
  FIRST((CASE item_2.linkId WHEN '1.1' THEN item_2.answer[0].value.coding.code ELSE NULL END), true) AS answer_1_1,
  FIRST((CASE item_2.linkId WHEN '3.1' THEN item_2.answer[0].value.coding.code ELSE NULL END), true) AS answer_3_1,
  FIRST((CASE item_3.linkId WHEN '1.2.5' THEN item_3.answer[0].value.coding.code ELSE NULL END), true) AS answer_1_2_5,
  FIRST((CASE item_4.linkId WHEN 'v-value' THEN item_4.answer[0].value.coding.code ELSE NULL END), true) AS answer_v_value
FROM questionnaire_response AS QR
  LATERAL VIEW OUTER EXPLODE(QR.item) AS item_1
  LATERAL VIEW OUTER EXPLODE(item_1.item) AS item_2
  LATERAL VIEW OUTER EXPLODE(item_2.item) AS item_3
  LATERAL VIEW OUTER EXPLODE(item_3.item) AS item_4
GROUP BY QR.id;

We are using the FIRST aggregation function, dropping NULL values. I am not sure about the performance of this approach but it at least unblocks you.

bashir2 commented 1 year ago

Just to clarify re performance: A major concern is the GROUP BY part which is not really required to achieve what we are doing here; i.e., all answer elements should be extracted in the context of a single row without requiring a table-wide GROUP BY. I believe that's how fhir-views work but its generated SQL is too complex, hence this simpler approach.

suriyan3 commented 1 year ago

Thanks @bashir I have tested this SQL queries it is working fine, we can capture the columns we need and create a flat table. Is it possible to write a query that can loop through all the objects and in each objects there might be nested items and answer so if answer is not null then can we create a column name from the value of text key and its column value be taken from answer array.

bashir2 commented 4 weeks ago

Besides the solution offered above, now with the ViewDefinition support we can create flat-views directly from the original QuestionnaireResponse resources (with FHIRPaths to extract item and answer at arbitrary levels). Closing this issue ...