orchid-initiative / synthetic-database-project

MIT License

Fix the fields identified as needed in the summary statistics that are not yet complete: prioritize the 'yellow' fields, then the 'blue' fields, on the list at https://docs.google.com/spreadsheets/d/1uWe-IaOa1SV7UIm4N8fs-ym2XtGVVCDMKyDE3fL8oQ8/edit#gid=1477458142 #37

Open Dior13 opened 1 year ago

rileeki commented 1 year ago

https://docs.google.com/spreadsheets/d/1uWe-IaOa1SV7UIm4N8fs-ym2XtGVVCDMKyDE3fL8oQ8/edit#gid=1477458142

reference https://hcai.ca.gov/wp-content/uploads/2022/12/IP-format-and-file-specs-jan-2023.pdf

TravisHaussler commented 1 year ago

My local runs are now working and producing the formatted data output, yay! I checked in fixes for the various errors they hit. Once other people try running this locally, I can be available to help them; we will probably discover more bugs on non-identical setups, and we can review the READMEs along the way.

I am working on fields now. The first thing I saw was that the Race and Ethnicity mappings were swapped and White was missing from the dictionary, so I'm testing that quick fix now and will check it in before trying the first "ready to code" element.

TravisHaussler commented 1 year ago

For the Principal Diagnosis, I found a SNOMED CT to ICD-10-CM map available from the NIH: https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html. I had to submit a login request to access the map, which I have now done.

UPDATE: We found that the mapping is not 1-1 or simple. Further, Synthea does not plan to offer ICD-10-CM output (https://github.com/synthetichealth/synthea/issues/403)

rileeki commented 1 year ago

Thoughts on the homeless indicator field:

Synthea has a homelessness module.

Based on the documentation, I think periods of homelessness should be identifiable in the conditions.csv output file as SNOMED-CT code 32911000 with associated Start and Stop dates.

We might look to see if the hospital admission date falls within any period of homelessness in that patient’s condition data… or that might be too cumbersome. Thoughts? @TravisHaussler
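A rough pandas sketch of that check, in case it helps judge how cumbersome it would be. It assumes the default Synthea CSV column names (`Id`, `PATIENT`, `START` in encounters.csv; `PATIENT`, `START`, `STOP`, `CODE` in conditions.csv), which I have not verified against our settings, and the function name is made up for illustration.

```python
import pandas as pd

HOMELESS_SNOMED = "32911000"  # SNOMED-CT code quoted above


def flag_homeless_at_admission(encounters, conditions):
    """Return a copy of encounters with a boolean 'homeless_at_admission' column.

    Column names are assumed from Synthea's CSV export, not confirmed here.
    """
    is_homeless = conditions["CODE"].astype(str) == HOMELESS_SNOMED
    spells = conditions.loc[is_homeless, ["PATIENT", "START", "STOP"]].rename(
        columns={"START": "spell_start", "STOP": "spell_stop"}
    )
    # Parse everything as UTC so date-only and timestamp values compare cleanly.
    spells["spell_start"] = pd.to_datetime(spells["spell_start"], utc=True)
    spells["spell_stop"] = pd.to_datetime(spells["spell_stop"], utc=True)

    enc = encounters[["Id", "PATIENT", "START"]].copy()
    enc["admit"] = pd.to_datetime(enc["START"], utc=True)

    # One row per (encounter, homelessness spell) for the same patient.
    pairs = enc.merge(spells, on="PATIENT", how="inner")
    # A missing STOP is treated as a spell that is still ongoing.
    inside = (pairs["admit"] >= pairs["spell_start"]) & (
        pairs["spell_stop"].isna() | (pairs["admit"] <= pairs["spell_stop"])
    )
    homeless_ids = set(pairs.loc[inside, "Id"])

    out = encounters.copy()
    out["homeless_at_admission"] = out["Id"].isin(homeless_ids)
    return out
```

One caveat: a date-only STOP value is parsed as midnight, so an admission later on the stop date itself would not be flagged. We would need to decide whether that matters.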

TravisHaussler commented 1 year ago

I will look into your question about homelessness.

Elsewhere - while looking for the Present on Admission coding (which I could not find in any of the output/csv/ files), I found a new set of data that Synthea can produce using a setting in synthea_settings:

https://github.com/synthetichealth/synthea/wiki/CPCDS-Export

We set exporter.cpcds.export = true, and now I should be able to merge encounter IDs with CPCDS claim IDs to access the PoA flags. I will try to do the pandas work for that today or tomorrow.
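A minimal sketch of what that merge might look like. The CPCDS column names are passed in as parameters because I have not confirmed what the export actually calls them, and the idea that a claim ID matches an encounter `Id` is only my expectation so far. The function name is hypothetical.

```python
import pandas as pd


def attach_poa(encounters, claims, claim_id_col, poa_col):
    """Left-join PoA flags from CPCDS claims onto encounters.

    claim_id_col and poa_col are whatever the CPCDS export names those
    columns (not verified here). Returns the merged frame and the number
    of encounters with no matching claim.
    """
    slim = (
        claims[[claim_id_col, poa_col]]
        .rename(columns={claim_id_col: "Id"})
        .drop_duplicates()
    )
    # indicator=True adds a '_merge' column so unmatched rows can be counted.
    merged = encounters.merge(slim, on="Id", how="left", indicator=True)
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched
```

If the unmatched count comes back large, that would suggest the two ID spaces do not line up the way I expect and we need a different join key.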

TravisHaussler commented 1 year ago

In order to make progress getting something usable for a metrics dashboard we want to do 2 things as I understand it:

  1. Move forward with approximations for some of the harder to map data left (primarily ICD-10 codes).
  2. Investigate expanding the logic to compile ER visits correctly (currently the approach is directed towards inpatient). I believe this also relates to Riley's comment above about homelessness.

For 1, I attempted to summarize the remaining fields in a new tab on the output specs sheet so we can discuss how each one can be pursued. I added a proposal column for you to check over, @rileeki.

In particular I think some questions are:

  1. Are we comfortable using a static copy of the first mapping possibility of every SNOMED code to ICD in our code base? If so this lookup is probably pretty easy to implement for a number of fields.
  2. For fields that allow multiple entries (for example the fields "other diagnosis and POA" & "External causes of Morbidity and POA"), how should we aim to summarize the Synthea data here? It's possible it's there in Synthea; I just need to look more.

rileeki commented 1 year ago

@TravisHaussler

I reviewed your proposals in the new tab on the output specs sheet. For the most part, I agree with all of your proposals. But, I think we can get by skipping most of them if they'll be cumbersome to code. We can get some practical proof-of-concept examples without some of these fields. And, as we discussed, when we want to do something that requires the fully fleshed out data, it might be more straightforward to code a new output option in Java and have Synthea spit it out directly.

  1. Yes, I think a static 1:1 copy of a simplified SNOMED-to-ICD mapping is fine for our current purposes.
  2. For the fields with multiple entries, I think we can skip for now.
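For item 1, here is a sketch of how a static first-match lookup could be built and applied with pandas. The column names (`snomed_code`, `icd10_code`, `priority`) are placeholders I made up, not the headers of the NLM map file, and "first" here means lowest priority value per SNOMED code, which is an assumption about how we would want to pick.

```python
import pandas as pd


def build_first_match_lookup(snomed_map):
    """Collapse a one-to-many SNOMED-to-ICD map to one ICD code per SNOMED code.

    Expects hypothetical columns: snomed_code, icd10_code, priority.
    Keeps the row with the lowest priority value for each SNOMED code.
    """
    ordered = snomed_map.sort_values(["snomed_code", "priority"])
    first = ordered.drop_duplicates("snomed_code", keep="first")
    return dict(zip(first["snomed_code"].astype(str), first["icd10_code"]))


def to_icd10(codes, lookup):
    """Map a Series of SNOMED codes to ICD codes; unmapped codes become NaN."""
    return codes.astype(str).map(lookup)
```

The resulting dict could be saved once as a small static file in the code base, so the pipeline does not depend on access to the full map at run time. Codes with no entry come back as NaN, so we can count how much the simplification misses.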

Hopefully that won't be more than a few hours of work. In the meantime, I'll work on getting details together on the ER output needs.

TravisHaussler commented 10 months ago

Remaining missing fields are: