Speed up data transformation time

globaldothealth / fhirflat

ISARIC 3.0 Pipeline - FHIRFlat

https://fhirflat.readthedocs.io

MIT License


Speed up data transformation time #44

Closed pipliggins closed 2 weeks ago

pipliggins commented 3 months ago

Timings for the Dengue subset (67 subjects), converting ~30 observations/conditions:

Patient took 1.19 seconds to convert 67 rows.
Encounter took 1.31 seconds to convert 67 rows.
Observation took 6.85 seconds to convert 2194 rows.
Condition took 5.99 seconds to convert 461 rows.

abhidg commented 3 months ago

The resources are independent, so they can be parallelized. Most of this time is probably due to unpacking and repacking from fhir.resources, which has yet to be rewritten to support Pydantic v2 (which is much faster). I will take a look with a profiler.
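
A profiling pass like the one mentioned above could be sketched with the standard library's cProfile. This is only a minimal illustration: `convert_row` and `profile_conversion` are hypothetical stand-ins, not fhirflat functions, and the real conversion call would take the place of `convert_row`.

```python
import cProfile
import io
import pstats


def convert_row(row: dict) -> dict:
    """Hypothetical stand-in for converting one row (not a fhirflat function)."""
    return {key.lower(): str(value) for key, value in row.items()}


def profile_conversion(rows: list, top: int = 10) -> str:
    """Run the conversion under cProfile; return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for row in rows:
        convert_row(row)
    profiler.disable()

    # Write the report into a string buffer instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Made-up rows, only to give the profiler something to measure.
report = profile_conversion([{"ID": i, "Temp": 37.0 + i} for i in range(1000)])
print(report)
```

Sorting by cumulative time puts functions that dominate through their callees near the top, which is where object construction and validation overhead would be expected to show up if it is the bottleneck.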

An alternative could be to parse and validate in two steps. Parsing and conversion should be relatively fast, as they would need only the mapping file, without any reference to fhirflat.resources. Validation would still need to construct the resource object, but that could be parallelised across rows/resources.
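
The two-step idea above might look roughly like the sketch below. Everything here is an assumption for illustration: `MAPPING`, `parse_row`, `validate_row` and `convert` are hypothetical names, and the validation rule is a placeholder for constructing the real resource object.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Optional

# Hypothetical mapping from raw column names to flat field names.
MAPPING = {"subjid": "subject", "temp": "value"}


def parse_row(raw: dict) -> dict:
    """Step 1: cheap rename using only the mapping; no resource objects are built."""
    return {flat: raw[src] for src, flat in MAPPING.items() if src in raw}


def validate_row(flat: dict) -> Optional[str]:
    """Step 2: placeholder for resource construction; returns an error message or None."""
    if not flat.get("subject"):
        return "missing subject"
    return None


def convert(rows: list, executor: Executor):
    """Parse serially, then validate each row independently on the given executor."""
    flat_rows = [parse_row(row) for row in rows]
    errors = list(executor.map(validate_row, flat_rows))
    valid = [flat for flat, err in zip(flat_rows, errors) if err is None]
    invalid = [(flat, err) for flat, err in zip(flat_rows, errors) if err is not None]
    return valid, invalid


rows = [{"subjid": "A1", "temp": 38.5}, {"subjid": "", "temp": 37.0}]
with ThreadPoolExecutor() as pool:
    valid, invalid = convert(rows, pool)
print(valid)
print(invalid)
```

A thread pool keeps the sketch simple to run, but it will not speed up CPU-bound validation in CPython. For that, a `ProcessPoolExecutor` created under an `if __name__ == "__main__":` guard could be passed to `convert` instead.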

pipliggins commented 2 months ago

Using pandarallel for the ingest_to_flat() conversions to/from object format gives a significant speedup (roughly 3x for Observation and 4.5x for Condition), even without parallelising across different resources:

Patient took 1.18 seconds to convert 67 rows. 
Encounter took 0.97 seconds to convert 67 rows. 
2 resources not created due to validation errors. Errors saved to encounter_errors.csv
Observation took 2.19 seconds to convert 2194 rows. 
Condition took 1.33 seconds to convert 461 rows.

abhidg commented 2 months ago

Also see the recommended performance dependencies for pandas (numexpr, bottleneck, numba): https://pandas.pydata.org/docs/getting_started/install.html#performance-dependencies-recommended