globaldothealth / fhirflat

ISARIC 3.0 Pipeline - FHIRFlat
https://fhirflat.readthedocs.io
MIT License

Speed: take 2 #64

Closed by pipliggins 2 months ago

pipliggins commented 3 months ago

Different approach to increasing ingestion speed.

This PR uses joblib to parallelise the ingestion across resources.

Without parallelisation, the current timings are:

```
Patient took 0.88 seconds to convert 67 rows.
Encounter took 0.74 seconds to convert 67 rows.
2 resources not created due to validation errors. Errors saved to encounter_errors.csv
Observation took 1.81 seconds to convert 2194 rows.
Condition took 1.07 seconds to convert 461 rows.
Total time: 4.4922334590228274
```

With parallelisation, the timings are:

```
Patient took 0.78 seconds to convert 67 rows.
Encounter took 0.73 seconds to convert 67 rows.
2 resources not created due to validation errors. Errors saved to encounter_errors.csv
Observation took 1.71 seconds to convert 2194 rows.
Condition took 0.86 seconds to convert 461 rows.
Total time: 3.185586916981265
```

(readout edited for easy comparison)
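The pattern above can be sketched with joblib's `Parallel`/`delayed` API: each resource type (Patient, Encounter, Observation, Condition) converts independently of the others, so the per-resource conversion calls can be dispatched across worker processes. This is a minimal illustration only; `convert_resource` and the dummy row counts are hypothetical stand-ins, not the fhirflat implementation.

```python
from time import perf_counter

from joblib import Parallel, delayed


def convert_resource(name: str, rows: list[dict]) -> tuple[str, int]:
    """Hypothetical stand-in for the per-resource conversion step."""
    # A real implementation would validate and flatten each row here.
    return name, len(rows)


# One entry per FHIR resource type being ingested (dummy row data,
# sized to match the readout above).
resources = {
    "Patient": [{"id": i} for i in range(67)],
    "Encounter": [{"id": i} for i in range(67)],
    "Observation": [{"id": i} for i in range(2194)],
    "Condition": [{"id": i} for i in range(461)],
}

start = perf_counter()
# Resource types have no cross-dependencies, so convert them in parallel;
# joblib returns results in the same order as the input iterable.
results = Parallel(n_jobs=-1)(
    delayed(convert_resource)(name, rows) for name, rows in resources.items()
)
print(f"Total time: {perf_counter() - start}")
```

Because the work is split per resource type rather than per row, the speed-up is bounded by the slowest single resource (Observation here), which matches the modest improvement in the readout.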

codecov-commenter commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.70%. Comparing base (627c07e) to head (e809c1e). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
- Coverage   97.70%   97.70%   -0.01%
==========================================
  Files          40       42       +2
  Lines        1962     2047      +85
==========================================
+ Hits         1917     2000      +83
- Misses         45       47       +2
```


pipliggins commented 3 months ago

While there may be further speed gains from rewriting some of the dictionary creation, before worrying about that I'm going to see how slow ingestion gets when creating the 'full' dengue mapping files, compared with the time taken to read in the data files.