Preprocessed Data Lake - Githubissues

lshtm-gis / WHO_PHSM_Cleaning

Cleaning PHSM provider data for WHO

https://lshtm-gis.github.io/WHO_PHSM_Cleaning/html/

MIT License


Preprocessed Data Lake #175

Open hamishgibbs opened 3 years ago

hamishgibbs commented 3 years ago

We need a data lake somewhere (a SQL database? MongoDB?) where we can store the original record corresponding to each record included in our dataset. Each stored record should be referenced by uuid and who_id.

Loading this DB probably fits best in the individual process.py files, with the uuid as the primary key (PK). The load should happen after a new record is generated but before key mapping is applied.

I am leaning towards MongoDB, as records will already be divided into individual rows, but I am open to discussion.
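
A minimal sketch of the storage step described above, for discussion only. It uses the standard-library sqlite3 module in place of MongoDB so that it runs without a server; the table name raw_records, the helper names, and the example fields and who_id value are all hypothetical and not taken from the repository.

```python
import json
import sqlite3
import uuid


def init_lake(conn):
    """Create the (hypothetical) raw_records table, keyed by uuid."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_records ("
        "uuid TEXT PRIMARY KEY, who_id TEXT, payload TEXT NOT NULL)"
    )
    # Secondary index so original records can also be looked up by who_id.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_raw_who_id ON raw_records (who_id)"
    )


def store_original(conn, record_uuid, who_id, original):
    """Store the untouched provider record as JSON, before key mapping."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_records (uuid, who_id, payload) "
        "VALUES (?, ?, ?)",
        (record_uuid, who_id, json.dumps(original, default=str)),
    )
    conn.commit()


def fetch_original(conn, record_uuid):
    """Return the original record for a uuid, or None if it is not stored."""
    row = conn.execute(
        "SELECT payload FROM raw_records WHERE uuid = ?", (record_uuid,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Example usage with made-up values (assumption: records are dict-like rows).
conn = sqlite3.connect(":memory:")
init_lake(conn)
record_id = str(uuid.uuid4())
original_row = {"country": "Exampleland", "measure": "School closure"}
store_original(conn, record_id, "EXAMPLE_1", original_row)
```

Storing the payload as JSON keeps each provider's original columns intact without a fixed schema, which is the main property a document store such as MongoDB would also provide.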

todowede commented 3 years ago

I am moving the check for new records, which uses prov_id and unique keyword combinations, to the preprocessing stage. The existing check on Oxford data will still be required because the WHO measure may change.
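
A rough sketch of what such a preprocessing check could look like, under stated assumptions: each record is a dict, a record counts as already known when its prov_id together with a chosen set of keyword fields has been seen before, and the field names "measure" and "target" are placeholders. None of these names come from the repository.

```python
def record_key(record, keyword_fields):
    """Build a comparison key from prov_id plus the chosen keyword fields."""
    return (record.get("prov_id"),) + tuple(record.get(f) for f in keyword_fields)


def find_new_records(incoming, known_keys, keyword_fields):
    """Return incoming records whose key is not already known.

    known_keys is left unmodified; duplicates within incoming are dropped.
    """
    seen = set(known_keys)
    new_records = []
    for record in incoming:
        key = record_key(record, keyword_fields)
        if key not in seen:
            seen.add(key)
            new_records.append(record)
    return new_records


# Example usage with made-up values.
KEYWORD_FIELDS = ("measure", "target")  # hypothetical keyword columns
known = {("p1", "school closure", "primary schools")}
incoming = [
    {"prov_id": "p1", "measure": "school closure", "target": "primary schools"},
    {"prov_id": "p2", "measure": "curfew", "target": "general population"},
]
new_records = find_new_records(incoming, known, KEYWORD_FIELDS)
```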