Need a data lake somewhere (SQL db)? (Mongo db)? Where we can store the original record corresponding to a certain record included in our dataset. This should be referenced by uuid and who_id.
Loading this DB probably fits best in the process.py individual files with a PK of the uuid. Before key mapping is applied but after the generation of a new record.
Leaning towards a mongo DB as records will already be divided into individual rows but I am open to discussion.
Moving check for new records using prov_id and unique keyword combinations to the preprocessing stage. Existing check on Oxford data will still be required due to possible change in who measure.
Need a data lake somewhere (SQL db)? (Mongo db)? Where we can store the original record corresponding to a certain record included in our dataset. This should be referenced by
uuid
andwho_id
.Loading this DB probably fits best in the
process.py
individual files with a PK of the uuid. Before key mapping is applied but after the generation of a new record.Leaning towards a mongo DB as records will already be divided into individual rows but I am open to discussion.