Open rbroth opened 1 year ago
A problem I've run into: JSON schemas are made to validate data in JSON format. Ergo, before we can validate data we need to load it and convert to JSON. The use of CSV presents something of a problem:
An empty field (",,") is read by some tools as an empty string, by others as null.
We have a first version of JSON schemas for data import. We also have an initial version of a function that checks a CSV file against the JSON schemas (validate_against_json_schema()). However, the function isn't currently being used in the import process.
Once the JSON schemas are being used, we can delete the view bad_biomarker_vals, which is currently used for validating biomarker data.
Using NA for null values could trip us up later; we need to ensure that there are none.
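A quick toy example of how the csv module behaves here (the column names are made up, this isn't from the import code): empty fields come back as empty strings and NA comes back as a literal string, so neither will match a schema that expects null.

```python
import csv
import io

# Toy CSV with one empty value and one "NA" value.
raw = "biomarker,value\nhb,\nferritin,NA\n"

for row in csv.DictReader(io.StringIO(raw)):
    print(row)

# {'biomarker': 'hb', 'value': ''}         <- empty field comes back as '', not None
# {'biomarker': 'ferritin', 'value': 'NA'}  <- 'NA' is just a string
# A schema with {"type": ["number", "null"]} for "value" would reject both rows.
```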
Current problem:
The import script uses the Python csv library to read the CSV files into memory as Python dictionaries. It then runs the Python jsonschema library against those dictionaries to check validity. The problem comes with automatic data type conversion. From the csv module documentation:
Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
So, either all fields will be loaded as strings, or we need to ensure that string fields are quoted and non-string fields are not quoted.
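To make that concrete, here is a minimal example (column names made up): by default every field comes back as a string, and QUOTE_NONNUMERIC only helps if the files follow the quoting convention strictly.

```python
import csv
import io

raw = '"site","count"\n"A",3\n'

# Default behaviour: every field is a string, so a schema expecting a
# number for "count" would reject '3'.
print(next(csv.DictReader(io.StringIO(raw))))
# {'site': 'A', 'count': '3'}

# QUOTE_NONNUMERIC converts unquoted fields to floats on read, but any
# unquoted string field would raise ValueError, so string fields must
# always be quoted and numeric fields never quoted.
print(next(csv.DictReader(io.StringIO(raw), quoting=csv.QUOTE_NONNUMERIC)))
# {'site': 'A', 'count': 3.0}
```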
Alternatives:
1. Enforce a quoting convention in the CSV files (string fields quoted, numeric fields unquoted) and read them with QUOTE_NONNUMERIC.
2. Load all fields as strings, then cast them to the types declared in the JSON schema before validating.
I've decided to go for option 2. The idea is to load the CSV as strings, use the JSON schema to determine which attributes should have the "number" datatype, cast those attributes to floats, and then finally check the data against the JSON schema.
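A minimal sketch of what option 2 could look like (illustrative only, not the code in the MR; validate_against_json_schema() may be structured differently, and this assumes the schema describes each row as a flat object with a properties map):

```python
import csv
import json

from jsonschema import Draft7Validator


def check_csv_against_schema(csv_path, schema_path):
    """Load a CSV as strings, cast the columns the schema declares numeric,
    then collect validation errors per row."""
    with open(schema_path) as f:
        schema = json.load(f)

    # Properties whose declared type is "number" (or "integer").
    numeric_cols = {
        name for name, spec in schema.get("properties", {}).items()
        if spec.get("type") in ("number", "integer")
    }

    validator = Draft7Validator(schema)
    errors = []
    with open(csv_path, newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):  # line 1 is the header
            for col in numeric_cols.intersection(row):
                try:
                    row[col] = float(row[col])
                except ValueError:
                    pass  # leave it as a string; the schema check will report it
            for err in validator.iter_errors(row):
                errors.append((line_no, err.message))
    return errors
```

The appeal of this approach is that the type information lives in one place, the schema, rather than depending on how the CSV files happen to be quoted.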
Draft MR for me to continue working on next sprint: https://kwvmxgit.ad.nerc.ac.uk/bmgf-maps/data/db-test-data/-/merge_requests/80
We want to create a simple way for the scientists to check whether their CSV files can be imported into the database/system. JSON schemas look like a good way of doing this.