Open rbroth opened 1 year ago
A problem I've run into: JSON schemas are made to validate data in JSON format. Ergo, before we can validate data we need to load it and convert to JSON. The use of CSV presents something of a problem:
An empty field (",,") is read by some tools as an empty string, by others as null.
We have a first version of JSON schemas for data import. We also have an initial version of a function that checks a CSV file against the JSON schemas (validate_against_json_schema()). However, the function isn't currently being used in the import process.
Once the JSON schemas are being used, we can delete the view bad_biomarker_vals, which is currently used for validating biomarker data.
Using NA for null values could trip us up later; we need to ensure that there are none.
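A quick toy example of how the csv module behaves here (the column names are made up, this isn't from the import code): empty fields come back as empty strings and NA comes back as a literal string, so neither will match a schema that expects null.

```python
import csv
import io

# Toy CSV with one empty value and one "NA" value.
raw = "biomarker,value\nhb,\nferritin,NA\n"

for row in csv.DictReader(io.StringIO(raw)):
    print(row)

# {'biomarker': 'hb', 'value': ''}         <- empty field comes back as '', not None
# {'biomarker': 'ferritin', 'value': 'NA'}  <- 'NA' is just a string
# A schema with {"type": ["number", "null"]} for "value" would reject both rows.
```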
Current problem:
The import script uses the Python csv library to read the CSV files into memory as Python dictionaries. It then runs the Python jsonschema library against those dictionaries to check validity. The problem comes with automatic data type conversion. From the csv module documentation:
Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
So, either all fields will be loaded as strings, or we need to ensure that string fields are quoted and non-string fields are not quoted.
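To make that concrete, here is a minimal example (column names made up): by default every field comes back as a string, and QUOTE_NONNUMERIC only helps if the files follow the quoting convention strictly.

```python
import csv
import io

raw = '"site","count"\n"A",3\n'

# Default behaviour: every field is a string, so a schema expecting a
# number for "count" would reject '3'.
print(next(csv.DictReader(io.StringIO(raw))))
# {'site': 'A', 'count': '3'}

# QUOTE_NONNUMERIC converts unquoted fields to floats on read, but any
# unquoted string field would raise ValueError, so string fields must
# always be quoted and numeric fields never quoted.
print(next(csv.DictReader(io.StringIO(raw), quoting=csv.QUOTE_NONNUMERIC)))
# {'site': 'A', 'count': 3.0}
```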
Alternatives:
1. Enforce a quoting convention in the CSV files (string fields quoted, numeric fields unquoted) and read them with QUOTE_NONNUMERIC.
2. Load all fields as strings, then cast them to the types declared in the JSON schema before validating.
I've decided to go for option 2. The idea is to load the CSV as strings, use the JSON schema to determine which attributes should have the "number" datatype, cast those attributes to floats, and then finally check the data against the JSON schema.
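A minimal sketch of what option 2 could look like (illustrative only, not the code in the MR; validate_against_json_schema() may be structured differently, and this assumes the schema describes each row as a flat object with a properties map):

```python
import csv
import json

from jsonschema import Draft7Validator


def check_csv_against_schema(csv_path, schema_path):
    """Load a CSV as strings, cast the columns the schema declares numeric,
    then collect validation errors per row."""
    with open(schema_path) as f:
        schema = json.load(f)

    # Properties whose declared type is "number" (or "integer").
    numeric_cols = {
        name for name, spec in schema.get("properties", {}).items()
        if spec.get("type") in ("number", "integer")
    }

    validator = Draft7Validator(schema)
    errors = []
    with open(csv_path, newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):  # line 1 is the header
            for col in numeric_cols.intersection(row):
                try:
                    row[col] = float(row[col])
                except ValueError:
                    pass  # leave it as a string; the schema check will report it
            for err in validator.iter_errors(row):
                errors.append((line_no, err.message))
    return errors
```

The appeal of this approach is that the type information lives in one place, the schema, rather than depending on how the CSV files happen to be quoted.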
Draft MR for me to continue working on next sprint: https://kwvmxgit.ad.nerc.ac.uk/bmgf-maps/data/db-test-data/-/merge_requests/80
We want to create a simple way for the scientists to check whether their CSV files can be imported into the database/system. JSON schemas look like a good way of doing this.