psych-ds / psychds-validator

Validator tools for the psych-DS specification

Order of validation process needs to be optimized #61

Open bleonar5 opened 2 months ago

bleonar5 commented 2 months ago

In light of issue #43, it will probably be necessary to rework the validation process to bring it in line with whatever we decide design-wise for the UI output. At the moment, the validate function returns all of its output once processing is complete, so the ordering of component processes has been driven more by a desire for parsimony with respect to the YAML schema model.

Since processing time / memory usage is a looming concern as these integrations (Psych-DS / jsPsych) introduce the validator to larger and larger datasets, implementing an output system around an event emitter and ordering components more "chronologically" will be useful from a number of angles.
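(As a rough sketch of the event-emitter idea — all names below are hypothetical, not the validator's current API — the point is that validate() emits progress events as each component process completes, instead of collecting everything and returning it at the end:)

```ts
// Hypothetical event shapes; the real validator's types will differ.
type ValidationEvent =
  | { type: "progress"; step: string }
  | { type: "error"; step: string; message: string };

type Listener = (event: ValidationEvent) => void;

// Minimal hand-rolled emitter so the sketch is self-contained.
class ValidationEmitter {
  private listeners: Listener[] = [];
  on(listener: Listener): void {
    this.listeners.push(listener);
  }
  emit(event: ValidationEvent): void {
    for (const l of this.listeners) l(event);
  }
}

// A consumer (CLI or web UI) can then render output incrementally:
const emitter = new ValidationEmitter();
emitter.on((e) => {
  if (e.type === "progress") console.log(`[X] ${e.step}`);
  else console.log(`[!] ${e.step}: ${e.message}`);
});
emitter.emit({ type: "progress", step: "data/ subdirectory found" });
```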

I don't want to modify things too excessively or turn the validator function into too much of a spaghetto of discrete processes, so I'll keep that in mind during the design process.

When I begin working on this more, I'll be using this suggestion from Melissa as a reference point, with the goal of developing something more granular and comprehensive:

***********************************************************
This dataset does not appear to be psych-DS compatible
***********************************************************

* [X] Project folder ('/User/myname/whatever/MyProjectFolder/') has been crawled
* [X] Project folder contains data/ subdirectory
* [X] Project folder contains dataset_description.json metadata file
* [X] The dataset_description.json file was parsed!
* [X] The data/ subfolder contains at least one CSV file
* [!] No CSV files found with validated filename structures
* [ ] Parse and validate CSV files
* [ ] Compare CSV files and metadata
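(For concreteness, a minimal sketch of how a checklist like this might be rendered from per-step statuses — the Status type and step labels here are illustrative, not part of the validator:)

```ts
// Each UI step is either confirmed, failed, or not yet reached.
type Status = "passed" | "failed" | "pending";

const steps: [string, Status][] = [
  ["Project folder has been crawled", "passed"],
  ["Project folder contains data/ subdirectory", "passed"],
  ["No CSV files found with validated filename structures", "failed"],
  ["Parse and validate CSV files", "pending"],
];

// Map each status onto the [X] / [!] / [ ] markers from the mockup.
const marker = (s: Status) =>
  s === "passed" ? "[X]" : s === "failed" ? "[!]" : "[ ]";

for (const [label, status] of steps) {
  console.log(`* ${marker(status)} ${label}`);
}
```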
bleonar5 commented 2 months ago

Here's one of the main issues with reordering the validation logic: rather than looping through a sequence of rules and checking them in order against each file, our current system loops through all the files and checks each rule against them in turn. Another issue is that the UI we're looking for here is based on positive feedback (data_dir_found, metadata_found, etc) whereas the logic of the validator is based on negative feedback (data_dir_missing, metadata_missing).

I don't think there's a justification for reworking the whole logic of the app, so we'll probably come up with some way of either simulating the confirmation of these steps after the fact or inserting additional harmless little shims that redundantly confirm the presence of certain objects. (We already do this for one rule: we need to find and parse the metadata file manually, before the main rule-checking loop begins, in order to get the schema namespace and the variableMeasured field, which are required for evaluating certain rules.)

A possible rework that would be at least a little more involved would be to keep the negative feedback design but to rework the main loop so that it iterates through the ruleset instead of the filetree. That way, once we've checked the whole tree for a given rule and not produced an error, we can throw a piece of positive feedback. To make this work, we could add an "order" field to the rules in the psychds schema and use it to sort the ruleset before looping.
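(A sketch of what that rule-first loop might look like — the Rule and FileNode shapes are hypothetical, and the "order" field is the proposed schema addition, not something that exists yet:)

```ts
interface Rule {
  id: string;
  order: number; // proposed new field in the psychds schema
  check: (file: FileNode) => string | null; // error message, or null if ok
}

interface FileNode {
  path: string;
}

function runRules(rules: Rule[], tree: FileNode[], emit: (msg: string) => void) {
  // Sort the ruleset by the proposed "order" field before looping.
  const ordered = [...rules].sort((a, b) => a.order - b.order);
  for (const rule of ordered) {
    let failed = false;
    // Check the whole tree against this one rule before moving on.
    for (const file of tree) {
      const error = rule.check(file);
      if (error) {
        emit(`[!] ${rule.id}: ${error}`);
        failed = true;
      }
    }
    // Once the whole tree passes a rule, emit positive feedback for it.
    if (!failed) emit(`[X] ${rule.id}`);
  }
}
```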

mekline commented 2 months ago

Ah, I see. That makes sense and I agree we should not rework the whole app for this! I'm comfortable with this list not reflecting the actual execution of validation, so long as it reflects a conceptually reasonable order that the user should go about tackling issues. That is, they probably should figure out if they are in the right directory before they try to format their CSV; they probably should format their CSV before they try to see if the headers matche the variableMeasured list, etc..

What about this - can we categorize errors as indicating the failure of these steps, and then assign X or ! depending on what errors are returned? Or, semi-equivalently, keep a running tally of flags that get set when various conditions are met? (i.e., attempted to parse a CSV Y/N; successfully parsed it Y/N) Maybe that is what you meant by shims...?
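(A sketch of the running-tally idea, with hypothetical flag names: conditions flip to true as the validator observes them, and the UI derives its markers from the tally:)

```ts
// Hypothetical flags; the real set would mirror the UI checklist steps.
interface ProgressFlags {
  attemptedCsvParse: boolean;
  parsedCsv: boolean;
}

const flags: ProgressFlags = {
  attemptedCsvParse: false,
  parsedCsv: false,
};

// A step not yet attempted is [ ]; once attempted, it is [X] or [!].
function markerFor(attempted: boolean, succeeded: boolean): string {
  if (!attempted) return "[ ]";
  return succeeded ? "[X]" : "[!]";
}

// e.g. after the CSV parsing stage:
flags.attemptedCsvParse = true;
flags.parsedCsv = true;
console.log(markerFor(flags.attemptedCsvParse, flags.parsedCsv), "Parse and validate CSV files");
```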


bleonar5 commented 2 months ago

Current order:

bleonar5 commented 2 months ago

Progress markers that could conceivably be confirmed during the file loop, before it completes, since they are based on finding at least one valid element rather than on confirming the absence of any invalid element:

Progress markers that, in the current logic, must wait for the loop to complete:

bleonar5 commented 2 months ago

Here is my proposal for a minimally changed ordering of positive feedback markers. I'll err on the side of excessively informative:

We can think of this process as being partially, or "softly", chronological. Progress marker events will emit when the validation process reaches a point (we could call it a "gate" rather than a "shim") at which it must be true that the corresponding error will not be found further along in the process. This framing sounds convoluted for things like "dataset_description found" (i.e. once dataset_description is found, it is no longer possible to incur the DATASET_DESCRIPTION_MISSING error), but it is less trivial for things like "all CSV datafiles are valid", which can't be gated until the walkFileTree process finishes.

The display of these progress markers will be mediated by a meta-process that awaits them in order. So, it keeps a record of every event that occurs, but only immediately propagates a change in the UI if the event pertains to the next step in the UI sequence. Then, when progressing to the next step, it checks to see if the next event has already occurred, and otherwise awaits its completion.

If a gate is reached and the relevant error has already been found, the meta process (and maybe the validation process?) halts and displays the relevant error information in the space below the step in question.

(in-progress versus completed texts separated by /)

  • Check project folder / Project folder found
    • Crawl project folder and construct file tree / Project folder crawled and file tree constructed
  • Find metadata file / Metadata file "dataset_description.json" found in the root folder
    • Check metadata file for utf-8 encoding / Metadata file is utf-8 encoded
    • Parse metadata file as JSON / Metadata file parsed successfully
    • Check metadata file for required "name", "description", and "variableMeasured" fields / Metadata file contains required "name", "description", and "variableMeasured" fields.
    • Validate metadata file as JSON-LD / Metadata file is valid JSON-LD
    • Complete Schema.org type check / Metadata file passed Schema.org type check
  • Find "data" subfolder / "data" subfolder found in the root folder
  • Find a CSV data file in "data" subfolder / CSV data file found in "data" subfolder
    • Check filename for "_data" suffix / Filename uses "_data" suffix
    • Check filename for keyword formatting / Filename uses valid keyword formatting
    • Parse data file as CSV / Data file successfully parsed as CSV
    • Check for header line / Header line found
    • Confirm that all column headers are found in "variableMeasured" metadata field / All column headers were found in "variableMeasured" metadata field
    • Check all lines for equal number of cells / All lines have equal number of cells
    • Check for row_id column / (row_id column not found / row_id column found)
      • (if row_id found) Check row_id column for unique values / All values in row_id column are unique
    • Check all CSV data files / All CSV data files are valid
  • Check for metadata sidecar files / All metadata sidecar files are valid
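(A minimal sketch of that meta-process — step names and event plumbing here are hypothetical, not the validator's actual API. It records every event as it arrives, but only advances the UI when the next expected step has an event on record:)

```ts
// The fixed order in which the UI reveals steps (illustrative subset).
const uiSequence = [
  "PROJECT_FOLDER_FOUND",
  "METADATA_FILE_FOUND",
  "DATA_SUBFOLDER_FOUND",
  "ALL_CSV_FILES_VALID",
];

const received = new Map<string, { ok: boolean; detail?: string }>();
let nextIndex = 0;

function onValidationEvent(step: string, ok: boolean, detail?: string) {
  // Keep a record of every event, whatever order it arrives in.
  received.set(step, { ok, detail });
  // Flush every step that is now unblocked, in sequence.
  while (nextIndex < uiSequence.length && received.has(uiSequence[nextIndex])) {
    const expected = uiSequence[nextIndex];
    const result = received.get(expected)!;
    if (!result.ok) {
      // A gate was reached with the error already found: halt and show
      // the error information below the step in question.
      renderStep(expected, "[!]", result.detail);
      return;
    }
    renderStep(expected, "[X]");
    nextIndex++;
  }
}

function renderStep(step: string, marker: string, detail?: string) {
  console.log(`${marker} ${step}${detail ? `\n    ${detail}` : ""}`);
}
```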

bleonar5 commented 2 months ago

For the special "Check all CSV data files" and "Check all metadata sidecar files" steps, we can also emit "negative" events for conditions that preclude the completion of some downstream "positive" event. For instance, if we find the first CSV file and it successfully validates, but then we find something invalid about a file that is checked later in the process, we can emit a failure event at that point instead of waiting to reach the "success" gate to confirm the absence of CSV errors at the end of the process.

I guess this raises the question: if multiple invalid CSVs are found, do we reference all of them in our error output, or do we maintain the logic of only presenting one error at a time?
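(A sketch of the early-failure idea, with hypothetical event names; it also keeps every CSV error on record in case we decide to show them all rather than only the first:)

```ts
let csvFailureEmitted = false;
const csvErrors: string[] = [];

// Called as each CSV file is checked during the file tree walk.
function onCsvChecked(path: string, error: string | null, emit: (e: object) => void) {
  if (error) {
    csvErrors.push(`${path}: ${error}`);
    if (!csvFailureEmitted) {
      // First failure: fail the "All CSV data files are valid" step now,
      // instead of waiting for the end-of-walk success gate.
      emit({ type: "failure", step: "ALL_CSV_FILES_VALID", detail: `${path}: ${error}` });
      csvFailureEmitted = true;
    }
  }
}

// Called once walkFileTree finishes.
function onWalkComplete(emit: (e: object) => void) {
  if (!csvFailureEmitted) {
    emit({ type: "success", step: "ALL_CSV_FILES_VALID" });
  }
  // csvErrors retains every failure, should the UI list all invalid CSVs.
}
```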

mekline commented 2 months ago

Okay! This looks good to me, the "slash" notation is really helpful for understanding what's going on.

Specific notes on the proposed list above:

QUESTION: When does this first one fail? If the filepath is malformed or something else?

NEW: When all the subsidiary ones are true, give the message:

NEW: Proposing alternate wording, because I think the "at least one" framing is conceptually confusing as a milestone. Also, I would argue for factoring out the variableMeasured flag as its own message/milestone. I understand (I think) that those two processes will always resolve at the same time, i.e. once crawling the whole filetree is completed, and I think that's fine.

Then, these subsidiary ones determine whether to show the "All CSV files are valid" flag:

(Note, just adding the word "Optional" to this last one to avoid confusing people)


bleonar5 commented 2 months ago

Sounds good, I can integrate these changes. To answer your question, the first check fails if the dataset argument is not a valid directory or if nothing is found at the specified path.
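(For reference, a sketch of what that first check amounts to, assuming a Deno runtime as the validator uses; the function name and exact error handling are illustrative:)

```ts
async function checkProjectFolder(path: string): Promise<boolean> {
  try {
    const info = await Deno.stat(path);
    return info.isDirectory; // fails if the path points to a file, not a directory
  } catch {
    return false; // fails if nothing is found at the specified path
  }
}
```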