psych-ds / psychds-validator

Validator tools for the psych-DS specification

Order of validation process needs to be optimized #61

Open bleonar5 opened 2 months ago

bleonar5 commented 2 months ago

In light of issue #43, it will probably be necessary to rework the validation process to bring it in line with whatever we decide design-wise for the UI output. At the moment, the validate function returns all of its output once processing is complete, so the ordering of component processes has been driven more by a desire for parsimony with respect to the YAML schema model.

Since processing time / memory usage is a looming concern as these integrations (Psych-DS / jsPsych) introduce the validator to larger and larger datasets, implementing an output system around an event emitter and ordering components more "chronologically" will be useful from a number of angles.
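(As a rough sketch of the event-emitter idea — all names below are hypothetical, not the validator's current API — the point is that validate() emits progress events as each component process completes, instead of collecting everything and returning it at the end:)

```ts
// Hypothetical event shapes; the real validator's types will differ.
type ValidationEvent =
  | { type: "progress"; step: string }
  | { type: "error"; step: string; message: string };

type Listener = (event: ValidationEvent) => void;

// Minimal hand-rolled emitter so the sketch is self-contained.
class ValidationEmitter {
  private listeners: Listener[] = [];
  on(listener: Listener): void {
    this.listeners.push(listener);
  }
  emit(event: ValidationEvent): void {
    for (const l of this.listeners) l(event);
  }
}

// A consumer (CLI or web UI) can then render output incrementally:
const emitter = new ValidationEmitter();
emitter.on((e) => {
  if (e.type === "progress") console.log(`[X] ${e.step}`);
  else console.log(`[!] ${e.step}: ${e.message}`);
});
emitter.emit({ type: "progress", step: "data/ subdirectory found" });
```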

I don't want to modify things too excessively or turn the validator function into too much of a spaghetto of discrete processes, so I'll keep that in mind during the design process.

When I begin working on this more, I'll be using this suggestion from Melissa as a reference point, with the goal of developing something more granular and comprehensive:

***********************************************************
This dataset does not appear to be psych-DS compatible
***********************************************************

* [X] Project folder ('/User/myname/whatever/MyProjectFolder/') has been crawled
* [X] Project folder contains data/ subdirectory
* [X] Project folder contains dataset_description.json metadata file
* [X] The dataset_description.json file was parsed!
* [X] The data/ subfolder contains at least one CSV file
* [!] No CSV files found with validated filename structures
* [ ] Parse and validate CSV files
* [ ] Compare CSV files and metadata
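(For concreteness, a minimal sketch of how a checklist like this might be rendered from per-step statuses — the Status type and step labels here are illustrative, not part of the validator:)

```ts
// Each UI step is either confirmed, failed, or not yet reached.
type Status = "passed" | "failed" | "pending";

const steps: [string, Status][] = [
  ["Project folder has been crawled", "passed"],
  ["Project folder contains data/ subdirectory", "passed"],
  ["No CSV files found with validated filename structures", "failed"],
  ["Parse and validate CSV files", "pending"],
];

// Map each status onto the [X] / [!] / [ ] markers from the mockup.
const marker = (s: Status) =>
  s === "passed" ? "[X]" : s === "failed" ? "[!]" : "[ ]";

for (const [label, status] of steps) {
  console.log(`* ${marker(status)} ${label}`);
}
```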
bleonar5 commented 2 months ago

Here's one of the main issues with reordering the validation logic: rather than looping through a sequence of rules and checking them in order against each file, our current system loops through all the files and checks each rule against them in turn. Another issue is that the UI we're looking for here is based on positive feedback (data_dir_found, metadata_found, etc) whereas the logic of the validator is based on negative feedback (data_dir_missing, metadata_missing).

I don't think there's a justification for reworking the whole logic of the app, so we'll probably come up with some way of either simulating the confirmation of these steps after the fact or inserting additional harmless little shims that redundantly confirm the presence of certain objects. (We already do this for one rule: we need to find and parse the metadata file manually, before the main rule-checking loop begins, in order to get the schema namespace and the variableMeasured field, which are required for evaluating certain rules.)

A possible rework that would be at least a little more involved would be to keep the negative feedback design but to rework the main loop so that it iterates through the ruleset instead of the filetree. That way, once we've checked the whole tree for a given rule and not produced an error, we can throw a piece of positive feedback. To make this work, we could add an "order" field to the rules in the psychds schema and use it to sort the ruleset before looping.
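(A sketch of what that rule-first loop might look like — the Rule and FileNode shapes are hypothetical, and the "order" field is the proposed schema addition, not something that exists yet:)

```ts
interface Rule {
  id: string;
  order: number; // proposed new field in the psychds schema
  check: (file: FileNode) => string | null; // error message, or null if ok
}

interface FileNode {
  path: string;
}

function runRules(rules: Rule[], tree: FileNode[], emit: (msg: string) => void) {
  // Sort the ruleset by the proposed "order" field before looping.
  const ordered = [...rules].sort((a, b) => a.order - b.order);
  for (const rule of ordered) {
    let failed = false;
    // Check the whole tree against this one rule before moving on.
    for (const file of tree) {
      const error = rule.check(file);
      if (error) {
        emit(`[!] ${rule.id}: ${error}`);
        failed = true;
      }
    }
    // Once the whole tree passes a rule, emit positive feedback for it.
    if (!failed) emit(`[X] ${rule.id}`);
  }
}
```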

mekline commented 2 months ago

Ah, I see. That makes sense and I agree we should not rework the whole app for this! I'm comfortable with this list not reflecting the actual execution of validation, so long as it reflects a conceptually reasonable order that the user should go about tackling issues. That is, they probably should figure out if they are in the right directory before they try to format their CSV; they probably should format their CSV before they try to see if the headers matche the variableMeasured list, etc..

What about this - can we categorize errors as indicating the failure of these steps, and then assign X or ! depending on what errors are returned? Or, semi-equivalently, keep a running tally of flags that get set when various conditions are met? (i.e., attempted to parse a CSV Y/N; successfully parsed it Y/N) Maybe that is what you meant by shims...?
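(A sketch of the running-tally idea, with hypothetical flag names: conditions flip to true as the validator observes them, and the UI derives its markers from the tally:)

```ts
// Hypothetical flags; the real set would mirror the UI checklist steps.
interface ProgressFlags {
  attemptedCsvParse: boolean;
  parsedCsv: boolean;
}

const flags: ProgressFlags = {
  attemptedCsvParse: false,
  parsedCsv: false,
};

// A step not yet attempted is [ ]; once attempted, it is [X] or [!].
function markerFor(attempted: boolean, succeeded: boolean): string {
  if (!attempted) return "[ ]";
  return succeeded ? "[X]" : "[!]";
}

// e.g. after the CSV parsing stage:
flags.attemptedCsvParse = true;
flags.parsedCsv = true;
console.log(markerFor(flags.attemptedCsvParse, flags.parsedCsv), "Parse and validate CSV files");
```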


bleonar5 commented 2 months ago

Current order:

bleonar5 commented 2 months ago

Progress markers that could conceivably be confirmed during the file loop, before it completes, since they are based on finding at least one valid element rather than on confirming the absence of any invalid element:

Progress markers that, in the current logic, must wait for the loop to complete:

bleonar5 commented 2 months ago

Here is my proposal for a minimally changed ordering of positive feedback markers. I'll err on the side of excessively informative:

We can think of this process as being partially, or "softly", chronological. Progress marker events will emit when the validation process reaches a point (we could call it a "gate" rather than a "shim") at which it must be true that the corresponding error will not be found further along in the process. This framing sounds convoluted for things like "dataset_description found" (i.e. once dataset_description is found, it is no longer possible to incur the DATASET_DESCRIPTION_MISSING error), but it is less trivial for things like "all CSV datafiles are valid", which can't be gated until the walkFileTree process finishes.

The display of these progress markers will be mediated by a meta-process that awaits them in order. So, it keeps a record of every event that occurs, but only immediately propagates a change in the UI if the event pertains to the next step in the UI sequence. Then, when progressing to the next step, it checks to see if the next event has already occurred, and otherwise awaits its completion.

If a gate is reached and the relevant error has already been found, the meta process (and maybe the validation process?) halts and displays the relevant error information in the space below the step in question.

(in-progress versus completed texts separated by /)

  • Check project folder / Project folder found
    • Crawl project folder and construct file tree / Project folder crawled and file tree constructed
  • Find metadata file / Metadata file "dataset_description.json" found in the root folder
    • Check metadata file for utf-8 encoding / Metadata file is utf-8 encoded
    • Parse metadata file as JSON / Metadata file parsed successfully
    • Check metadata file for required "name", "description", and "variableMeasured" fields / Metadata file contains required "name", "description", and "variableMeasured" fields.
    • Validate metadata file as JSON-LD / Metadata file is valid JSON-LD
    • Complete Schema.org type check / Metadata file passed Schema.org type check
  • Find "data" subfolder / "data" subfolder found in the root folder
  • Find a CSV data file in "data" subfolder / CSV data file found in "data" subfolder
    • Check filename for "_data" suffix / Filename uses "_data" suffix
    • Check filename for keyword formatting / Filename uses valid keyword formatting
    • Parse data file as CSV / Data file successfully parsed as CSV
    • Check for header line / Header line found
    • Confirm that all column headers are found in "variableMeasured" metadata field / All column headers were found in "variableMeasured" metadata field
    • Check all lines for equal number of cells / All lines have equal number of cells
    • Check for row_id column / (row_id column not found / row_id column found)
      • (if row_id found) Check row_id column for unique values / All values in row_id column are unique
    • Check all CSV data files / All CSV data files are valid
  • Check for metadata sidecar files / All metadata sidecar files are valid
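(A minimal sketch of that meta-process — step names and event plumbing here are hypothetical, not the validator's actual API. It records every event as it arrives, but only advances the UI when the next expected step has an event on record:)

```ts
// The fixed order in which the UI reveals steps (illustrative subset).
const uiSequence = [
  "PROJECT_FOLDER_FOUND",
  "METADATA_FILE_FOUND",
  "DATA_SUBFOLDER_FOUND",
  "ALL_CSV_FILES_VALID",
];

const received = new Map<string, { ok: boolean; detail?: string }>();
let nextIndex = 0;

function onValidationEvent(step: string, ok: boolean, detail?: string) {
  // Keep a record of every event, whatever order it arrives in.
  received.set(step, { ok, detail });
  // Flush every step that is now unblocked, in sequence.
  while (nextIndex < uiSequence.length && received.has(uiSequence[nextIndex])) {
    const expected = uiSequence[nextIndex];
    const result = received.get(expected)!;
    if (!result.ok) {
      // A gate was reached with the error already found: halt and show
      // the error information below the step in question.
      renderStep(expected, "[!]", result.detail);
      return;
    }
    renderStep(expected, "[X]");
    nextIndex++;
  }
}

function renderStep(step: string, marker: string, detail?: string) {
  console.log(`${marker} ${step}${detail ? `\n    ${detail}` : ""}`);
}
```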

bleonar5 commented 2 months ago

For the special "Check all CSV data files" and "Check all metadata sidecar files" steps, we can also emit "negative" events for conditions that preclude the completion of some downstream "positive" event. For instance, if we find the first CSV file and it successfully validates, but then we find something invalid about a file that is checked later in the process, we can emit a failure event at that point instead of waiting to reach the "success" gate to confirm the absence of CSV errors at the end of the process.

I guess this raises the question: if multiple invalid CSVs are found, do we reference all of them in our error output, or do we maintain the logic of only presenting one error at a time?
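(A sketch of the early-failure idea, with hypothetical event names; it also keeps every CSV error on record in case we decide to show them all rather than only the first:)

```ts
let csvFailureEmitted = false;
const csvErrors: string[] = [];

// Called as each CSV file is checked during the file tree walk.
function onCsvChecked(path: string, error: string | null, emit: (e: object) => void) {
  if (error) {
    csvErrors.push(`${path}: ${error}`);
    if (!csvFailureEmitted) {
      // First failure: fail the "All CSV data files are valid" step now,
      // instead of waiting for the end-of-walk success gate.
      emit({ type: "failure", step: "ALL_CSV_FILES_VALID", detail: `${path}: ${error}` });
      csvFailureEmitted = true;
    }
  }
}

// Called once walkFileTree finishes.
function onWalkComplete(emit: (e: object) => void) {
  if (!csvFailureEmitted) {
    emit({ type: "success", step: "ALL_CSV_FILES_VALID" });
  }
  // csvErrors retains every failure, should the UI list all invalid CSVs.
}
```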

mekline commented 2 months ago

Okay! This looks good to me, the "slash" notation is really helpful for understanding what's going on.

Specific notes on the proposed list above:

QUESTION: When does this first one fail? If the filepath is malformed or something else?

NEW: When all the subsidiary ones are true, give the message:

NEW: Proposing alternate wording, because I think the "at least one" framing is conceptually confusing as a milestone. Also, I would argue for factoring out the variableMeasured flag as its own message/milestone. I understand (I think) that those two processes will always resolve at the same time, i.e. once crawling the whole filetree is completed, and I think that's fine.

Then, these subsidiary ones determine whether to show the "All CSV files are valid" flag:

(Note, just adding the word "Optional" to this last one to avoid confusing people)


bleonar5 commented 2 months ago

Sounds good, I can integrate these changes. To answer your question, the first check fails if the dataset argument is not a valid directory or if nothing is found at the specified path.
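(For reference, a sketch of what that first check amounts to, assuming a Deno runtime as the validator uses; the function name and exact error handling are illustrative:)

```ts
async function checkProjectFolder(path: string): Promise<boolean> {
  try {
    const info = await Deno.stat(path);
    return info.isDirectory; // fails if the path points to a file, not a directory
  } catch {
    return false; // fails if nothing is found at the specified path
  }
}
```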