hed-standard / hed-python

Python validation, summary, and analysis tools for HED (Hierarchical Event Descriptors).
https://hed-python.readthedocs.io/en/latest/
MIT License

seems to validate non-HED columns #1031

Open yarikoptic opened 2 days ago

yarikoptic commented 2 days ago

I had a run

❯ hed-validator -o logs/hed-validator.log .
Using HEDTOOLS version: {'version': '0+untagged.2394.g2159297', 'full-revisionid': '215929781a603d0c097dca8c38246acffb313d09', 'dirty': False, 'error': None, 'date': '2024-10-11T13:35:47-0500'}
Number of issues: 1785282
hed-validator -o logs/hed-validator.log .  390.58s user 4.47s system 100% cpu 6:34.92 total

with the full log at http://www.oneukrainian.com/tmp/hed-validator-20241011-1.log.gz in case someone has a boring weekend. But many of the errors are of the form

Errors in file 'sub-0001_ses-04_task-fractional_acq-mb8_run-01_events.tsv'
        Issues in row 2:
                Issues in column duration:
                        hed string: 11.0
                                CHARACTER_INVALID: Invalid character '.' in tag '11.0'  Problem spans string indexes: 2, 3
                                TAG_INVALID: '11.0' in 11.0 is not a valid base HED tag.  Problem spans string indexes: 0, 4
                Issues in column onset:
                        hed string: 12.0532553100586
                                CHARACTER_INVALID: Invalid character '.' in tag '12.0532553100586'  Problem spans string indexes: 2, 3
                                TAG_INVALID: '12.0532553100586' in 12.0532553100586 is not a valid base HED tag.  Problem spans string indexes: 0, 16

whenever we have

❯ head -n 20 task-fractional_events.json
{
  "onset": {
    "LongName": "Onset time of event",
    "Description": "Marks the start of an ongoing event of temporal extent.",
    "Units": "s",
    "HED": "Property/Data-property/Data-marker/Temporal-marker/Onset"
  },
  "duration": {
    "LongName": "The period of time during which an event occurs. Refers to Image duration or response time after stimulus depending on event_type",
    "Description": "a. For falsebelief and falsephoto trial types, duration refers to the image presentations of falsebelief and falsephoto stories. b. For rating_falsebelief and rating_falsephoto, duration refers to the response time to answer true false questions, followed by falsebelief or flasephoto stimulu",
    "Units": "s",
    "HED": "Property/Data-property/Data-value/Spatiotemporal-value/Temporal-value/Duration"
  },
...

Have we used HED incorrectly to provide semantics for those columns?

attn @jungheejung

VisLab commented 2 days ago

@yarikoptic @jungheejung -- sorry I didn't see this issue until just now -- if you continue to have issues, please re-post. Thx.

Several things.
1) The onset column should not have HED in it at all. Onset is treated as a special column and is not annotated.
2) Duration requires a value. Annotate it as Duration/# in the sidecar, which gives one annotation that applies to the entire column. The # is replaced by the actual column value when the annotations are assembled. You also need to state, in parentheses, what the duration is a duration of.
3) Please use the short forms of tags.
4) As a recommended strategy, validate the sidecar (usually there is only one per dataset) using the online tools at https://hedtools.org/hed_dev/sidecar before trying to validate your dataset.
5) The error above is for the tsv file, which you didn't include, so I can't be sure that this will be the only error.
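
Points 1 and 2 can be checked mechanically before running the validator. Here is a minimal sketch in plain Python. It is not part of hedtools: `lint_sidecar` is a hypothetical helper, and it assumes that a value column is one whose "HED" entry is a single string (rather than a per-value dictionary) and that such a string should contain exactly one #. The sidecar below is the one from the report, abbreviated to the relevant keys.

```python
def lint_sidecar(sidecar):
    """Return a list of problems for points 1 and 2 (hypothetical helper, not hedtools)."""
    problems = []
    for column, meta in sidecar.items():
        if not isinstance(meta, dict) or "HED" not in meta:
            continue  # column has no HED annotation, nothing to check
        hed = meta["HED"]
        if column == "onset":
            # Point 1: onset is a special column and should not be annotated.
            problems.append("onset: should not carry a HED annotation")
        elif isinstance(hed, str) and hed.count("#") != 1:
            # Point 2: a single-string annotation applies to the whole column,
            # so it needs a '#' placeholder for the column value (assumption:
            # exactly one).
            problems.append(f"{column}: value-column annotation needs exactly one '#'")
    return problems


# Sidecar from the report, abbreviated (LongName/Description omitted).
reported = {
    "onset": {
        "Units": "s",
        "HED": "Property/Data-property/Data-marker/Temporal-marker/Onset",
    },
    "duration": {
        "Units": "s",
        "HED": "Property/Data-property/Data-value/Spatiotemporal-value/Temporal-value/Duration",
    },
}

for problem in lint_sidecar(reported):
    print(problem)
```

This only covers the two structural points. It does not check tag validity or short versus long forms, which still need the real validator.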

The corrected form:

{
  "onset": {
    "LongName": "Onset time of event",
    "Description": "Marks the start of an ongoing event of temporal extent.",
    "Units": "s"
  },
  "duration": {
    "LongName": "The period of time during which an event occurs. Refers to Image duration or response time after stimulus depending on event_type",
    "Description": "a. For falsebelief and falsephoto trial types, duration refers to the image presentations of falsebelief and falsephoto stories. b. For rating_falsebelief and rating_falsephoto, duration refers to the response time to answer true false questions, followed by falsebelief or flasephoto stimulu",
    "Units": "s",
    "HED": "(Duration/#, (Label/Entire-event-time))"
  }
}
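
To illustrate how the # placeholder gets filled in, here is a rough sketch in plain Python of assembling the annotation for one row of the events file. This is an illustration under simplifying assumptions, not the hedtools implementation: `assemble_row` is a hypothetical name, the sidecar is abbreviated, and the row values are taken from the error log above.

```python
def assemble_row(sidecar, row):
    """Build one HED string for a row (illustrative only, not the hedtools code)."""
    parts = []
    for column, value in row.items():
        if column == "onset":
            continue  # onset is a special column and is never annotated
        hed = sidecar.get(column, {}).get("HED")
        if isinstance(hed, str):
            # Value column: one annotation for the whole column,
            # with '#' replaced by the actual cell value.
            parts.append(hed.replace("#", str(value)))
        elif isinstance(hed, dict) and value in hed:
            # Categorical column: look up the annotation for this value.
            parts.append(hed[value])
    return ", ".join(parts)


# Corrected sidecar, abbreviated to the keys that matter here.
sidecar = {
    "onset": {"Units": "s"},
    "duration": {"Units": "s", "HED": "(Duration/#, (Label/Entire-event-time))"},
}

# Row 2 of the events file, using the values shown in the log.
row = {"onset": "12.0532553100586", "duration": "11.0"}

print(assemble_row(sidecar, row))
# (Duration/11.0, (Label/Entire-event-time))
```

With the annotation in this form, the value 11.0 ends up inside the Duration tag rather than being treated as a tag of its own.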

Note: I think we could annotate this more precisely using the curly brace notation. If you respond with the entire JSON file, I would be happy to suggest modifications.