hed-standard / hed-schemas

Repository for HED schemas. The SCORE library for clinical annotations is now available.
https://hed-schemas.readthedocs.io/en/latest/
Creative Commons Attribution 4.0 International

ANSI encoding does not have necessary characters for German or French stimuli #103

Closed monique2208 closed 6 months ago

monique2208 commented 1 year ago

When presenting words, I would like to add the full word to the HED string using Word, Label/# so that the full stimulus can also be found easily. But since HED uses ANSI encoding, we run into an issue with the German and French datasets. The problem probably extends to many other languages as well. It would be good if HED accepted more characters.
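A minimal sketch of the substitution described above, assuming the stimulus word simply replaces the # placeholder. The helper `fill_placeholder` and the sample words are hypothetical and not part of HEDTools; the ASCII check only illustrates which values would be rejected under an ASCII-only rule.

```python
def fill_placeholder(template: str, value: str) -> str:
    """Substitute a value for the single '#' placeholder in a HED template."""
    if template.count("#") != 1:
        raise ValueError("template must contain exactly one '#' placeholder")
    return template.replace("#", value)


# Hypothetical stimulus words from German and French datasets.
words = ["Straße", "garçon", "Haus"]

for word in words:
    hed_string = fill_placeholder("Word, Label/#", word)
    # str.isascii() is False for any word with characters outside ASCII.
    status = "ASCII-only" if word.isascii() else "needs more than ASCII"
    print(f"{hed_string}: {status}")
```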

VisLab commented 9 months ago

After the HWG discussion, I checked: BIDS supports UTF-8 for tabular files and JSON, so we will have to look into what needs to change long-term to support this in the HED tools. @IanCa @smakeig @happy5214

@monique2208 is UTF-8 sufficient for the language schema?

monique2208 commented 9 months ago

Yes, even Chinese characters can be encoded in utf-8, so we shouldn't run into any more issues in the future
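As a quick illustration of that point, the sketch below round-trips a few strings through UTF-8 and prints how many bytes each one needs. The sample strings are arbitrary choices for illustration.

```python
samples = ["Straße", "garçon", "中文", "€"]

for text in samples:
    encoded = text.encode("utf-8")
    # Decoding the bytes must give back the original string unchanged.
    assert encoded.decode("utf-8") == text
    print(f"{text!r}: {len(text)} characters, {len(encoded)} bytes in UTF-8")
```

The character count and the byte count differ for non-ASCII text, which matters for any code that assumes one byte per character.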

VisLab commented 9 months ago

UTF-8 can only be used for values, not for the names of HED tags. Will that be a problem?

monique2208 commented 9 months ago

No, that will be enough.

VisLab commented 9 months ago

@monique2208 Can you please upload a sample tsv file (events-type) and JSON sidecar that liberally use utf-8? This will be very helpful to us in assessing what we need to change to handle these.
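For anyone preparing such a sample, here is a minimal sketch of writing and reading back an events-style TSV file and a JSON sidecar with explicit UTF-8 handling. The file names, column names, and sidecar contents are hypothetical; only the standard library is used.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical events rows and sidecar; column names are illustrative only.
rows = [
    {"onset": "0.5", "duration": "1.0", "word": "Straße"},
    {"onset": "2.0", "duration": "1.0", "word": "garçon"},
]
sidecar = {"word": {"Description": "Präsentiertes Wort", "HED": "(Word, Label/#)"}}

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "sample_events.tsv"
    json_path = Path(tmp) / "sample_events.json"

    # Name the encoding explicitly: the platform default is not always UTF-8.
    with open(tsv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["onset", "duration", "word"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)

    # ensure_ascii=False keeps characters readable instead of \uXXXX escapes.
    json_path.write_text(
        json.dumps(sidecar, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    with open(tsv_path, encoding="utf-8", newline="") as f:
        read_back = [dict(row) for row in csv.DictReader(f, delimiter="\t")]
    sidecar_back = json.loads(json_path.read_text(encoding="utf-8"))

print(read_back == rows, sidecar_back == sidecar)
```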

VisLab commented 9 months ago

Based on this discussion, we are revising the HED specification, the HEDTools, and the HED JavaScript validator to accommodate UTF-8 values in .tsv columns and as substitutions for the # placeholder for tags that take a value in certain value classes.

IanCa commented 9 months ago

Where can utf8 appear?
Sidecar category keys: Yes
Sidecar category values: Yes (in tag values that allow)
Sidecar value columns: Yes (in tag values that allow)

TSV tag columns: Yes (in tag values that allow)
TSV category columns: Yes (if matches a category key)
TSV value columns: Yes (again, if allowed)

Schema terms: Never
Schema descriptions: Probably?
Other schema section terms: Probably not (Units, unit classes, value classes, modifiers, attributes, properties)
Prologue/Epilogue: Probably

@monique2208 @happy5214 @VisLab

Does anyone have any suggestions or corrections to the above?

happy5214 commented 9 months ago

One of my favorite examples of an edge case is how to handle non-USD currencies. How would you represent a currency sign like the euro (€), pound (₤), yen/yuan (¥), or won (₩)?

IanCa commented 8 months ago

So if we want to support currencies, a UTF8 symbol could appear in the following places in schemas: Prologue, Epilogue, Descriptions, and units.

I have no implementation concerns for it, it's just a question if we want to support it.

happy5214 commented 8 months ago

There is a unit modifier that is formally written with a non-ASCII character (micro-, written with a Greek lowercase mu). Allowing UTF-8 in modifiers could permit µ to be used as an alias to u.
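One wrinkle with this alias is that Unicode has two look-alike characters: MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC). The sketch below shows that NFKC normalization maps the first onto the second, so a single alias entry can cover both. The alias table and `canonical_modifier` are hypothetical and not part of any HED tool.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # MICRO SIGN
GREEK_MU = "\u03bc"    # GREEK SMALL LETTER MU

# The two look alike but are different code points.
print(MICRO_SIGN == GREEK_MU)

# NFKC normalization maps the micro sign onto the Greek letter.
print(unicodedata.normalize("NFKC", MICRO_SIGN) == GREEK_MU)

# Hypothetical alias table: the normalized symbol resolves to ASCII "u".
MODIFIER_ALIASES = {GREEK_MU: "u"}


def canonical_modifier(symbol: str) -> str:
    """Normalize a unit modifier symbol, then apply any alias."""
    normalized = unicodedata.normalize("NFKC", symbol)
    return MODIFIER_ALIASES.get(normalized, normalized)


print(canonical_modifier(MICRO_SIGN), canonical_modifier(GREEK_MU))
```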

IanCa commented 8 months ago

@VisLab

Just let me know if we want to support units/unit modifiers. I think the rest is pretty settled if we've fully determined we aren't supporting schema terms with UTF8.

VisLab commented 8 months ago

Units and unit modifiers should support UTF8. Value classes as well.

IanCa commented 8 months ago

Are column names (including in sidecars) allowed to include utf8? I'm assuming not.

Some of the ref code will need to be modified to support that if so.

Where can utf8 appear?
Sidecar column names: Probably not
Sidecar category keys: Yes
Sidecar category values: Yes (in tag values that allow)
Sidecar value columns: Yes (in tag values that allow)

TSV tag columns: Yes (in tag values that allow)
TSV category columns: Yes (if matches a category key)
TSV value columns: Yes (again, if allowed)

Schema terms: Never
Schema descriptions: Probably
Other schema section terms: No (unit classes, value classes, attributes, properties); Yes (Units, unit modifiers)
Prologue/Epilogue: Probably

VisLab commented 8 months ago

Sidecar column names correspond to TSV column names and might be utf-8

IanCa commented 8 months ago

Okay, so to summarize: the only places where we will NOT allow UTF8 are schema terms, unit classes, value classes, properties, and attributes.

VisLab commented 8 months ago

Yes, I think so.

happy5214 commented 8 months ago

Another edge case: how would you tag an image of a glass of piña colada (a randomly chosen object with an accented character that could be a tag extension)?

IanCa commented 8 months ago

We would use something like the (Word, Label/#) structure she suggested before, under the current guidelines.

If we allow extensions to be UTF8, it might be odd if we don't allow utf8 schema terms as well.

Programmatically all these options seem about equally easy to implement to me, so no concerns there.

IanCa commented 8 months ago

Per email from Kay, we are allowing UTF8 in schema terms.

So this leaves only the following as never allowing UTF8:
Other schema section terms: No (unit classes, value classes, attributes, properties)

happy5214 commented 8 months ago

So this leaves only the following as never allowing UTF8:
Other schema section terms: No (unit classes, value classes, attributes, properties)

I mentioned this idea in https://github.com/hed-standard/hed-specification/pull/569#issuecomment-1988949016, but it may be more relevant here.

I think there's a fundamental distinction in the schema between definitions that actually appear in HED data (tags, units, unit modifiers) and "meta-definitions", or those that describe other definitions in the schema (schema attributes, unit classes, value classes, and properties). The latter types are effectively constant keywords[...]

The "meta-definition" names are not going to be used directly by taggers (their primary role is to serve as structure/metadata for the schema), so UTF-8 is unnecessary.
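The distinction could be expressed as a simple per-category character rule. This is only a sketch of the policy as summarized in this thread: the category names and `name_is_acceptable` are hypothetical, and real validation would apply further naming rules beyond the ASCII check.

```python
# Hypothetical category names, following the summary in this thread.
UTF8_ALLOWED = {"tag", "unit", "unitModifier"}
ASCII_ONLY = {"unitClass", "valueClass", "schemaAttribute", "property"}


def name_is_acceptable(category: str, name: str) -> bool:
    """Return True if `name` satisfies the character rule for its category."""
    if category in ASCII_ONLY:
        return name.isascii()
    if category in UTF8_ALLOWED:
        return True  # any Unicode string; other naming rules are out of scope
    raise ValueError(f"unknown category: {category}")


print(name_is_acceptable("tag", "Piña-colada"))
print(name_is_acceptable("valueClass", "nameClass"))
print(name_is_acceptable("unitClass", "währung"))
```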

IanCa commented 8 months ago

We'd have to decide how we want to compare foreign characters without case if we're allowing them as extensions or tag terms.

Python has .casefold(), which is a "more aggressive" version of .lower().

In my quick testing, .lower()/.upper() now handle most non-ASCII characters (umlauts and such). The following is an example where the results differ:

("ß", "ß".lower(), "ß".casefold())

Returns: ('ß', 'ß', 'ss')

happy5214 commented 8 months ago

For that particular character, casefold is definitely required, as Switzerland and Liechtenstein use "ss" instead of "ß", while other German dialects continue to use the older ligature.
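A minimal sketch of what a caseless comparison built on casefold() would look like; `same_term` is a hypothetical helper, and Unicode normalization (for example composed versus decomposed accents) is deliberately left out.

```python
def same_term(a: str, b: str) -> bool:
    """Caseless comparison using casefold() rather than lower()."""
    return a.casefold() == b.casefold()


# lower() leaves 'ß' unchanged, so the two spellings do not match.
print("Straße".lower() == "STRASSE".lower())

# casefold() maps 'ß' to 'ss', so both spellings compare equal.
print(same_term("Straße", "STRASSE"))
print(same_term("Straße", "Strasse"))
```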

VisLab commented 6 months ago

utf-8 is now allowed. Both the JavaScript and Python validators that are about to be released support it.