Closed monique2208 closed 6 months ago
After the HWG discussion, I checked: BIDS supports UTF-8 for tabular files and JSON, so we are going to have to look into what needs to be changed long-term to support this in the HED tools. @IanCa @smakeig @happy5214
@monique2208 is UTF-8 sufficient for the language schema?
Yes, even Chinese characters can be encoded in utf-8, so we shouldn't run into any more issues in the future
UTF-8 can only be used for values, not for the names of HED tags. Will that be a problem?
No that will be enough
@monique2208 Can you please upload a sample tsv file (events-type) and JSON sidecar that liberally use utf-8? This will be very helpful to us in assessing what we need to change to handle these.
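As a starting point for that request, here is a minimal sketch of what such a pair of files could contain and how to round-trip them explicitly as UTF-8 in Python. The column name, the words, and the sidecar text are made up for illustration and do not come from a real dataset; the annotation reuses the Word, Label/# pattern from the original request.

```python
import csv
import io
import json

# Hypothetical events-type TSV content with non-ASCII values in one column.
tsv_text = "onset\tduration\tword\n1.0\t0.5\tStraße\n2.5\t0.5\tpiña\n"

# Hypothetical JSON sidecar annotating the "word" value column.
sidecar = {"word": {"Description": "Präsentiertes Wort", "HED": "(Word, Label/#)"}}

# Encode explicitly as UTF-8, as the files would be stored on disk.
tsv_bytes = tsv_text.encode("utf-8")
# ensure_ascii=False keeps characters such as "ä" as-is instead of \u escapes.
sidecar_bytes = json.dumps(sidecar, ensure_ascii=False).encode("utf-8")

# Decode explicitly as UTF-8 rather than relying on the platform default.
rows = list(csv.DictReader(io.StringIO(tsv_bytes.decode("utf-8")), delimiter="\t"))
loaded_sidecar = json.loads(sidecar_bytes.decode("utf-8"))

words = [row["word"] for row in rows]
print(words)
```

When reading real files, passing encoding="utf-8" to open() serves the same purpose as the explicit decode calls here.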
Based on this discussion, we are revising the HED specification, the HEDTools, and the HED JavaScript validator to accommodate UTF-8 values in .tsv columns and as substitutions for the # placeholder for tags that take a value in certain value classes.
Where can utf8 appear?
Sidecar category keys: Yes
Sidecar category values: Yes (in tag values that allow)
Sidecar value columns: Yes (in tag values that allow)
TSV tag columns: Yes (in tag values that allow)
TSV category columns: Yes (if matches a category key)
TSV value columns: Yes (again, if allowed)
Schema terms: Never
Schema descriptions: Probably?
Other schema section terms (units, unit classes, value classes, modifiers, attributes, properties): Probably not
Prologue/Epilogue: Probably
@monique2208 @happy5214 @VisLab
Does anyone have any suggestions or corrections to the above?
One of my favorite examples of an edge case is how to handle non-USD currencies. How would you represent a currency sign like the euro (€), pound (₤), yen/yuan (¥), or won (₩)?
So if we want to support currencies, a UTF8 symbol could appear in the following places in schemas: Prologue, Epilogue, Descriptions, and units.
I have no implementation concerns for it, it's just a question if we want to support it.
There is a unit modifier that is formally written with a non-ASCII character (micro-, written with a Greek lowercase mu). Allowing UTF-8 in modifiers could permit µ to be used as an alias to u.
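A minimal sketch of how such an alias could be handled, assuming the modifier is matched as the first character of the unit string. The function and table names are hypothetical and not part of the HED tools. Unicode has two look-alike characters here, the micro sign (U+00B5) and the Greek small letter mu (U+03BC), so the sketch maps both.

```python
# Hypothetical alias table: both Unicode "micro" characters map to ASCII "u".
MICRO_ALIASES = {
    "\u00b5": "u",  # MICRO SIGN
    "\u03bc": "u",  # GREEK SMALL LETTER MU
}


def canonical_modifier(unit_text: str) -> str:
    """Rewrite a leading micro character to its ASCII alias, if present."""
    if unit_text and unit_text[0] in MICRO_ALIASES:
        return MICRO_ALIASES[unit_text[0]] + unit_text[1:]
    return unit_text


print(canonical_modifier("\u00b5s"))  # us
```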
@VisLab
Just let me know if we want to support units/unit modifiers. I think the rest is pretty settled if we've fully determined we aren't supporting schema terms with UTF8.
Units and unit modifiers should support UTF8. Value classes as well.
Are column names(including in sidecars) allowed to include utf8? I'm assuming not.
Some of the ref code will need to be modified to support that if so.
Where can utf8 appear?
Sidecar column names: Probably not
Sidecar category keys: Yes
Sidecar category values: Yes (in tag values that allow)
Sidecar value columns: Yes (in tag values that allow)
TSV tag columns: Yes (in tag values that allow)
TSV category columns: Yes (if matches a category key)
TSV value columns: Yes (again, if allowed)
Schema terms: Never
Schema descriptions: Probably
Other schema section terms: No (unit classes, value classes, attributes, properties); Yes (units, unit modifiers)
Prologue/Epilogue: Probably
Sidecar column names correspond to TSV column names and might be utf-8
Okay so to summarize, basically the only place we will NOT allow UTF8 is schema terms, unit classes, value classes, properties, and attributes.
Yes, I think so.
Another edge case: how would you tag an image of a glass of piña colada (a randomly chosen object with an accented character that could be a tag extension)?
We would use something like the (Word, Label/#) structure she suggested before, under the current guidelines.
If we allow extensions to be UTF8, it might be odd if we don't allow utf8 schema terms as well.
Programmatically all these options seem about equally easy to implement to me, so no concerns there.
Per email from Kay, we are allowing UTF8 in schema terms.
So this leaves only the following as never allowing UTF8: other schema section terms (unit classes, value classes, attributes, properties).
I mentioned this idea in https://github.com/hed-standard/hed-specification/pull/569#issuecomment-1988949016, but it may be more relevant here.
I think there's a fundamental distinction in the schema between definitions that actually appear in HED data (tags, units, unit modifiers) and "meta-definitions", or those that describe other definitions in the schema (schema attributes, unit classes, value classes, and properties). The latter types are effectively constant keywords[...]
The "meta-definition" names are not going to be used directly by taggers (their primary role is to serve as structure/metadata for the schema), so UTF-8 is unnecessary.
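To illustrate the distinction, a check along these lines could enforce ASCII-only names for the meta-definition sections while leaving everything else unrestricted. This is only a sketch: the section labels and the function name are assumptions made for the example, not identifiers from the HED schema tools.

```python
# Hypothetical labels for the "meta-definition" sections named in this thread.
ASCII_ONLY_SECTIONS = {"unit classes", "value classes", "attributes", "properties"}


def name_allowed(section: str, name: str) -> bool:
    """Return True if the name is acceptable for the given schema section."""
    if section in ASCII_ONLY_SECTIONS:
        # str.isascii() is available in Python 3.7 and later.
        return name.isascii()
    # Tags, units, and unit modifiers may contain non-ASCII characters.
    return True


print(name_allowed("unit classes", "timeUnits"))
print(name_allowed("units", "\u20ac"))
```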
We'd have to decide how we want to compare foreign characters without case if we're allowing them as extensions or tag terms.
Python has .casefold(), which is a "more aggressive" version of .lower().
In my quick testing, .lower()/.upper() handle most non-ASCII characters now (umlauts and such). The following is an example of one that gives different results:
("ß", "ß".lower(), "ß".casefold()),
Returns: ('ß', 'ß', 'ss'),
For that particular character, casefold is definitely required, as Switzerland and Liechtenstein use "ss" instead of "ß", while other German dialects continue to use the older ligature.
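A small sketch of what that comparison could look like. The helper name same_term is made up for illustration and is not taken from either validator.

```python
def same_term(a: str, b: str) -> bool:
    """Caseless comparison using casefold() rather than lower()."""
    return a.casefold() == b.casefold()


# lower() leaves "ß" unchanged, so the two spellings do not match.
print("Straße".lower() == "strasse".lower())  # False
# casefold() maps "ß" to "ss", so they do.
print(same_term("Straße", "STRASSE"))  # True
```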
utf-8 is now allowed. Both the JavaScript and Python validators that are about to be released support it.
When presenting words I would like to add the full word into the HED string using Word, Label/# so that the full stimulus can be found easily as well. But since HED uses ANSI encoding we run into an issue with the German and French datasets. The problem probably extends to many other languages as well. It would be good if HED accepted more characters.
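To make the request concrete, the substitution could look roughly like this in Python. It is a sketch under the assumption that the template contains exactly one # placeholder; fill_placeholder is a hypothetical helper, not a function from the HED tools, and the example words are invented.

```python
def fill_placeholder(template: str, value: str) -> str:
    """Substitute a value for the single '#' placeholder in a HED template."""
    if template.count("#") != 1:
        raise ValueError("template must contain exactly one '#' placeholder")
    return template.replace("#", value)


# Example stimuli with characters outside ASCII.
for word in ["Straße", "château"]:
    print(fill_placeholder("Word, Label/#", word))
```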