globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/

Change quoting-closing apostrophe to ´ #1026

Open 1313ou opened 2 weeks ago

1313ou commented 2 weeks ago
  1. Change quotation scheme from `quoted' to `quoted´
  2. Also includes some fixes along the way (punctuation, aligned phoneme notation, typos, ...)
  3. Output is normalized YAML to avoid non-deterministic changes (the YAML-dumping process used has been fixed to be idempotent).

An apostrophe used to close a quotation is ambiguous, which makes quotations very difficult to parse. Take for example:

He's too nice, that's his `Achilles' heel.

He's too nice, that's his `Achilles' heel'.

You don't know whether the quotation ends after Achilles or at the end of the sentence until you reach the end. This makes it difficult for processing tools to, for example, extract and style the quotation.
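
The ambiguity can be made concrete with a small sketch. It assumes a quotation opens at the first backtick and may close at any later occurrence of the closing character; the name `closing_candidates` is hypothetical and not part of the OEWN tooling.

```python
def closing_candidates(text: str, closer: str = "'") -> list[int]:
    """Return the index of every character that could close a quotation
    opened by the first backtick in `text`."""
    start = text.find("`")
    if start == -1:
        return []
    return [i for i in range(start + 1, len(text)) if text[i] == closer]


ambiguous = "He's too nice, that's his `Achilles' heel'."
dedicated = "He's too nice, that's his `Achilles' heel´."

# With the apostrophe as closer, more than one position competes.
print(closing_candidates(ambiguous))
# With a dedicated closer, a single position qualifies.
print(closing_candidates(dedicated, closer="´"))
```

A parser facing the first string has to guess or look ahead; with a dedicated closer it can stop at the first match.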

Instead of multiplexing the apostrophe character, I suggest using a dedicated character (´) to close quotations. It sits in the Latin-1 range (U+00B4), just outside 7-bit ASCII, and mirrors the backtick/grave accent (`).
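
For reference, a quick check of how the proposed character is encoded, using only Python built-ins. It shows that U+00B4 is not 7-bit ASCII: it takes one byte in ISO 8859-1 (Latin-1) and two bytes in UTF-8.

```python
ch = "\u00b4"  # the proposed closing mark

print(hex(ord(ch)))           # code point
print(ch.isascii())           # whether it falls in the 7-bit ASCII range
print(ch.encode("latin-1"))   # encoding in ISO 8859-1
print(ch.encode("utf-8"))     # encoding in UTF-8
```

Which of these matters in practice depends on the encoding of the files that consumers actually read.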

Putting an end to this multiplexing requires sorting current uses of the apostrophe into 1) omission of a character (elision, contraction, possessive ...) and 2) quotation ending. This is what is done here, and the change affects only the latter use.

This change is easily reversible by automatic character substitution.

It opens the way to other quotation schemes (by automatic character substitution):

‟double quoted” “double quoted” „double quoted low” ❛heavy quoted❜ ❟heavy quoted low❜ ❝heavy double quoted❞ ❠heavy double quoted low❠ «guillemet»
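
A minimal sketch of such a substitution, assuming the backtick and ´ occur in the data only as quotation marks. The function name `requote` and its parameters are hypothetical.

```python
def requote(text: str, opener: str, closer: str,
            src_opener: str = "`", src_closer: str = "´") -> str:
    """Swap one pair of single-character quotation marks for another."""
    return text.translate({ord(src_opener): opener, ord(src_closer): closer})


sample = "He's too nice, that's his `Achilles' heel´."

curly = requote(sample, "\u201c", "\u201d")  # double curly quotation marks
legacy = requote(sample, "`", "'")           # back to the old `quoted' scheme
print(curly)
print(legacy)
```

Apostrophes that mark elision or possession are left untouched in both directions. Reverting to the apostrophe restores the old format but gives up the distinction between the two uses.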

In addition, the YAML is simpler: there are fewer instances where apostrophes have to be escaped.
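
To illustrate the escaping point: in a YAML single-quoted scalar, an apostrophe is escaped by doubling it. The sketch below applies that rule by hand; `yaml_single_quoted` is a hypothetical helper, and whether a given YAML dumper chooses the single-quoted style for a particular string is an assumption here.

```python
def yaml_single_quoted(value: str) -> str:
    """Render `value` as a YAML single-quoted scalar.

    Inside single quotes, YAML escapes an apostrophe by doubling it.
    """
    return "'" + value.replace("'", "''") + "'"


old = "that's his `Achilles' heel'"
new = "that's his `Achilles' heel´"

print(yaml_single_quoted(old))
print(yaml_single_quoted(new))
# Number of apostrophes that need escaping in each scheme.
print(old.count("'"), new.count("'"))
```

Only the closing quotation mark stops needing an escape; apostrophes in contractions and possessives still do.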

jmccrae commented 2 weeks ago

Thanks for this. I like the idea, but I wonder if the choice of characters is the best way to implement this.

Introducing a lot of non-ASCII characters can cause issues. In particular, I am still not 100% sure how well legacy apps that use the WNDB format can cope with these characters, so I would like to test it a bit more.

Secondly, if we do use Unicode characters for punctuation, wouldn't it be more appropriate to use U+2018 and U+2019 for quotes, rather than U+00B4, which is officially called "Acute Accent"?
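
The official names and general categories of the candidate characters can be listed with Python's standard `unicodedata` module. This is only a lookup of what the Unicode Character Database records, not a recommendation.

```python
import unicodedata

candidates = ("\u0060", "\u00b4", "\u2018", "\u2019", "\u201c", "\u201d")

for ch in candidates:
    # Code point, general category, official character name.
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

The grave and acute accents are classified as modifier symbols (`Sk`), while U+2018/U+201C and U+2019/U+201D are classified as initial (`Pi`) and final (`Pf`) quote punctuation.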

1313ou commented 2 weeks ago

I don't think non-ASCII will pose a problem nowadays. All modern languages and libraries will handle this seamlessly. Legacy applications that stumble on non-ASCII characters are already likely to stumble on:

¬, °, ·, ×, ⁓, −, ∞, ̃, €, ½, á, à, ä, ā, ç, é, É, ê, ë, ﬁ, ʰ, ʻ, í, Ḳ, ñ, ó, ò, ö, ő, ś, š, ú, ü, ű, ū, α, β, γ, ρ, ъ, Ъ, ь, Ь

not to mention the em-dash and ellipsis, all of which have already been imported in OEWN.
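
An inventory like the one above can be produced with a short scan. This is a sketch under assumptions: `non_ascii_inventory` is a hypothetical helper and the sample string is made up, not taken from OEWN.

```python
def non_ascii_inventory(text: str) -> dict[str, int]:
    """Count each character of `text` that lies outside 7-bit ASCII."""
    counts: dict[str, int] = {}
    for ch in text:
        if ord(ch) > 0x7F:
            counts[ch] = counts.get(ch, 0) + 1
    return counts


sample = "caf\u00e9 \u2026 20 \u00b0C, `quoted\u00b4"
for ch, n in sorted(non_ascii_inventory(sample).items()):
    print(f"U+{ord(ch):04X} {ch!r} x{n}")
```

Running such a scan over the data before and after the change would show whether the new closing mark adds anything that legacy WNDB consumers do not already have to handle.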

As for the choice of quoting characters, I agree with you that ‘quoted’ with ‘ (u2018) and ’ (u2019) for quotes would be more appropriate. Or “quoted” with “ (u201C) and ” (u201D) double quotation marks.

Actually that was my first choice but I fell back on the "Acute Accent" (u00B4) because

  • it is extended ASCII (Latin-1), coded on one byte in that encoding
  • the move is less extensive: you just replace the closing mark
  • it is more conservative: the backtick stays in place so that code that spots quotations with this will still work
  • the backtick for opening is actually the "Grave Accent" and the acute accent and grave accent are in a mirror relation.

If you are open to the u2018-u2019 move (yielding ‘quoted’), so am I. I can easily adjust the PR to do just this.

fcbond commented 2 weeks ago

I would also prefer “ (u201C) and ” (u201D), as people are less likely to confuse them with apostrophes.

-- Francis Bond https://fcbond.github.io/