semanticarts / ontology-toolkit

Tools to update and export ontology RDF.
Apache License 2.0

Bundling fails under Windows on .md files with unrecognized Unicode characters. #110

Open stevenchalem opened 1 year ago

stevenchalem commented 1 year ago

.md files containing certain Unicode characters (e.g. left and right double quotes, U+201C and U+201D) cause the bundling process to fail. For example, when the new Namespace.md file was added to gist, it contained such characters and the bundling process failed.

sa-bpelakh commented 1 year ago

Were the Unicode characters in the source encoded? E.g., Markdown expects U+201C to become &#8220; (the decimal value instead of the hex value).
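For reference, the decimal form mentioned here can be produced mechanically. The following is a minimal Python sketch, not code from ontology-toolkit; the helper name `to_decimal_entities` is made up for illustration. It relies on the standard `xmlcharrefreplace` codec error handler, which replaces every character outside the target encoding with a decimal numeric character reference.

```python
def to_decimal_entities(text: str) -> str:
    """Replace each non-ASCII character with a decimal &#NNNN; reference.

    Hypothetical helper for illustration; not part of ontology-toolkit.
    """
    # Encoding to ASCII with 'xmlcharrefreplace' turns U+201C into b'&#8220;'.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "\u201cgist\u201d"  # left and right double quotes around 'gist'
print(to_decimal_entities(sample))  # &#8220;gist&#8221;
```

Plain ASCII text passes through unchanged, so running this over an already-clean file is a no-op.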

Jamie-SA commented 1 year ago

@sa-bpelakh see this PR for an example with an apostrophe that didn't convert correctly: https://github.com/semanticarts/gist/pull/920/files

It looks like it is a 3 byte character: e2 80 99

Jamie-SA commented 1 year ago

Here is the PR to fix the characters that this issue was originally reported on: https://github.com/semanticarts/gist/pull/868/files

This seemed to use 3 byte quote characters of e2 80 9c and e2 80 9d. You can get a copy of the old version of the file and see the output with:

```
git checkout bcba46dbda88c6a0eb02e2441725bcf63cecd3e9
head -55 docs/Namespace.md | tail -10 | od -xacb --endian=big
```

Look for the 'nul' in the output.
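The byte sequences quoted in this thread can also be inspected from Python. The sketch below decodes each one twice: as UTF-8, and as cp1252, a common default code page on Windows. That the failing run actually decoded the file as cp1252 is an assumption; nothing in this thread confirms it. The helper name `try_decode` is made up for illustration.

```python
def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, returning a marker string instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        return f"<decode error: {err.reason}>"


# Byte sequences quoted in this thread.
for hex_bytes in ("e2 80 99", "e2 80 9c", "e2 80 9d"):
    raw = bytes.fromhex(hex_bytes.replace(" ", ""))
    # cp1252 is an ASSUMED Windows default, used only for comparison.
    for encoding in ("utf-8", "cp1252"):
        # ascii() keeps the output printable on any console.
        print(hex_bytes, encoding, ascii(try_decode(raw, encoding)))
```

Decoded as UTF-8, the three sequences are U+2019, U+201C and U+201D. Decoded as cp1252, the first two turn into three unrelated characters each, and the third cannot be decoded at all because byte 0x9d has no mapping in Python's cp1252 codec.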

sa-bpelakh commented 1 year ago

Oh, I understand what happened. The way I see it, our Markdown inputs should be UTF-8 compliant. However:

  1. This expectation should be documented. This is also true, btw, of our RDF inputs, which may be an issue at some point if we decide to internationalize.
  2. If the UTF-16 input is unintentional, we should fail more gracefully, and
  3. If the UTF-16 is intentional, the users should use the &#... encoding I described above.
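As a rough illustration of the "fail more gracefully" point, here is a minimal Python sketch that reads a Markdown file strictly as UTF-8 and reports the file and byte offset when that fails. It is not the toolkit's implementation; `read_markdown_utf8` and `MarkdownEncodingError` are hypothetical names introduced only for this example.

```python
from pathlib import Path


class MarkdownEncodingError(ValueError):
    """Hypothetical error type for inputs that are not valid UTF-8."""


def read_markdown_utf8(path) -> str:
    """Read a Markdown file, insisting on UTF-8 regardless of platform."""
    raw = Path(path).read_bytes()
    try:
        # Decode explicitly instead of relying on the platform default.
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Name the file and the offending byte so the user can fix it.
        raise MarkdownEncodingError(
            f"{path}: not valid UTF-8 at byte offset {err.start} "
            f"(byte 0x{raw[err.start]:02x}); re-save the file as UTF-8 "
            "or replace the character with a &#NNNN; reference"
        ) from err
```

A caller could catch `MarkdownEncodingError` and print its message rather than a stack trace. Decoding explicitly as UTF-8 also means the result no longer depends on the operating system's default encoding.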