linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

gen-project fails with UnicodeEncodeError (Win10, De) #601

Closed dalito closed 1 year ago

dalito commented 2 years ago

Describe the bug

When following part 8 of the tutorial on a German Windows-10 PC, the generation of the project failed with a UnicodeEncodeError (see traceback).

To Reproduce

Steps to reproduce the behavior:

  1. Follow part 8 of the tutorial
  2. Run gen-project -d personinfo/ personinfo.yaml

Traceback

(.venv) PS C:\Users\dlinke\MyProg_exp040\linkml\part8> gen-project -d personinfo/ personinfo.yaml
ALL_SCHEMAS = ['personinfo.yaml']
INFO:root:Generating: graphql
...
INFO:root:Generating: markdown
INFO:root: SCHEMA: personinfo.yaml
INFO:root: PARENT=personinfo//docs
INFO:root: ARGS: {'mergeimports': True, 'directory': 'personinfo//docs', 'index_file': 'personinfo.md'}
Traceback (most recent call last):
  File "C:\dev\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\dev\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\Scripts\gen-project.exe\__main__.py", line 7, in <module>
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\generators\projectgen.py", line 191, in cli
    gen.generate(yamlfile, project_config)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\generators\projectgen.py", line 114, in generate
    gen_dump = gen.serialize(**serialize_args)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\utils\generator.py", line 116, in serialize
    self.visit_schema(**kwargs)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\generators\markdowngen.py", line 86, in visit_schema
    self.pred_hier(slot)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\generators\markdowngen.py", line 396, in pred_hier
    self.bullet(self.slot_link(slot, use_desc=True), level)
  File "C:\Users\dlinke\MyProg_exp040\linkml\.venv\lib\site-packages\linkml\generators\markdowngen.py", line 505, in bullet
    print(f'{"    " * level} * {txt}')
  File "C:\dev\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u279e' in position 4: character maps to <undefined>
(.venv) PS C:\Users\dlinke\MyProg_exp040\linkml\part8>


Additional context

The Unicode character that causes the problem is an arrow ('\u279e', per the traceback) in markdowngen.py.

Because a context manager redirects stdout (and therefore every print statement) to a file, the encoding of that file stream matters. On Windows it is not UTF-8 by default (cp1252 in the traceback above), which causes the error.
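The failure can be reproduced outside linkml on any platform by forcing the encoding of the redirected stream. This is a minimal sketch, not the markdowngen.py code: the helper names `write_via_redirect` and `demo` are made up for illustration, and cp1252 is chosen because it is the codec named in the traceback.

```python
import contextlib
import os
import tempfile

ARROW = "\u279e"  # the character reported in the traceback


def write_via_redirect(path, encoding):
    """Print a bullet containing the arrow while stdout is redirected to a file."""
    with open(path, "w", encoding=encoding) as stream:
        with contextlib.redirect_stdout(stream):
            # print() now writes to `stream`, so the file's encoding applies.
            print(f" * {ARROW} slot")


def demo():
    """Try the same write with cp1252 and with UTF-8 and record the outcome."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for enc in ("cp1252", "utf-8"):
            try:
                write_via_redirect(os.path.join(tmp, f"{enc}.md"), enc)
                results[enc] = "ok"
            except UnicodeEncodeError:
                results[enc] = "UnicodeEncodeError"
    return results


if __name__ == "__main__":
    print(demo())
```

With cp1252 the print call raises, because that code page has no mapping for U+279E; with UTF-8 the same call succeeds.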

The "redirect_stdout" context manager also prevents use of the pdb debugger. If I insert import pdb; pdb.set_trace() before the error, the code enters debugging mode, but I cannot interact with the debugger because stdout is redirected.
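One possible workaround for the debugging problem, offered as an assumption rather than something tried in this thread: `redirect_stdout` only swaps `sys.stdout`, while `sys.__stdout__` still refers to the stream the interpreter started with. A debugger constructed with that stream should print its prompt to the terminal again. The helper name `make_terminal_debugger` is hypothetical.

```python
import pdb
import sys


def make_terminal_debugger() -> pdb.Pdb:
    """Build a debugger that writes to the original stdout.

    sys.__stdout__ is untouched by contextlib.redirect_stdout, so the
    prompt and command output bypass the redirection.
    """
    return pdb.Pdb(stdout=sys.__stdout__)


# Usage at the point of interest, inside the redirected block:
#     make_terminal_debugger().set_trace()
```

Commands are still read from stdin, which the redirection does not touch.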

If I replace the problematic character, model generation runs fine.

dalito commented 2 years ago

The problem is that Windows does not use UTF-8 as the default encoding; see PEP 540 or Inada Naoki's summary. This differs from macOS and Linux.
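To see which defaults are in effect on a given machine, the standard library can be queried directly. This is a small diagnostic sketch; the function names are illustrative and not part of linkml.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Encoding that open() falls back to in text mode when none is given."""
    return locale.getpreferredencoding(False)


def describe_defaults() -> dict:
    """Collect the encoding-related settings relevant to this issue."""
    return {
        "open_default": default_text_encoding(),
        # True when Python runs in UTF-8 mode (PEP 540), e.g. with PYTHONUTF8=1.
        "utf8_mode": bool(sys.flags.utf8_mode),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_defaults().items():
        print(f"{key}: {value}")
```

On the German Windows 10 machine from the report, the open() default would be expected to show cp1252, matching the codec in the traceback.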

To fix the bug, the encoding must be set explicitly to UTF-8 when writing files in text mode. For example,

with open(self.exist_warning(self.dir_path(cls)), 'w') as clsfile:

should be changed to

with open(self.exist_warning(self.dir_path(cls)), 'w', encoding='UTF-8') as clsfile:

This will affect several places in the code. I can prepare a PR tomorrow.

cmungall commented 2 years ago

reopened as I think some new tests were incorporated that don't do this

does anyone have any ideas how to check for this with gh actions, otherwise we always run the risk of reintroducing this
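One way such a check could look, sketched here as an assumption and not as something the project adopted: a small script that parses each source file and reports text-mode `open()` calls that pass no `encoding` argument. A CI job could run it over the package and fail when it reports anything. The function name `find_open_without_encoding` is hypothetical, and the check only sees direct calls to the built-in name `open`.

```python
import ast


def find_open_without_encoding(source: str) -> list[int]:
    """Return line numbers of text-mode open() calls that pass no encoding."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_open_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
        )
        if not is_open_call:
            continue
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        # The mode is the second positional argument or the mode= keyword.
        mode = node.args[1] if len(node.args) >= 2 else None
        for kw in node.keywords:
            if kw.arg == "mode":
                mode = kw.value
        is_binary = (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        )
        if not is_binary:
            hits.append(node.lineno)
    return sorted(hits)
```

Python 3.10 and later also offer a built-in alternative: running the test suite with `-X warn_default_encoding` (or `PYTHONWARNDEFAULTENCODING=1`) emits an `EncodingWarning` whenever the locale default encoding is used implicitly.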

sierra-moxon commented 2 years ago

what if we pull this file opening into a method? tests would still have to use the method, but then we wouldn't be doing it several times?
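A helper along these lines might look as follows. This is only a sketch of the idea: the name `write_text_utf8` and its signature are hypothetical and do not describe any function in linkml.

```python
from pathlib import Path


def write_text_utf8(path, text):
    """Write text to path as UTF-8, regardless of the platform default."""
    target = Path(path)
    # Create missing parent directories so callers need no extra setup.
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(text)
    return target
```

Generators and tests that route their writes through one such function would pick up the explicit encoding in a single place.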

cmungall commented 2 years ago

I think that makes sense. We mostly go via methods anyway, e.g. when loading or dumping linkml objects. But there are various times in tests where we want to do ad-hoc loading or dumping, often via plain yaml/json libs.

But maybe it's not an issue for the tests so long as all our test files are properly encoded and restricted to ascii unless otherwise required?

dalito commented 2 years ago

@sierra-moxon Where would you put the method, in generator.py, class Generator?

dalito commented 2 years ago

#647 added a function write_to_file that could be used to avoid repeating the code several times. Let me know if you want me to go through the changes made in #607 and use this new function instead.

dalito commented 2 years ago

This issue may be closed or at least relabeled: it is no longer a bug. Only the refactoring to use write_to_file remains. But I feel that changing existing, correct code is not worth it, since the function also adds one level of indirection and therefore complexity.