lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Weird error while running makecldf #272

Closed MuffinLinwist closed 7 months ago

MuffinLinwist commented 11 months ago

I've been having problems running the CLDF conversion on different datasets. The content of the cldf/requirements.txt file is erased. I know it's normal behaviour that the conversion automatically overwrites all the files whenever I run makecldf. However, the file ends up empty when it shouldn't be. Also, I get this error message, but the conversion still runs and all the other files are okay. This seems to be happening only on my end. I tried deleting everything and downloading it again, but the same thing happens. I'm on Windows. Here is the command I run:

(venv) PS C:\Users\user\Documents\datasets\seifartecheverriboran> cldfbench lexibank.makecldf lexibank_seifartecheverriboran.py --concepticon-version=v3.1.0 --glottolog-version=v4.8 --clts-version=v2.2.0

And here is the output I get:

INFO    running _cmd_makecldf on seifartecheverriboran ...
INFO    added sources
INFO    added concepts
INFO    added languages
cldfify: 0it [00:00, ?it/s]2023-11-01 10:52:55,141 [WARNING] line 48:duplicate grapheme in profile: pai()u
INFO    file written: C:/Users/user/Documents/datasets/seifartecheverriboran/cldf/.transcription-report.json
INFO    Summary for dataset C:\Users\user\Documents\datasets\seifartecheverriboran\cldf\cldf-metadata.json
- **Varieties:** 3
- **Concepts:** 412
- **Lexemes:** 1,244
- **Sources:** 1
- **Synonymy:** 1.01
- **Invalid lexemes:** 0
- **Tokens:** 9,348
- **Segments:** 66 (1 BIPA errors, 1 CLTS sound class errors, 63 CLTS modified)
- **Inventory size (avg):** 41.33
INFO    file written: C:/Users/user/Documents/datasets/seifartecheverriboran/TRANSCRIPTION.md
INFO    file written: C:/Users/user/Documents/datasets/seifartecheverriboran/cldf/lingpy-rcParams.json
INFO    ... done seifartecheverriboran [150.4 secs]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\user\Documents\datasets\venv\Scripts\cldfbench.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\Documents\datasets\venv\Lib\site-packages\cldfbench\__main__.py", line 89, in main
    return args.main(args) or 0
           ^^^^^^^^^^^^^^^
  File "C:\Users\user\Documents\datasets\venv\Lib\site-packages\pylexibank\commands\makecldf.py", line 28, in run
    creators, contributors = dataset.get_creators_and_contributors(strict=False)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Documents\datasets\venv\Lib\site-packages\pylexibank\dataset.py", line 100, in get_creators_and_contributors
    return metadata.get_creators_and_contributors(self.contributors_path, strict=strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Documents\datasets\venv\Lib\site-packages\pylexibank\metadata.py", line 367, in get_creators_and_contributors
    for row in iter_rows(fname):
  File "C:\Users\user\Documents\datasets\venv\Lib\site-packages\pylexibank\metadata.py", line 388, in iter_rows
    for line in (fname_or_lines if isinstance(fname_or_lines, list) else fname_or_lines.open()):
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 145: character maps to <undefined>

Here is the dataset example. Could you kindly give me some clarity on why this is happening?

johenglisch commented 11 months ago

Just to quickly add my first thoughts from when I looked at this a bit earlier:

This line from the stacktrace has an open() call that doesn't specify any character encoding:

https://github.com/lexibank/pylexibank/blob/bc570e1c6d50e9d8fad7c8544dc07f69657f8655/src/pylexibank/metadata.py#L388

And the exception itself seems to come from the cp1252 decoder. So maybe we're being bitten by Python not using UTF-8 by default on Windows?
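To illustrate the suspicion above, here is a minimal sketch of how the same bytes decode differently depending on the encoding passed to open(). The file name, its contents, and the helper read_lines are hypothetical and not taken from pylexibank; the cp1252 encoding is passed explicitly to mimic what an unqualified open() would typically fall back to on a Western European Windows setup.

```python
import tempfile
from pathlib import Path


def read_lines(path, encoding=None):
    """Read a text file line by line; encoding=None uses the platform default."""
    with path.open(encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical contributors file; "Á" is encoded in UTF-8 as bytes C3 81,
    # and byte 0x81 has no mapping in Python's cp1252 codec.
    path = Path(tmp) / "contributors_example.md"
    path.write_text("Name | Role\nÁlvaro | Author\n", encoding="utf-8")

    # Decoding as cp1252 (assumed here to be the Windows default) fails.
    try:
        read_lines(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError as err:
        cp1252_failed = True
        error_message = str(err)

    # Decoding with an explicit UTF-8 encoding succeeds.
    utf8_lines = read_lines(path, encoding="utf-8")

print(cp1252_failed, utf8_lines)
```

If this is indeed the cause, passing encoding="utf-8" at the open() call would make the read independent of the platform default.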

LinguList commented 11 months ago

Looks like that, @johenglisch.

chrzyki commented 11 months ago

Yep, we've had that issue before in a different context: https://github.com/concepticon/pyconcepticon/issues/10

LinguList commented 11 months ago

Can it be patched quickly?

chrzyki commented 11 months ago

Should be doable, yes. I'll set up a test environment and have a look at that.

Meanwhile, @MuffinLinwist would you mind giving the fix outlined in https://github.com/concepticon/pyconcepticon/issues/10 a try? I.e. in the Windows command prompt, before running cldfbench, set a temporary default encoding for Python with:

set PYTHONIOENCODING=utf-8

(Please note that this is a temporary environment variable, i.e. it does not persist after closing the command prompt.)
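To check which encoding a bare open() call would use in the current session, a small sketch like the following may help. The helper name default_text_encoding is hypothetical. As far as the Python documentation describes it, PYTHONIOENCODING governs the standard streams, while UTF-8 mode (enabled with PYTHONUTF8=1 or -X utf8) is what changes the default for open(), so it may be worth confirming which of the two is actually in effect.

```python
import locale
import sys


def default_text_encoding():
    """Return the encoding open() uses when no encoding argument is given."""
    return locale.getpreferredencoding(False)


# True when Python runs in UTF-8 mode (PYTHONUTF8=1 or -X utf8).
utf8_mode = bool(sys.flags.utf8_mode)

print("default encoding for open():", default_text_encoding())
print("UTF-8 mode enabled:", utf8_mode)
print("stdout encoding:", sys.stdout.encoding)
```

Running this before and after setting the variable shows whether the default for file reads has changed or only the stream encoding.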

MuffinLinwist commented 10 months ago

Thanks @everyone for your comments. Now the error message no longer appears (and the conversion also takes far less time to run), but the requirements file still ends up empty. Do you have any idea why this could be happening?

chrzyki commented 10 months ago

I've tried addressing both issues in https://github.com/lexibank/pylexibank/pull/273 and https://github.com/cldf/cldfbench/pull/92, respectively. The PRs might not be final. I'll let you know once the issues are fixed completely.

chrzyki commented 10 months ago

@MuffinLinwist Both PRs have been merged. If you install both packages from source (in a venv), you can check whether this fixes the issue for you:

$ (my_venv) pip install git+https://github.com/cldf/cldfbench.git@9ff9fe91331030309c53d55efc36f76deb516e0f
$ (my_venv) pip install git+https://github.com/lexibank/pylexibank.git@1a52624a97f371fb6e61cc44014b1dbf0a03c142

chrzyki commented 7 months ago

@MuffinLinwist Have you had a chance to test the fixes?

MuffinLinwist commented 7 months ago

I couldn't because my laptop broke down and I had to get a new one. So, I'm closing this issue, thanking @all of you for your assistance on it :)