gitonthescene / csv-reconcile

A reconciliation service for OpenRefine serving data from a given CSV file.
MIT License

Incorrect encoding detection #41

Closed jmacura closed 2 years ago

jmacura commented 3 years ago

Hello,

First, let me thank you for this great reconciliation tool!

I've been trying to use csv-reconcile with the csv-reconcile-geo plugin, and I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.

The problem was that the "budovy_wdqs.tsv" file I was providing uses UTF-8 encoding, while the program apparently expects it to be in cp-1250 for some reason. When I re-saved the .tsv in cp-1250, the error went away.

(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
    initdb.init_db_with_context()
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
    return init_db(db,
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
    header = next(reader)
  File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>

gitonthescene commented 3 years ago

Hi there,

It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following:

CSVENCODING=<encoding>

where <encoding> is replaced with the encoding you need. I’m assuming you want cp1250 but possibly utf-8.

I’m not 100% sure what’s happening from your description. If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default? This is all handled by Python’s csv module. The config allows you to supply an explicit encoding to use.
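To illustrate why an explicit encoding matters, here is a minimal sketch that reproduces the same kind of failure using only Python's standard library. The sample data and the `read_rows` helper are hypothetical and are not part of csv-reconcile; the sketch only assumes that the file is decoded with some codec before the csv module parses it.

```python
import csv
import io

# Hypothetical two-line TSV. In UTF-8, "ň" is encoded as bytes 0xC5 0x88,
# and byte 0x88 has no mapping in cp1250.
raw = "item\tlabel\nQ1\tPlzeň\n".encode("utf-8")


def read_rows(data: bytes, encoding: str) -> list[list[str]]:
    """Decode bytes with the given encoding and parse them as tab-separated rows."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text, delimiter="\t"))


# Decoding with the encoding the file was actually saved in works.
rows = read_rows(raw, "utf-8")
print(rows)

# Decoding the same bytes as cp1250 raises the error seen in the traceback.
try:
    read_rows(raw, "cp1250")
    cp1250_error = None
except UnicodeDecodeError as exc:
    cp1250_error = exc
    print(exc)
```

Setting the encoding explicitly removes the dependence on whatever the platform would otherwise pick.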

Please let me know if this helps.

Regards

gitonthescene commented 3 years ago

FWIW, Google turned up this description of how to determine the file encoding that might be worth trying.

b2m commented 3 years ago

As of the current version (0.3.0), csv-reconcile doesn't try to guess the encoding of a CSV file.

There is a separate Python library for that called chardet. It is already in the dependency tree of csv-reconcile, as it is a direct dependency of normality.

It may be worth trying to guess the encoding of a file when no user-specified encoding is given.

There is also the csv.Sniffer class, which helps detect the correct delimiter without relying on user parameters for every deviation from the defaults.
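A rough sketch of how these two suggestions could fit together is shown below. This is not the implementation used in csv-reconcile; the function names `guess_encoding` and `sniff_delimiter` and the sample data are hypothetical. As an assumption of this sketch, strict UTF-8 decoding is tried first, chardet is consulted only if it is installed, and the platform default is the last resort.

```python
import csv
import locale

try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(raw: bytes) -> str:
    """Guess an encoding for raw bytes (hypothetical helper, not csv-reconcile's)."""
    try:
        # Bytes that decode cleanly as strict UTF-8 are very likely UTF-8.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Last resort: the platform default that open() would have used.
    return locale.getpreferredencoding(False)


def sniff_delimiter(sample: str) -> str:
    """Let csv.Sniffer choose among a few common delimiters."""
    return csv.Sniffer().sniff(sample, delimiters="\t,;").delimiter


# Hypothetical sample resembling an item/coords TSV.
raw = "item\tcoords\nQ1\tPoint(14.4 50.1)\nQ2\tPoint(16.6 49.2)\n".encode("utf-8")
encoding = guess_encoding(raw)
delimiter = sniff_delimiter(raw.decode(encoding))
print(encoding, repr(delimiter))
```

Any such guess is a heuristic, so an explicit user setting should still take precedence when one is given.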

gitonthescene commented 3 years ago

@b2m - Thanks for the tips. I’ll have a look. I don’t believe the last release did either unless something changed in Python’s csv module.

b2m commented 3 years ago

I don’t believe the last release did either [...]

Exactly, the comment was meant as a set of tips for improving the usability of csv-reconcile and avoiding most of the CSV encoding/reading problems =)

jmacura commented 3 years ago

It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following

Oh, thank you! I wasn't aware of this config option. I guess this closes this issue.

If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default?

The file is not large, I am attaching it whole in its original form (i.e., before I re-encoded it). query.zip

Perhaps the right question here is, what is the default encoding csv-reconcile is expecting in the .tsv file? For me, it was apparently cp-1250, but this must be wrong for the vast majority of files, so could it be platform-dependent?
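On the platform question: when Python opens a text file without an explicit encoding, it falls back to a locale-dependent default, which is typically a legacy code page such as cp1250 on a Central European Windows installation and UTF-8 on most Linux and macOS systems. The short check below shows what that default is on a given machine. It assumes the file is opened without an encoding argument, which is consistent with the traceback above but is not confirmed here from the csv-reconcile source.

```python
import codecs
import locale

# The encoding open() uses when no encoding argument is passed
# (outside of Python's UTF-8 mode).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Normalized codec name, to show which codec Python would actually load.
print(codecs.lookup(default_encoding).name)
```

If this prints cp1250 while the file is saved as UTF-8, the mismatch described in this issue is expected.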

gitonthescene commented 3 years ago

Thanks. I’ll try to work in @b2m’s tips to auto-discover the encoding, but for now you should be able to use the override. Would you please let me know if you’re up and running using the override so I can close this issue?

Also, thanks for the file. I’ll use it to test the suggested features.

gitonthescene commented 3 years ago

@jmacura FWIW, I did check that the tsv in the file you posted is using utf-8. I'm not sure why your system thought it should be encoded cp1250. In any event, I implemented @b2m's suggestions above to add encoding detection and that will be in the next release. This issue will close once that gets merged back into master.

jmacura commented 3 years ago

@gitonthescene Great, thank you for this improvement! Besides that, I can confirm that adding the line CSVENCODING = "utf-8" to config.txt (and passing --config config.txt) works around the problem as well. Thank you for the hint.

woody544 commented 2 years ago

FYSA: Windows often defaults to a legacy code page such as cp1250 (the exact one depends on the system locale), which can cause hiccups like this, and I ran into this problem with csv-reconcile as well.

I had solved it before I saw the above issue/solution by opening the file in a text editor and saving the reps.tsv file as 'UTF-8 with BOM'. However, I expect changing the configuration file as suggested in the issue is a more robust and lasting solution.

gitonthescene commented 2 years ago

FWIW, this has gone out in the latest release.