Closed jmacura closed 2 years ago
Hi there,
It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following:
CSVENCODING=<encoding>
where <encoding> is replaced with the encoding you need. I’m assuming you want cp1250, but possibly utf-8.
I’m not 100% sure what’s happening from your description. If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default? This is all handled by Python’s csv module. The config allows you to supply an explicit encoding to use.
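For anyone wondering what "supplying an explicit encoding" amounts to on the Python side, here is a minimal sketch. It is not csv-reconcile's actual code; the helper name read_rows and the sample file are made up for illustration, and it only assumes the file is read through the standard csv module.

```python
import csv
import os
import tempfile


def read_rows(path, encoding, delimiter="\t"):
    """Read every row of a delimited text file, decoding with an explicit encoding.

    Hypothetical helper, not part of csv-reconcile.
    """
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# Write a small UTF-8 sample file, then read it back with the encoding stated explicitly.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write("id\tname\n1\tČeské Budějovice\n")
    rows = read_rows(path, encoding="utf-8")

print(rows)
```

Leaving out the encoding argument makes open() fall back to a platform-dependent default, which is presumably where the mismatch comes from.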
Please let me know if this helps.
Regards
FWIW, Google turned up this description of how to determine the file encoding that might be worth trying.
At the current version (0.3.0) csv-reconcile doesn't try to guess the encoding of a CSV file.
There is a separate Python library for that called chardet. It is already in the dependency tree of csv-reconcile, as it is a direct dependency of normality.
It may be worth a try to guess the encoding of a file when no user specific encoding is given.
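A rough sketch of what such a guess could look like. This is only an illustration under assumptions, not a proposal for the actual implementation: the function name guess_encoding and the cp1250 fallback are made up, and the only chardet call used is chardet.detect(), which takes bytes and returns a dict with an "encoding" key. Strict UTF-8 decoding is tried first because it rarely accepts text in a legacy code page by accident.

```python
def guess_encoding(raw: bytes, fallback: str = "cp1250") -> str:
    """Guess the encoding of raw bytes (hypothetical helper).

    The fallback value is an arbitrary choice for this example.
    """
    # Step 1: valid UTF-8 is a strong signal, so accept it immediately.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: ask chardet if it is installed; otherwise use the fallback.
    try:
        import chardet
    except ImportError:
        return fallback

    guess = chardet.detect(raw)["encoding"]  # may be None when chardet has no guess
    return guess or fallback


utf8_sample = "1\tČeské Budějovice\n".encode("utf-8")
legacy_sample = "1\tČeské Budějovice\n".encode("cp1250")

print(guess_encoding(utf8_sample))
print(guess_encoding(legacy_sample))
```

On short inputs chardet's guess for the legacy sample can be unreliable, so a user-supplied encoding should probably still take precedence over any detection.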
There is also the csv.Sniffer class, which helps detect the correct delimiter without relying on user parameters for every deviation from the defaults.
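To illustrate the csv.Sniffer idea, here is a small sketch using only the standard library. The helper name detect_delimiter, the candidate delimiter list, and the sample data are assumptions made for the example; how csv-reconcile would wire this in is not covered here.

```python
import csv


def detect_delimiter(sample: str, candidates: str = "\t,;|") -> str:
    """Return the delimiter csv.Sniffer infers from a text sample (hypothetical helper).

    Sniffer.sniff() raises csv.Error when it cannot decide, so real code
    would want to catch that and fall back to a default.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "id\tname\tcity\n1\tAlpha\tPraha\n2\tBeta\tBrno\n3\tGamma\tPlzen\n"
delimiter = detect_delimiter(sample)

# Parse the same sample with the detected delimiter.
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
print(repr(delimiter), rows[1])
```

Restricting the candidates keeps the sniffer from picking a letter or digit that happens to occur equally often on every line.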
@b2m - Thanks for the tips. I’ll have a look. I don’t believe the last release did either unless something changed in Python’s csv module.
I don’t believe the last release did either [...]
Exactly, the comment was meant as tips for improvement of the usability of csv-reconcile to avoid most of the csv encoding/reading problems =)
It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following
Oh, thank you! I wasn't aware of this config option. I guess this closes this issue.
If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default?
The file is not large, I am attaching it whole in its original form (i.e., before I re-encoded it). query.zip
Perhaps the right question here is, what is the default encoding csv-reconcile is expecting in the .tsv file? For me, it was apparently cp-1250, but this must be wrong for the vast majority of files, so could it be platform-dependent?
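On the platform question: in Python, open() in text mode without an encoding argument uses a platform-dependent default, which locale.getpreferredencoding(False) reports. Whether csv-reconcile 0.3.0 opens the file that way is an assumption on my part, but the sketch below shows how to check the default on a given machine and what happens when UTF-8 bytes are decoded as cp1250.

```python
import locale

# The encoding open() uses in text mode when no encoding= argument is given.
# On many Linux and macOS systems this reports UTF-8; on Windows it is
# typically a locale-specific code page.
default_encoding = locale.getpreferredencoding(False)
print("platform default:", default_encoding)

# Decoding UTF-8 bytes with the wrong code page does not necessarily fail;
# it can silently produce different characters instead.
text = "č"
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1250")
print(repr(text), "read as cp1250 becomes", repr(misread))

# Reversing the wrong step recovers the original text.
recovered = misread.encode("cp1250").decode("utf-8")
print(repr(recovered))
```

If the first line prints cp1250 on the affected machine, that would explain why a UTF-8 file was misread there while the same file works elsewhere.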
Thanks. I’ll try to work in @b2m’s tips to auto-discover the encoding, but for now you should be able to use the override. Would you please let me know if you’re up and running using the override so I can close this issue?
Also, thanks for the file. I’ll use it to test the suggested features.
@jmacura FWIW, I did check that the tsv in the file you posted is using utf-8. I'm not sure why your system thought it should be encoded cp1250. In any event, I implemented @b2m's suggestions above to add encoding detection and that will be in the next release. This issue will close once that gets merged back into master.
@gitonthescene Great, thank you for this improvement! Besides that, I can confirm that appending the line CSVENCODING = "utf-8" to config.txt (and passing --config config.txt) works around the problem as well. Thank you for the hint.
FYSA: Windows with a Central European locale defaults to the cp1250 encoding, which can cause hiccups like this, and I ran into this problem with csv-reconcile as well.
I had solved it before I saw the above issue/solution, by opening the file in the text editor, and saving the reps.tsv file as 'UTF-8 with BOM'. However, I expect changing the configuration file as suggested in the issue is a more robust and lasting solution.
FWIW, this has gone out in the latest release.
Hello,
First, let me thank you for this great reconciliation tool!
I've been trying to use csv-reconcile with the csv-reconcile-geo plugin, and I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.
So, the problem was that the "budovy_wdqs.tsc" file I was providing was using UTF-8 encoding, while the program apparently expected it to be in cp-1250 for some reason. When I resaved the .tsv in cp-1250, the bug went away.