mergestat / syncs

MergeStat container based syncs

UTF encoding error in git_files and mergestat_explore syncs #61

Closed amenowanna closed 1 year ago

amenowanna commented 1 year ago

Example error:
DEBUG: ERROR: invalid byte sequence for encoding "UTF8": 0xf5 0xa0 0x87 0x9

Repo to reproduce: https://github.com/vectordotdev/vector

We previously addressed this problem in #55, but that fix does not appear to have covered every case.
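For context, the bytes from the error message can be checked outside the database. The following is a minimal sketch in Python (chosen only for illustration; it is not part of the sync, and the helper name `first_invalid_offset` is hypothetical). It shows that a strict UTF-8 decoder rejects the sequence at its very first byte, since 0xF5 can never start a valid UTF-8 character.

```python
# Bytes copied from the reported error message: 0xf5 0xa0 0x87 0x9
reported = bytes([0xF5, 0xA0, 0x87, 0x09])


def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first byte that breaks strict UTF-8 decoding, or -1."""
    try:
        data.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return -1


if __name__ == "__main__":
    print(first_invalid_offset(reported))
```

Running the same helper over the raw bytes of an exported CSV would give the offset of the first offending byte, which can help locate the file in the repository that triggers the failure.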

amenowanna commented 1 year ago

@patrickdevivo @riyaz-ali I have not been able to find a better solution so far. The encoding of files.csv is reported as UTF when I run file -bi files.csv within the sync container. However, I am finding mixed reports on whether converting from and to the same encoding, as we do now, actually removes invalid bytes. What I was able to get working was to swap between encodings: I first converted to UTF-16 and then back to UTF-8, and after that the copy command succeeded. I would like to get your thoughts on this approach.

iconv -f UTF-8 -t UTF-16 -c < files.csv > files_utf16.csv
iconv -f UTF-16 -t UTF-8 -c < files_utf16.csv > files_utf8.csv
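The two iconv passes above rely on -c, which omits characters that cannot be converted. As a point of comparison, the same effect can be sketched in a single pass in Python by decoding with a lenient error handler and re-encoding. This is only an illustration under stated assumptions: `sanitize_utf8` is a hypothetical helper, the sample bytes are made up around the sequence from the error message, and this is not the fix adopted in the repository.

```python
def sanitize_utf8(raw: bytes, drop: bool = True) -> bytes:
    """Return raw with invalid UTF-8 bytes dropped, or replaced by U+FFFD.

    drop=True mirrors the effect of iconv -c (invalid input is omitted);
    drop=False keeps a visible marker where bytes were lost.
    """
    errors = "ignore" if drop else "replace"
    return raw.decode("utf-8", errors=errors).encode("utf-8")


if __name__ == "__main__":
    # Made-up CSV fragment containing the byte sequence from the error message.
    sample = b"path/to/file\xf5\xa0\x87\tcontents\n"
    clean = sanitize_utf8(sample)
    clean.decode("utf-8")  # strict decode; raises if anything invalid remains
    print(clean)
```

Dropping bytes silently changes file contents, so keeping the U+FFFD marker (drop=False) may be preferable when it matters to see where data was lost.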