CSV report being with wrong charset

lineofflight / peddler

Amazon Selling Partner API (SP-API) in Ruby

https://lineofflight.github.io/peddler/

MIT License


CSV report being with wrong charset #128

Closed by JapSeyz 5 years ago

JapSeyz commented 5 years ago

Hi,

When running report = client.get_report(report_id), report.headers shows a charset=UTF-8 header, but report.body.encoding returns #<Encoding:Windows-1252>.

Running report.parse does parse the file, but not accurately: it turns German characters into a UTF-8 string containing ISO-8859-1 bytes.

Input: schwarze Metallfüße

Parsed: schwarze MetallfÃ¼ÃŸe

Parsed bytes: [195, 131, 194, 188, 195, 131, 197, 184]

Looking at an ISO 8859-1 table here: https://cs.stanford.edu/people/miles/iso8859.html, we see that byte 195 is a capital A with a tilde, and 188 is a fraction.

What's the best way to fix this? It seems to be a conversion that's happening inside .parse.

Cheers

hakanensari commented 5 years ago

@JapSeyz so I can take a closer look, what's the report type?

JapSeyz commented 5 years ago

@hakanensari Yeah, for sure. It's a _GET_CONVERGED_FLAT_FILE_ORDER_REPORT_DATA_ report, going back three days' worth of orders.

Cheers

JapSeyz commented 5 years ago

Here's some example data I received:

Original from Amazon: Jürgen

Parsed: JÃ¼rgen

string.encoding: UTF-8

Bytes: 74 195 131 194 188 114 103 101 110
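The byte sequence above can be reproduced with a short sketch. It assumes (this is not confirmed at this point in the thread) that the UTF-8 payload was labelled as Windows-1252 and then transcoded to UTF-8; the variable names are illustrative only.

```ruby
# Bytes as they would arrive over the wire: "Jürgen" encoded as UTF-8.
# String#b returns a copy tagged ASCII-8BIT (binary).
raw = "J\u00FCrgen".b

# Assumed faulty step: label the bytes as Windows-1252, then transcode to UTF-8.
# Each byte of the two-byte "ü" is treated as its own character and re-encoded.
garbled = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

p garbled.encoding.name # => "UTF-8"
p garbled.bytes         # => [74, 195, 131, 194, 188, 114, 103, 101, 110]
```

The result is a valid UTF-8 string, which is why string.encoding reports UTF-8 even though the text is garbled.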

hakanensari commented 5 years ago

@JapSeyz I did reproduce the bug and will have to refactor parsing. My old assumptions about how Amazon encodes seem to no longer hold. I'm also not sure why I was not using the charset returned in the header. This was possibly not available when I originally worked on the parser 🤔

In any case, until I refactor:

parser = client.get_report(id)
parser.content.force_encoding(Encoding::UTF_8) # this should fix for now?
parser.parse

JapSeyz commented 5 years ago

Unfortunately that doesn't fix it.

As far as I can see, it's ASCII-8BIT when it comes from Amazon, then it's force-encoded to CP1252, and then transcoded to UTF-8.

I am not sure whether force_encoding is destructive, i.e. does it overwrite the original string? The docs say it returns the force-encoded string: https://ruby-doc.org/core-2.1.0/String.html#method-i-force_encoding and parser doesn't have a content= setter, so I can't set it programmatically.

Either way parser.content.force_encoding(Encoding::UTF_8) doesn't work, unfortunately.
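For reference, String#force_encoding does modify the receiver in place and returns that same object, but it only changes the encoding label and never the bytes. A minimal sketch, assuming the garbling comes from a Windows-1252 round trip as described above, shows why relabelling as UTF-8 has no effect on an already transcoded string, and how the round trip can sometimes be undone by hand:

```ruby
# force_encoding mutates the receiver and returns the same object.
s = "J\u00FCrgen".b
returned = s.force_encoding(Encoding::UTF_8)
p returned.equal?(s) # => true
p s.encoding.name    # => "UTF-8"

# A string that was already transcoded is valid UTF-8, so relabelling it
# as UTF-8 changes nothing.
garbled = "J\u00C3\u00BCrgen" # the garbled form "JÃ¼rgen"
relabelled = garbled.dup.force_encoding(Encoding::UTF_8)
p relabelled == garbled # => true

# Undoing the assumed round trip: transcode back to Windows-1252 to recover
# the original bytes, then label those bytes as UTF-8.
repaired = garbled.encode(Encoding::Windows_1252).force_encoding(Encoding::UTF_8)
p repaired == "J\u00FCrgen" # => true
```

This repair only works while every original byte survived the first conversion. Once a byte has been swapped for a replacement character, the original value is lost and cannot be recovered this way.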

JapSeyz commented 5 years ago

@hakanensari I am also getting issues with replacement characters in Spanish names: VÃ�CTOR

So far I've resorted to gsub-ing the known replacements, but that's not really feasible at scale.

Cheers

hakanensari commented 5 years ago

@JapSeyz thanks for reporting this. I just pushed a fix and will bump the version momentarily.

To recap the story for future reference, Amazon seems to have silently changed its encoding behaviour in the Reports API, breaking our flat file parsing. They used to encode files in a variety of encodings and not advertise the latter explicitly. Now they encode (possibly) everything in UTF-8 and return the encoding in the response header.
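The approach described in the recap, trusting the charset advertised in the response header, could look roughly like the sketch below. This is not the library's actual fix: charset_from and decode_body are hypothetical helper names, and the UTF-8 fallback is an assumption.

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
def charset_from(content_type)
  content_type.to_s[/charset=\s*"?([^\s;"]+)/i, 1]
end

# Hypothetical helper: label the raw body with the advertised charset,
# then transcode to UTF-8. Falls back to UTF-8 (an assumption) when the
# header is missing or names an encoding Ruby does not know.
def decode_body(body, content_type, fallback: Encoding::UTF_8)
  name = charset_from(content_type)
  encoding = begin
    name ? Encoding.find(name) : fallback
  rescue ArgumentError
    fallback
  end
  body.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

raw = "J\xC3\xBCrgen".b # UTF-8 bytes for "Jürgen", tagged as binary
decoded = decode_body(raw, "text/plain;charset=UTF-8")
p decoded.bytes # => [74, 195, 188, 114, 103, 101, 110]
```

Labelling the bytes with the advertised encoding before any transcoding avoids the double encoding seen earlier in this thread.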

JapSeyz commented 5 years ago

Hi @hakanensari, the update seems to have fixed the encoding issues.