Closed: JapSeyz closed this issue 5 years ago
@JapSeyz so I can take a closer look, what's the report type?
@hakanensari Yeah for sure, it's a _GET_CONVERGED_FLAT_FILE_ORDER_REPORT_DATA_ report, spanning the last three days' worth of orders.
Cheers
Here's some example data I received:
Original from Amazon: Jürgen
Parsed: JÃ¼rgen
string.encoding: UTF-8
Bytes: 74 195 131 194 188 114 103 101 110
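For anyone reading along, here is a minimal sketch of how the byte list above can be inspected in plain Ruby. It assumes the bytes are exactly as quoted and that the damage is a single Windows-1252 round trip, which may not hold for every row in a report. The variable names are illustrative only.

```ruby
# Bytes copied from the comment above.
bytes = [74, 195, 131, 194, 188, 114, 103, 101, 110]

# Interpret the raw bytes as UTF-8 text.
parsed = bytes.pack("C*").force_encoding(Encoding::UTF_8)
puts parsed.valid_encoding? # well-formed UTF-8, just not the intended text
puts parsed

# Undo one Windows-1252 round trip (an assumption about the cause):
# map each character back to its single Windows-1252 byte,
# then read those bytes as UTF-8.
repaired = parsed.encode(Encoding::Windows_1252).force_encoding(Encoding::UTF_8)
puts repaired
```

Note that the reverse step raises Encoding::UndefinedConversionError if the string contains characters with no Windows-1252 equivalent, so it is only a diagnostic aid, not a general repair.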
@JapSeyz I did reproduce the bug and will have to refactor parsing. My old assumptions about how Amazon encodes seem to no longer hold. I'm also not sure why I was not using the charset returned in the header. This was possibly not available when I originally worked on the parser 🤔
In any case, until I refactor, try this:
parser = client.get_report(id)
parser.content.force_encoding(Encoding::UTF_8) # this should fix for now?
parser.parse
Unfortunately that doesn't fix it.
As far as I can see, the content is ASCII-8BIT when it comes from Amazon, then it's force-encoded to CP1252, and then encoded to UTF-8.
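The chain described above can be reproduced outside the parser. This is a sketch under the assumption that the response body really is UTF-8 data tagged as binary; it does not call the library itself, and the variable names are made up for illustration.

```ruby
# UTF-8 bytes tagged as binary, standing in for the raw response body.
raw = "J\u00FCrgen".b

# Step 1: relabel the bytes as Windows-1252 (no bytes change).
mislabeled = raw.dup.force_encoding(Encoding::Windows_1252)

# Step 2: transcode to UTF-8; each original byte becomes its own character.
double_encoded = mislabeled.encode(Encoding::UTF_8)

p double_encoded.bytes
```

The resulting bytes match the ones reported for "Jürgen", which is consistent with the diagnosis.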
I am not sure whether force_encoding is destructive, i.e. does it overwrite the original string? The docs say it returns the force-encoded string: https://ruby-doc.org/core-2.1.0/String.html#method-i-force_encoding. Also, parser doesn't have a content= setter, so I can't set it programmatically.
Either way, parser.content.force_encoding(Encoding::UTF_8) doesn't work, unfortunately.
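On the question of whether force_encoding is destructive: in Ruby, String#force_encoding changes the receiver in place and returns that same object, so no setter is needed. A quick check in plain Ruby, independent of the parser:

```ruby
content = "J\u00FCrgen".b # binary-tagged copy of UTF-8 bytes
result = content.force_encoding(Encoding::UTF_8)

puts result.equal?(content) # true: the very same object was modified
puts content.encoding       # UTF-8
```

So the workaround does retag the string. It would still have no visible effect if later code retags the same content again before transcoding, which would be consistent with the CP1252 step described above.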
@hakanensari I am also getting issues with replacement characters in Spanish names: V�CTOR
So far I've resorted to gsub-ing the known replacements, but that's not really feasible at scale.
Cheers
@JapSeyz thanks for reporting this. I just pushed a fix and will bump version momentarily.
To recap the story for future reference, Amazon seems to have silently changed its encoding behaviour in the Reports API, breaking our flat file parsing. They used to encode files in a variety of encodings and not advertise the latter explicitly. Now they encode (possibly) everything in UTF-8 and return the encoding in the response header.
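As a rough illustration of "use the encoding from the response header", here is a sketch in plain Ruby. The helper name tag_with_charset and the sample header value are hypothetical and are not part of the library discussed here; the actual fix may well look different.

```ruby
# Hypothetical helper: pick the encoding advertised in a Content-Type
# header value and tag a copy of the body with it.
def tag_with_charset(body, content_type, fallback: Encoding::UTF_8)
  name = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  encoding = begin
    name ? Encoding.find(name) : fallback
  rescue ArgumentError
    fallback # charset name Ruby does not know
  end
  body.dup.force_encoding(encoding)
end

body = "J\u00FCrgen".b # stand-in for a binary response body
text = tag_with_charset(body, "text/plain;charset=UTF-8")
puts text.encoding
puts text.valid_encoding?
```

Falling back to UTF-8 when the header is missing or unrecognized is a design choice made for this sketch, based on the observation that Amazon now appears to send UTF-8.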
Hi @hakanensari The update seems to have fixed the encoding issues
Hi,
When running
report = client.get_report(report_id)
report.headers
I see a charset=UTF-8 header. However, running
report.body.encoding
returns #<Encoding:Windows-1252>.
Running
report.parse
seems to try to parse the file, but it fails to do so accurately: it turns German characters into a UTF-8 string with ISO-8859-1 bytes.
Input:
schwarze Metallfüße
Parsed: schwarze MetallfÃ¼ÃŸe
Parsed bytes: [195, 131, 194, 188, 195, 131, 197, 184]
Looking at an ISO 8859-1 table here: https://cs.stanford.edu/people/miles/iso8859.html, we see that byte 195 is a capital A with a tilde, and 188 is a fraction.
What's the best way to fix this? It seems to be a conversion that's happening inside .parse.
Cheers
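A side note on the byte list above: the trailing 197, 184 pair suggests the intermediate encoding is Windows-1252 rather than strict ISO-8859-1, since the two differ in the 0x80 to 0x9F range. A small sketch in plain Ruby, assuming the source bytes are the UTF-8 encoding of "üß":

```ruby
utf8_bytes = "\u00FC\u00DF".b # UTF-8 bytes for the two German characters

# Misread the bytes under each candidate encoding, then transcode to UTF-8.
via_cp1252 = utf8_bytes.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
via_latin1 = utf8_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

p via_cp1252.bytes # ends in 197, 184, as in the report
p via_latin1.bytes # ends in 194, 159 instead
```

Only the Windows-1252 path reproduces the reported bytes, which agrees with report.body.encoding returning Windows-1252.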