karussell / snacktory

Readability clone in Java
461 stars, 159 forks

ensure asian characters are not broken #5

Open karussell opened 12 years ago

karussell commented 12 years ago

This is now fixed! But needs a unit test!

From email:

The issue is in Converter.streamToString(). There's a loop that reads the HTTP data in chunks. Each chunk is converted to a String separately, but a chunk may contain only the first (or second) half of a multi-byte character, which results in corrupted data. It happens sporadically, depending on timing.
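A minimal sketch of the failure mode described above, not the actual snacktory code: `ChunkDecodingDemo`, `decodePerChunk` and `decodeWhole` are hypothetical names, and UTF-8 plus a tiny chunk size are assumed so that the split happens deterministically instead of depending on network timing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ChunkDecodingDemo {

    // Problematic pattern: every chunk is decoded on its own, so a multi-byte
    // character that straddles a chunk boundary is decoded as two broken halves.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Safer pattern: collect the raw bytes first and decode them once at the end.
    static String decodeWhole(InputStream in, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Three Japanese characters, each 3 bytes long in UTF-8 (9 bytes total).
        String text = "\u65E5\u672C\u8A9E";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);

        // A chunk size of 2 forces every character to be split across chunks.
        String perChunk = decodePerChunk(new ByteArrayInputStream(data), 2);
        String whole = decodeWhole(new ByteArrayInputStream(data), 2);

        System.out.println("per-chunk decoding intact: " + text.equals(perChunk));
        System.out.println("whole-buffer decoding intact: " + text.equals(whole));
    }
}
```

Wrapping the stream in an `InputStreamReader` would be another way to avoid the problem, since the reader keeps incomplete byte sequences around until the rest of the character arrives.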

Also, the counting of bytesRead was wrong, so on a slow connection there may be a "size exceeded" message with no justification.
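The email does not say how bytesRead was miscounted. One plausible cause, offered here purely as an assumption, is adding the buffer size on every iteration instead of the number of bytes `read()` actually returned; on a slow connection `read()` often returns far fewer bytes than the buffer holds, so the total would be overestimated. The sketch below counts only what was really read. `ByteCountingDemo`, `TrickleInputStream` and `readLimited` are hypothetical names, not part of snacktory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ByteCountingDemo {

    // Hypothetical stand-in for a slow connection: hands out at most one byte per read call.
    static class TrickleInputStream extends FilterInputStream {
        TrickleInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    // Reads the whole stream, failing only if more than maxBytes were really received.
    static byte[] readLimited(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n; // count the bytes returned by read(), not buf.length
            if (total > maxBytes) {
                throw new IOException("size exceeded: more than " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        InputStream slow = new TrickleInputStream(new ByteArrayInputStream(data));
        // 100 single-byte reads stay well under the 1000-byte limit.
        System.out.println(readLimited(slow, 1000).length);
    }
}
```

With the miscounting assumed above, the same 100 one-byte reads would have been tallied as 100 full buffers and rejected long before the real limit was reached.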

To test this problem, I read a Japanese article (URL below) with the browser and saved its content (e.g. to a file). Then I ran the streamToString() function in a loop (with some delay) and each time compared its output with the expected output in the file. Sometimes I saw dozens of successful tests followed by several failures, so the problem is not very persistent, but the errors occurred often enough.

The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see the article), where some junk was displayed instead of the "3".
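The manual procedure above could perhaps be turned into an offline regression check along these lines. This is only a sketch under stated assumptions: `SplitChunkRegressionCheck`, `RandomChunkInputStream` and `countMismatches` are hypothetical names, the `streamToString` shown here is a simplified reader-based stand-in rather than the real `Converter.streamToString()`, and random chunk sizes with fixed seeds take the place of real network timing so that failures are reproducible.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Random;

class SplitChunkRegressionCheck {

    // Simulates unpredictable network timing by returning 1 to 4 bytes per read call.
    static class RandomChunkInputStream extends FilterInputStream {
        private final Random random;

        RandomChunkInputStream(InputStream in, Random random) {
            super(in);
            this.random = random;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int n = 1 + random.nextInt(Math.min(len, 4));
            return super.read(b, off, n);
        }
    }

    // Simplified stand-in for the method under test: the reader buffers
    // incomplete byte sequences, so split characters are decoded correctly.
    static String streamToString(InputStream in, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Decodes the same text many times with different chunk boundaries
    // and counts how often the result differs from the expected text.
    static int countMismatches(String expected, int runs) throws IOException {
        byte[] data = expected.getBytes(StandardCharsets.UTF_8);
        int failures = 0;
        for (int seed = 0; seed < runs; seed++) {
            InputStream in = new RandomChunkInputStream(
                    new ByteArrayInputStream(data), new Random(seed));
            if (!expected.equals(streamToString(in, StandardCharsets.UTF_8))) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        // Mixed multi-byte and ASCII text as a small substitute for the saved article.
        String expected = "\u65E5\u672C\u8A9E 300 \u65E5\u672C\u8A9E";
        System.out.println("mismatches: " + countMismatches(expected, 200));
    }
}
```

A real unit test would call the actual `Converter.streamToString()` in place of the stand-in and could load the saved article from a test resource as the expected text.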

karussell commented 12 years ago

see https://github.com/karussell/snacktory/commit/09c48a362c3652c2296e252b4cda42f13ed4aad7