Problem with UTF-8 encoded submissions

mxhdev / SQLChecker

GNU General Public License v3.0


Problem with UTF-8 encoded submissions #20

Closed timoei closed 8 years ago

timoei commented 8 years ago

The file reader isn't able to read UTF-8 encoded submissions correctly. A ï»¿ is prepended to the first line of the text file (these are the bytes of the UTF-8 byte order mark, shown as individual characters). For that reason the first tag can't be recognized. See the attached example submission (utf8Bug). The bug doesn't happen if the submission is encoded as 'UTF-8 without BOM'. However, German umlauts aren't recognized in that case either.

Possible solution: convert all submissions to ANSI before reading them.

Attachment: utf8Bug.txt

mxhdev commented 8 years ago

Possibly helpful links:

http://stackoverflow.com/questions/6998905/java-bufferedwriter-object-with-utf-8
http://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/
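
For context, the approach these links point toward is decoding the file with an explicit UTF-8 charset instead of the platform default. The following is only a minimal sketch of that idea, not the code used in SQLChecker: the class `BomAwareDecoder` and its `decode` method are hypothetical names, and the sample content is made up. Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF, so the sketch removes it by hand.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SQLChecker.
final class BomAwareDecoder {
    // U+FEFF is the byte order mark; in UTF-8 it is encoded as the bytes EF BB BF.
    private static final char BOM = '\uFEFF';

    /** Decodes raw submission bytes as UTF-8 and drops a leading BOM, if present. */
    static String decode(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Made-up sample: a BOM followed by the ASCII text "SQL".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'S', 'Q', 'L'};
        System.out.println(decode(withBom));
        // Decoding the same bytes with a single-byte charset reproduces
        // the three stray characters described in the report.
        System.out.println(new String(withBom, StandardCharsets.ISO_8859_1));
    }
}
```

Decoding as UTF-8 would also keep umlauts intact, which the ANSI conversion proposed above can only do if every submission really is ANSI encoded.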

mxhdev commented 8 years ago

Fixed in commit 35e0e77463e8e0199281a7b741d99dbf2f4028a9

Solution

This problem was solved by stripping all non-ASCII characters from the test string before performing any checks. This is now an additional step of the string normalization process, which runs before the actual tag-detection checks.
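
A normalization step of this kind could look roughly like the sketch below. This is not the code from the commit; `AsciiNormalizer`, `stripNonAscii` and the tag `/*1a*/` are hypothetical names chosen for illustration.

```java
// Hypothetical helper, not the actual SQLChecker implementation.
final class AsciiNormalizer {
    /** Removes every character outside the 7-bit ASCII range (U+0000 to U+007F). */
    static String stripNonAscii(String input) {
        return input.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "\u00EF\u00BB\u00BF" is how a UTF-8 BOM looks after being decoded
        // with a single-byte charset; "/*1a*/" is a made-up tag.
        String firstLine = "\u00EF\u00BB\u00BF/*1a*/";
        System.out.println(stripNonAscii(firstLine).startsWith("/*1a*/")); // prints true
    }
}
```

Note that this also removes umlauts and any other non-ASCII characters, so it is only suitable for the string used in tag detection, not for text that must be preserved as written.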

Alternative Solution

This could also have been solved with an indexOf(PREFIX | SUFFIX) >= 0 check. While startsWith / endsWith sometimes failed, the indexOf check still worked, because it finds the tag anywhere in the string regardless of extra characters before or after it.
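
The difference between the two checks can be illustrated with a small sketch. The `PREFIX` and `SUFFIX` values and the class `TagLookup` are assumptions made for this example; the real tag delimiters in SQLChecker may differ.

```java
// Hypothetical illustration; PREFIX and SUFFIX are assumed values.
final class TagLookup {
    static final String PREFIX = "/*";
    static final String SUFFIX = "*/";

    /** Strict check: fails if anything (e.g. a BOM) precedes or follows the tag. */
    static boolean strictMatch(String line) {
        return line.startsWith(PREFIX) && line.endsWith(SUFFIX);
    }

    /** Lenient check: succeeds if both delimiters occur anywhere in the line. */
    static boolean lenientMatch(String line) {
        return line.indexOf(PREFIX) >= 0 && line.indexOf(SUFFIX) >= 0;
    }

    public static void main(String[] args) {
        String lineWithBom = "\uFEFF/*1a*/"; // BOM in front of a made-up tag
        System.out.println(strictMatch(lineWithBom));  // prints false
        System.out.println(lenientMatch(lineWithBom)); // prints true
    }
}
```

The lenient check is more forgiving, but it would also accept lines where the delimiters appear in the middle of other text, so it trades strictness for robustness against stray characters.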