ghmo / google-refine

Automatically exported from code.google.com/p/google-refine

Importing TSV doesn't detect UTF-8, reinterpret command may not be working correctly anymore. #164

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Open the attached file in a text editor that displays UTF-8 (Notepad++ or 
TextMate) and look at row 33: the diacritics are shown correctly. (The attached 
file is a TSV exported from Refine with latest trunk, 1606.) So far, so good.
2. Check out and build trunk, then run it.
3. Create a new project from the attached, previously exported TSV file with 
defaults.
4. Diacritic characters are not displayed correctly for row 33 in Refine.

5. Furthermore, trying value.reinterpret("utf-8") or reinterpret(value,"utf-8") 
does not change anything for row 33.

What is the expected output? What do you see instead?

I had assumed that Refine parses the entire file, but then I recall that perhaps 
only the first 20 or 30 lines of a file are parsed to detect the encoding? If 
so, then line 33 may have been missed.

I should also be able to easily reinterpret the values back to 
UTF-8, but unfortunately that does not seem to work any longer either.
Various text tools display the diacritic characters in the attached exported 
TSV file correctly as UTF-8.

Original issue reported on code.google.com by thadguidry on 19 Oct 2010 at 2:17


GoogleCodeExporter commented 8 years ago
Screenshot http://awesomescreenshot.com/0b22ju666 of what step 4 looks like on 
my Windows PC with defaults selected when creating new project file from 
attached .TSV file.

Original comment by thadguidry on 19 Oct 2010 at 2:21

GoogleCodeExporter commented 8 years ago
If, however, I change the encoding of the attached file to UTF-8 using 
Notepad++ or a similar text tool, save it, and then create a new project from 
that newly encoded file, Refine does seem to detect UTF-8 and correctly 
displays the diacritic characters for row 33.

Hmm, perhaps the fault lies at the beginning, with the Export function for TSV 
in Refine? Does the export use UTF-8, or does it default to ASCII instead? In 
either case, should it ask you which encoding you want to export as?

Original comment by thadguidry on 19 Oct 2010 at 2:29

GoogleCodeExporter commented 8 years ago
Note that you can use `return value.decode('utf-8')` as a Jython expression 
instead of GEL, and the values will be processed correctly.
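
For anyone following along, here is a rough standalone sketch of the idea behind that workaround, written in plain Python 3 rather than Jython. It assumes the bytes were UTF-8 but got decoded with a single-byte encoding (Latin-1 is used here purely as an assumption; the thread does not confirm which encoding was actually applied). The `reinterpret` helper below is hypothetical and is not Refine's implementation.

```python
def reinterpret(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo a mis-decoding: recover the raw bytes, then decode them properly.

    `wrong` is the encoding assumed to have been applied by mistake (an
    assumption for this sketch); `right` is the encoding the bytes really use.
    """
    return text.encode(wrong).decode(right)


# A value as it was originally written (UTF-8 on disk).
original = "José"

# Simulate the bad import: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Reinterpreting the garbled value recovers the original text.
repaired = reinterpret(garbled)
```

This only works when the wrong decoding was lossless, i.e. every byte mapped to some character that can be mapped back.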

Original comment by jayl...@gmail.com on 11 Nov 2010 at 1:26

GoogleCodeExporter commented 8 years ago
This is a duplicate of issue 237.  The character encoding guesser was leaving 
the project encoding unset if it got a confidence value below the threshold.

The reason it's guessing wrong is that it only looks at the first 4k (approx.) 
of the file and your first non-ASCII characters are beyond that boundary.  I 
investigated increasing the lookahead, but it appears that only the first 3881 
bytes are available at the time the guessing is done.  Changing this would 
require restructuring things.
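
To illustrate the lookahead problem, here is a small Python sketch (not Refine's code). The helper names and the sample data are made up for illustration; only the 3881-byte figure comes from the observation above. It shows that when the first non-ASCII byte sits past the lookahead window, the window is pure ASCII and gives a guesser nothing to distinguish UTF-8 from any other ASCII-compatible encoding.

```python
def first_non_ascii_offset(data: bytes) -> int:
    """Return the byte offset of the first non-ASCII byte, or -1 if none."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return i
    return -1


def prefix_is_pure_ascii(data: bytes, lookahead: int = 3881) -> bool:
    """True if the part of the file a guesser would see contains only ASCII."""
    return first_non_ascii_offset(data[:lookahead]) == -1


# Hypothetical TSV: many plain ASCII rows, then one row with a diacritic.
sample = (
    b"id\tname\n"
    + b"1\tplain ascii row\n" * 300
    + "301\tJosé\n".encode("utf-8")
)

offset = first_non_ascii_offset(sample)   # position of the first UTF-8 lead byte
seen_as_ascii = prefix_is_pure_ascii(sample)
```

With this sample the first non-ASCII byte lies beyond the 3881-byte window, so `seen_as_ascii` is true even though the file as a whole is not ASCII.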

Original comment by tfmorris on 27 Nov 2010 at 12:38