ramyakalyan / google-refine

Automatically exported from code.google.com/p/google-refine

Intermittent charset detection failure #404

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Import a CSV file

What is the expected output? What do you see instead?

Occasionally the encoding-guessing code gets it wrong and the imported data is
junk.

More here:
http://groups.google.com/group/google-refine-dev/browse_thread/thread/2f95a91cbd174865

Original issue reported on code.google.com by paulm%pa...@gtempaccount.com on 13 Jun 2011 at 6:42

GoogleCodeExporter commented 8 years ago
Just to recap Paul's analysis from the email so that we have all the info in
one place: he suspects that short reads from the input stream can leave the
buffer only partially filled, so the unfilled remainder is a long run of NULs
that corrupts the character set analysis.

Original comment by tfmorris on 13 Jun 2011 at 9:09
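
For readers unfamiliar with the short-read behaviour described in the recap above, here is a minimal Java sketch. It is not the code from the patch. The class name `ShortReadSketch` and the helpers `trickle` and `fill` are hypothetical, made up for illustration. It only shows that a single `InputStream.read` call may return fewer bytes than the buffer holds, and that looping until the buffer is full (or the stream ends) avoids the trailing NULs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ShortReadSketch {

    // Hypothetical test stream: hands out at most `chunk` bytes per call,
    // the way a network- or pipe-backed stream may.
    static InputStream trickle(byte[] data, final int chunk) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, chunk));
            }
        };
    }

    // Keeps reading until the buffer is full or the stream ends.
    // Returns the number of bytes actually read, so the caller can hand
    // only that prefix to the charset detector.
    static int fill(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        Arrays.fill(data, (byte) 'a');

        // One read call: only the first chunk arrives, the rest stays NUL.
        byte[] once = new byte[64];
        int n = trickle(data, 16).read(once, 0, once.length);
        System.out.println("single read: " + n + " bytes");

        // Looped read: the whole buffer is populated.
        byte[] full = new byte[64];
        int m = fill(trickle(data, 16), full);
        System.out.println("looped read: " + m + " bytes");
    }
}
```

Whatever detector is used, passing it only the first `total` bytes rather than the whole array keeps unfilled NULs out of the analysis even when the file is shorter than the buffer.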

GoogleCodeExporter commented 8 years ago
Fixed in r2102.  Thanks for the patch.  It was committed mostly as given, except
that I made the wrapping of the input stream conditional: it is only wrapped if
the input stream doesn't support mark/reset.

p.s. It's a nit, but it'll make your patch less noisy if you turn off automatic 
"cleanup" of formatting in your editor/IDE.  Not a big deal for a little patch 
like this, but could complicate things for a larger patch.

Original comment by tfmorris on 14 Jun 2011 at 5:58
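
The conditional wrapping mentioned in the fix above could look roughly like the sketch below. This is an assumption about the general shape, not the contents of r2102. The names `MarkResetSketch`, `ensureMarkSupported`, `peek`, and `withoutMark` are hypothetical. The idea is to wrap the stream in a `BufferedInputStream` only when `markSupported()` returns false, so that the bytes sampled for detection can be rewound with `reset()` before the real import reads them.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkResetSketch {

    // Wrap only when the stream cannot rewind on its own.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Samples up to `limit` bytes, then rewinds so the next reader
    // starts from the same position. Requires a stream with mark support.
    static byte[] peek(InputStream in, int limit) throws IOException {
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    // Hypothetical helper for the demo: a stream that reports no mark support.
    static InputStream withoutMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, world".getBytes(StandardCharsets.US_ASCII);
        InputStream in = ensureMarkSupported(withoutMark(data));
        byte[] head = peek(in, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // stream was rewound by peek
    }
}
```

Skipping the wrapper for streams that already support mark/reset avoids an extra layer of buffering where it is not needed.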