godfat / cultivate

Latest big5-to-utf8 still causes error #6

Open ac9607 opened 6 years ago

ac9607 commented 6 years ago

The latest version of big5-to-utf8 causes invalid byte sequence in UTF-8 (ArgumentError). It's fine with the older version.

wangyps-MacBook-Pro-2:cultivate wangyp$ ./bin/import /Users/wangyp/Documents/cultivate/data20180907
Traceback (most recent call last):
    9: from ./bin/import:7:in `<main>'
    8: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:11:in `traverse'
    7: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:11:in `each'
    6: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:15:in `block in traverse'
    5: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:15:in `each'
    4: from ./bin/import:8:in `block in <main>'
    3: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:20:in `import'
    2: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:40:in `load_rows'
    1: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:50:in `load_csv'
/Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:50:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

godfat commented 6 years ago

Is it giving this error for some files, or all of them? If it's just some files, could you send me a sample privately?

ac9607 commented 6 years ago

All of them

godfat commented 6 years ago

@ac9607 It works for me though?

21:21 ~/p/g/cultivate master> ruby bin/big5-to-utf8 data/
21:21 ~/p/g/cultivate master>

godfat commented 5 years ago

Still having this issue?

ac9607 commented 5 years ago

The error does not occur consistently, so I may need the log to find out what happened. By the way, in a recent try, I converted a mixture of UTF-8 and non-UTF-8 encoded files smoothly without any error, but the error occurred when importing.

wangyps-MacBook-Pro-2:cultivate wangyp$ ./bin/big5-to-utf8_new /Users/wangyp/Documents/cultivate/dataall
wangyps-MacBook-Pro-2:cultivate wangyp$ ./bin/import /Users/wangyp/Documents/cultivate/dataall
    9: from ./bin/import:7:in `<main>'
    8: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:11:in `traverse'
    7: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:11:in `each'
    6: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:15:in `block in traverse'
    5: from /Users/wangyp/Documents/cultivate/lib/cultivate.rb:15:in `each'
    4: from ./bin/import:8:in `block in <main>'
    3: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:20:in `import'
    2: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:40:in `load_rows'
    1: from /Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:50:in `load_csv'
/Users/wangyp/Documents/cultivate/lib/cultivate/model.rb:50:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

godfat commented 5 years ago

Oh, yes, I didn't realize you're talking about importing. So it looks like the problem is that big5-to-utf8 doesn't properly transcode all data into UTF-8.

Could you make sure big5-to-utf8_new is up to date with the current bin/big5-to-utf8?
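
For reference, here is a minimal sketch of how this kind of error can arise. It assumes that model.rb line 50 runs gsub with a regular expression over text read as UTF-8. The sample bytes and the pattern below are made up for illustration and are not taken from the project.

```ruby
# Hypothetical sample: bytes that are valid Big5 but not valid UTF-8.
raw = "\xA4\xA4\xA4\xE5".b

# Tagging the bytes as UTF-8 does not validate or convert them.
text = raw.dup.force_encoding(Encoding::UTF_8)
puts text.valid_encoding?  # false

error_message = nil
begin
  # Matching a regexp against the broken string is where Ruby complains.
  text.gsub(/\r\n/, "\n")
rescue ArgumentError => e
  error_message = e.message
end
puts error_message
```

So a file that was left in its original encoding can load without complaint and only fail later, at the first regexp operation.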

godfat commented 5 years ago

Bah, wait, I think bin/big5-to-utf8 just isn't expected to work that way. I forgot that:

content = File.read(path, :encoding => 'big5-uao')

This simply assumes that all the files are encoded in Big5-UAO, which is not the case here. We might need to guess the encoding if we're mixing data...
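
One way the guessing might look, as a rough sketch only: treat the bytes as UTF-8 if they already validate, otherwise fall back to Big5-UAO. The helper name read_as_utf8 is hypothetical and not part of bin/big5-to-utf8, and the heuristic is imperfect, since some Big5 byte sequences could happen to be valid UTF-8 as well.

```ruby
# Hypothetical helper (not from the project): return the content as UTF-8,
# guessing the source encoding by trying UTF-8 first, then Big5-UAO.
def read_as_utf8(bytes)
  utf8 = bytes.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  # Raises Encoding::InvalidByteSequenceError or
  # Encoding::UndefinedConversionError if the bytes are not Big5-UAO either.
  bytes.dup.force_encoding('Big5-UAO').encode('UTF-8')
end
```

It could be fed with File.binread(path), so that no encoding is assumed while reading.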

godfat commented 5 years ago

Umm... but this still can't explain it completely because we're not writing to the files if they're not valid UTF-8.

I think I still need the data to check what's up. Please find it and send it to me somehow so I can better diagnose this.
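
If it helps narrow down which files to send, something like the following could list the files that are still not valid UTF-8 after conversion. This is only a sketch: invalid_utf8_files is a hypothetical helper, not something in the repository, and it checks every regular file under the directory regardless of extension.

```ruby
# Hypothetical diagnostic (not from the project): list files under a
# directory whose raw bytes are not valid UTF-8.
def invalid_utf8_files(dir)
  Dir.glob(File.join(dir, '**', '*')).select do |path|
    File.file?(path) &&
      !File.binread(path).force_encoding('UTF-8').valid_encoding?
  end.sort
end
```

For example, puts invalid_utf8_files('/Users/wangyp/Documents/cultivate/dataall') would print the offending paths, if there are any.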