scelis / twine

Twine is a command line tool for managing your strings and their translations.

Can't process django.po file "invalid byte sequence in UTF-8" #285

Open anentropic opened 5 years ago

anentropic commented 5 years ago

Twine version 1.0.6

$ twine consume-all-localization-files twine.txt locale/ --consume-all --consume-comments --format=django
Traceback (most recent call last):
    13: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/ruby_executable_hooks:24:in `<main>'
    12: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/ruby_executable_hooks:24:in `eval'
    11: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/twine:23:in `<main>'
    10: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/twine:23:in `load'
     9: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/bin/twine:4:in `<top (required)>'
     8: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:33:in `run'
     7: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:190:in `consume_all_localization_files'
     6: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:190:in `glob'
     5: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:193:in `block in consume_all_localization_files'
     4: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:323:in `read_localization_file'
     3: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:323:in `open'
     2: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:325:in `block in read_localization_file'
     1: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/formatters/django.rb:22:in `read'
/Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/formatters/django.rb:22:in `match': invalid byte sequence in UTF-8 (ArgumentError)

Unfortunately the unhandled exception does not give any information about the location of the bad char within the file.
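Since the traceback gives no position, one way to find the offending bytes is to scan the file outside of Twine. The following is a minimal sketch, not part of Twine: `invalid_utf8_lines` is a hypothetical helper name. It reads the file in binary mode so that reading itself cannot raise, then checks each line for valid UTF-8.

```ruby
# Hypothetical helper (not part of Twine): list the lines of a file that are
# not valid UTF-8, together with an escaped dump of their raw bytes.
def invalid_utf8_lines(path)
  bad = []
  # Binary mode yields ASCII-8BIT strings, so no encoding error can occur here.
  File.open(path, "rb") do |io|
    io.each_line.with_index(1) do |raw, lineno|
      # Relabel a copy as UTF-8 without changing any bytes, then validate it.
      candidate = raw.dup.force_encoding(Encoding::UTF_8)
      next if candidate.valid_encoding?
      # raw.inspect escapes non-ASCII bytes (e.g. "\xC3"), which makes them easy to spot.
      bad << [lineno, raw.inspect]
    end
  end
  bad
end
```

Calling `invalid_utf8_lines("locale/en/LC_MESSAGES/django.po")` (the path here is only an example) returns an array of `[line_number, escaped_line]` pairs. An empty array would suggest the file itself is valid UTF-8 and the problem lies elsewhere.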

We're using these .po files fine in our Django project so I'm not sure they really contain any wrongly encoded data.

At the top of the file there's an entry like:

msgid ""
msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

If I use --encoding=ASCII-8BIT:

twine consume-all-localization-files twine.txt garage/locale/ --consume-all --consume-comments --format=django --encoding=ASCII-8BIT

then it logs "Adding new definition <msg id>" for every message in the .po file, but fails when writing the result to twine.txt with this error:

Traceback (most recent call last):
    19: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/ruby_executable_hooks:24:in `<main>'
    18: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/ruby_executable_hooks:24:in `eval'
    17: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/twine:23:in `<main>'
    16: from /Users/anentropic/.rvm/gems/ruby-2.6.3/bin/twine:23:in `load'
    15: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/bin/twine:4:in `<top (required)>'
    14: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:33:in `run'
    13: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:201:in `consume_all_localization_files'
    12: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/runner.rb:55:in `write_twine_data'
    11: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:180:in `write'
    10: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:180:in `open'
     9: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:181:in `block in write'
     8: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:181:in `each'
     7: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:188:in `block (2 levels) in write'
     6: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:188:in `each'
     5: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:206:in `block (3 levels) in write'
     4: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:206:in `each'
     3: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:207:in `block (4 levels) in write'
     2: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:224:in `write_value'
     1: from /Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:224:in `puts'
/Users/anentropic/.rvm/gems/ruby-2.6.3/gems/twine-1.0.6/lib/twine/twine_file.rb:224:in `write': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
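The second error can be reproduced in isolation with plain Ruby, independent of Twine. The sketch below is only an illustration, and the method names `transcode_error` and `relabel_as_utf8` are made up for it. A string tagged ASCII-8BIT (binary) that contains bytes above 0x7F cannot be transcoded to UTF-8, because binary bytes have no defined character mapping. Relabelling the same bytes as UTF-8 with `force_encoding` does not transcode anything and works when the bytes are in fact valid UTF-8.

```ruby
# Illustration only; these helpers are hypothetical and not part of Twine.

# Try to transcode a string to UTF-8 and return the exception, or nil on success.
def transcode_error(str)
  str.encode(Encoding::UTF_8)
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# Relabel the bytes as UTF-8 without modifying them.
def relabel_as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

# The UTF-8 bytes for "café", tagged as binary (ASCII-8BIT) via String#b.
binary = "caf\xC3\xA9".b

error = transcode_error(binary)      # an Encoding::UndefinedConversionError
relabelled = relabel_as_utf8(binary) # same bytes, now tagged UTF-8
puts error.class
puts relabelled.valid_encoding?
```

This matches the pattern in the traceback: reading with --encoding=ASCII-8BIT tags every string as binary, and the failure only appears later when those strings are written to a UTF-8 stream.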

If I modify django.rb in twine like:

        while line = io.gets
          if line != nil
            line = line.scrub("BADCHAR")
          end

...then I'm able to get complete output in my twine.txt file with no errors.

Curiously the replacement BADCHAR does not appear anywhere in the output.
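For reference, `String#scrub` only inserts the replacement where a string contains bytes that are invalid in its encoding; a valid string comes back unchanged. The sketch below shows that behaviour on two made-up lines (the sample strings and the `scrub_line` helper are assumptions, not taken from the actual .po file). It does not explain why the marker is missing from the output here; it only shows that a scrubbed invalid byte would normally leave the marker behind in that line.

```ruby
# Hypothetical helper mirroring the patch above; the marker text is arbitrary.
def scrub_line(line, marker = "BADCHAR")
  line.scrub(marker)
end

valid_line   = "msgstr \"caf\u00E9\"\n" # well-formed UTF-8
invalid_line = "msgstr \"caf\xC3\"\n"   # 0xC3 starts a two-byte sequence that is never completed

puts scrub_line(valid_line)   # unchanged
puts scrub_line(invalid_line) # the stray byte is replaced by the marker
```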

sebastianludwig commented 5 years ago

Hi @anentropic, thanks for opening the issue and sorry for not getting back to you sooner. Could you provide a minimal example file that exhibits this problem?

In general, are you sure the file is ASCII-8BIT encoded? It says charset=UTF-8 directly above... Did you encounter any problems without the --encoding parameter?