protocool / AckMate

TextMate plugin (Cocoa) shell for running 'ack'
MIT License
723 stars 34 forks source link

offending filename should be logged on Malformed UTF-8 error #18

Open ericcj opened 13 years ago

ericcj commented 13 years ago

also, this bug was closed for a bogus reason and should be reopened: https://github.com/protocool/AckMate/issues/closed#issue/13

0xradical commented 13 years ago

+1 for re-opening issue #13

Perl has different ways of treating utf8 stuff: http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

The ackmate_ack that is shipped with ackmate uses Encode::encode_utf8($buffer) instead of Encode::encode('UTF-8',$buffer). The previous malformed erros I was getting disappeared when I used the latter version of encode, although I don't know why (yet). I thought this could help others investigate this issue.

protocool commented 13 years ago

13 was closed because people were getting sick of getting emails about it.

As for using the latter version, I can't recall where I saw it, but I recall seeing a statement that encode_utf8 was the correct call. It's also odd that encode('UTF-8', $buffer) would result in fewer errors if it's more strict...

However, it's been a very long time since I burned any hours on this issue.

If you can come up with a few test files that prove it works better then I will gratefully apply any pull request.

Trev

0xradical commented 13 years ago

Hi Trev, I'll see if I can come up with something and possibly submit a pull request to your ack fork.

Best regards.

tedeh commented 13 years ago

+1 for this issue, I guess demanding a fix for the actual error might be too much but at least outputting the erroneous file name should be easy for anyone versed in Perl.

jjb commented 13 years ago

Alright folks, I've put together a solution for reporting offending files. Tell me what you think: https://github.com/protocool/AckMate/wiki/Unicode-UTF-8-error-message

obeattie commented 13 years ago

Would be really nice if this could be fixed. As far as I'm aware, all my files are encoded as UTF-8, but given that the kinds of directories I need to search over contain upward of 1,000 files, it's hard to track down the offending article. I've downgraded for now, but you know how we developers like to be on the bleeding edge :)

Even if someone could put together a script that will report all the "broken" files (or suggest one that already exists) and of course ideally fix them (wishful thinking I know!) that would be brilliant too.

artzte commented 13 years ago

I added a similar patch in another location of ackmate_ack: https://gist.github.com/1135457

airways commented 12 years ago

Addd a note to #13 just because it's the first thing you find in google, but I thought I should comment here:

Adding these lines to the end of needs_line_scan fixed this issue completely for me:

use Encode;
$buffer =  encode("UTF-8", $buffer);
return $buffer =~ /$regex/m;

Granted, there may be better places to do this encoding, but I just needed a quick fix. Hopefully this helps someone else out if they find the tickets.

joshgoebel commented 12 years ago

At the end of my needs_line_scan method (around line 2597) I changed the end of the method to:

    my $regex = $opt->{regex};
    if ( ! utf8::valid($buffer) ) {
            $buffer = Encode::encode_utf8($buffer);
    }
        return $buffer =~ /$regex/m;

And you have to add 'use utf8;' to the beginning of the script... does this seem reasonable?

protocool commented 12 years ago

Guys,

thanks for taking the lead on sorting this stuff out.

I've seen someone mention that setting export LC_CTYPE=en_US.UTF-8 in the environment helps.

I'd like to experiment with the various options before merging into AM proper, and at the same time updating AM's version of ack to the latest.

Can any of you point me to files which cause the UTF explosions so I can test against it?

Thanks, Trev

airways commented 12 years ago

I attempted to add logging as recommended by @jjb above, but the log items did not show up in the output. I can play with it a bit later unless @yyyc514 gets to it first. :)

snowblink commented 12 years ago

@yyyc514 solution works for me

@protocool I was getting stung on language files in ext.js

trisweb commented 12 years ago

No errors should be logged because AFAIK they aren't errors, they're files not in UTF-8 format being interpreted as if they are.

Here's a patch for the ackmate_ack to solve the issue - read the file with :raw format first, then convert to UTF-8 using the Encode lib. Does not complain. It also uses @protocool's (or @airways ?) encode strategy for the other possible error section, with which I had no issues.

I have not yet submitted this to the ack project because I am unsure if it's universally applicable, but it should work for our purposes.

To apply, first paste into an ackmate_ack.patch somewhere, then:

cd ~/Library/Application Support/TextMate/Plugins/AckMate.tmplugin/Contents/Resources cp [path/to/ackmate_ack.patch] . patch ackmate_ack < ackmate_ack.patch

--- ackmate_ack 2012-02-15 13:46:32.000000000 -0500
+++ ackmate_ack 2012-02-15 13:59:25.000000000 -0500
@@ -1553,12 +1553,15 @@

     # If there's no extension, or we don't recognize it, check the shebang line
     my $fh;
-    if ( !open( $fh, '<', $filename ) ) {
+    if ( !open( $fh, '<:raw', $filename ) ) {
         App::Ack::warn( "$filename: $!" );
         return;
     }
     my $header = <$fh>;
     close $fh;
+   use Encode;
+   $header = encode("UTF-8", $header);

     if ( $header =~ /^#!/ ) {
         return ($1,TEXT)       if $header =~ /\b(ruby|p(?:erl|hp|ython))\b/;
@@ -2607,10 +2610,14 @@
         App::Ack::warn( "$self->{filename}: $!" );
         return 1;
     }
+
     return 0 unless $rc && ( $rc == $size || length(Encode::encode_utf8($buffer)) == $size );

     my $regex = $opt->{regex};
-    return $buffer =~ /$regex/m;
+
+   use Encode;
+   $buffer = encode("UTF-8", $buffer);
+   return $buffer =~ /$regex/m;
 }
airways commented 12 years ago

This is basically the same as my patch for the malformed UTF-8 errors, which I posted on #13. I feel like this ticket should just be closed and the patch should be applied. No one has pointed out any issues with it, so I'm not sure why this hasn't been resolved already.

I agree that the request in this ticket seems pointless, we don't need logging for the error - we just need it to go away since ack and ackmate shouldn't tell us what format our files are supposed to be in.

trisweb commented 12 years ago

Correct. I've forked the project and attempted to apply, but the problem, you see, is that ackmate_ack is actually generated from http://github.com/petdance/ack - see: https://github.com/trisweb/AckMate/blob/master/bundle_extras/ackmate_ack.autogenerated

So it's not exactly simple.

mergulhao commented 12 years ago

@protocool I wget this file http://www.fiscontex.com.br/legislacao/ICMS/aliquotainternaicms.htm and the problem happens with it. This file seens to be written on Windows FrontPage and have weird windows encoding chars. Hope it helps.

normann commented 12 years ago

@trisweb 's patch totally work in my case.