`IncludesPass` and `CommentsPass` have issues with files using "non-standard" encodings

aytey commented 3 years ago

For:

cat unit.c
/*
* für
*/

file unit.c
unit.c: ISO-8859 text

running cvise --start-with-pass IncludesPass gives:

00:00:00 INFO ===< 118085 >===
00:00:00 INFO running 24 interestingness tests in parallel
00:00:00 INFO INITIAL PASSES
00:00:00 INFO ===< IncludesPass >===
00:00:00 WARNING IncludesPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< UnIfDefPass >===
00:00:00 INFO ===< CommentsPass >===
00:00:00 WARNING CommentsPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< IfPass >===
Traceback (most recent call last):
  File "/usr/bin/cvise", line 286, in <module>
    reducer.reduce(pass_group, skip_initial=args.skip_initial_passes)
  File "/usr/share/cvise/cvise.py", line 143, in reduce
    self._run_additional_passes(pass_group['first'])
  File "/usr/share/cvise/cvise.py", line 166, in _run_additional_passes
    self.test_manager.run_pass(p)
  File "/usr/share/cvise/utils/testing.py", line 503, in run_pass
    self.state = self.current_pass.new(self.current_test_case, self.check_sanity)
  File "/usr/share/cvise/passes/ifs.py", line 36, in new
    bs = BinaryState.create(self.__count_instances(test_case))
  File "/usr/share/cvise/passes/ifs.py", line 22, in __count_instances
    for line in in_file.readlines():
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 6: invalid start byte

(this was after editing cvise/utils/testing.py to set GIVEUP_CONSTANT to 1)

and creates:

Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************

IncludesPass has encountered a bug:
pass got stuck
state: 2

Please consider tarring up cvise_bug_0
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.

***************************************************

and

Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************

CommentsPass has encountered a bug:
pass got stuck
state: -1

Please consider tarring up cvise_bug_1
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.

***************************************************

marxin commented 3 years ago

Well, it's technically a bug, but I don't see why would somebody use a non-utf8 encoding for files? Doctor it hurts. Then don't do it :D

aytey commented 3 years ago

I don't see why would somebody use a non-utf8 encoding for files?

The tool my company builds (and one of the reasons I use cvise) automatically builds data-driven unit test "wrappers" for C/C++ -- as part of this, we just "gobble up" whatever code our customers have.

Our customers 100% have code that isn't utf8/ascii (the above comes from a support case I was helping with).

I think that adding something like:

(code I wrote) would make cvise more robust here.

Would you be happy if I made some utility function like safe_open that wraps open along with a call to get_file_encoding)? I'm thinking that something like:

https://github.com/marxin/cvise/blob/38b42d95a5280d890008489f1d31f297bd920e6d/cvise/passes/ifs.py#L21

could then use safe_open vs. "regular" open.

Thoughts?

Alternatively maybe cvise should check if a file is openable as utf8 before starting, and "fail early" (e.g., as part of the initial sanity check) if it isn't?

marxin commented 3 years ago

Thank you for your ideas. So you're using a Python package chardet which seems reasonable to me. A question: if you have a non-utf8 test-case, what about converting it first to utf8 and then let cvise run? At the end you can convert it to a given encoding. The problem I see with safe_open is that we'll need to use it at various places. Similarly for the writing of the output.

aytey commented 3 years ago

How about if cvise (as part of the "initial sanity check") validates if the file can be opened as utf8? If it can't, we have three options:

Bail-out and tell the user "sorry, unsupported"
Add a flag that says "convert to utf8 and try as normal, I understand if the initial tests now might fail and cvise will give up"
Add a flag that says "automatically skip any transformations that need utf8"

We could actually do all three of these; without 2 or 3, 1 happens. If you have 2, then 3 is redundant. If you have 3, 2 is redundant (and it won't run things like IncludesPass).

Thoughts?

marxin commented 3 years ago

Well, I like doing 2) using chardet. That's going to help in most cases (hopefully)..

marxin commented 3 years ago

Well, I like doing 2) using chardet. That's going to help in most cases (hopefully)..

@andrewvaughanj I've just implemented the first version, can you please take a look?

aytey commented 3 years ago

I can try this next week; should args.to_utf8 have a "sanity check" afterwards?

marxin commented 3 years ago

Well, the conversion does happen before the normal sanity check. Thanks for testing.

marxin commented 2 years ago

Any update about the testing, please?

marxin commented 2 years ago

Any update about the testing, please?

PING :)

marxin commented 2 years ago

Any news about this @andrewvaughanj ?

marxin / cvise

`IncludesPass` and `CommentsPass` have issues with files using "non-standard" encodings #59