richox / libzling

fast and niubility compression library.
88 stars 20 forks source link

Undefined behavior detected by ubsan #8

Closed nemequ closed 8 years ago

nemequ commented 9 years ago

ubsan with latest zling (870b761cc2860c400c9be2cd0f54a51aadf9594d):

$ ./zling_demo e ~/local/src/squash-benchmark/alice29.txt alice29.txt.zling && ./zling_demo d alice29.txt.zling alice29.txt
zling:
   light-weight lossless data compression utility
   by Zhang Li <zhangli10 at baidu.com>

/home/nemequ/local/src/libzling/src/libzling_lz.cpp:46:62: runtime error: load of misaligned address 0x7f2305dbd012 for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd012: note: pointer points here
 00 00  0d 0a 0d 0a 0d 0a 0d 0a  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  41 4c 49 43 45 27
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:60:51: runtime error: load of misaligned address 0x7f2305dbd01a for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd01a: note: pointer points here
 0d 0a  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e 54
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:60:51: runtime error: load of misaligned address 0x7f2305dbd019 for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd019: note: pointer points here
 0a 0d 0a  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:63:69: runtime error: load of misaligned address 0x7f2305dbd01a for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd01a: note: pointer points here
 0d 0a  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e 54
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:63:69: runtime error: load of misaligned address 0x7f2305dbd019 for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd019: note: pointer points here
 0a 0d 0a  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:68:66: runtime error: load of misaligned address 0x7f2305dbd025 for type 'unsigned char', which requires 2 byte alignment
0x7f2305dbd025: note: pointer points here
 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e 54 55 52  45 53 20 49 4e 20 57 4f  4e
             ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:230:62: runtime error: load of misaligned address 0x7f2305dbd026 for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd026: note: pointer points here
 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e 54 55 52  45 53 20 49 4e 20 57 4f  4e 44
             ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:230:62: runtime error: load of misaligned address 0x7f2305dbd025 for type 'unsigned char', which requires 4 byte alignment
0x7f2305dbd025: note: pointer points here
 20 20 20 20 20 20 20  41 4c 49 43 45 27 53 20  41 44 56 45 4e 54 55 52  45 53 20 49 4e 20 57 4f  4e
             ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:68:66: runtime error: load of misaligned address 0x7f2305dbd059 for type 'unsigned char', which requires 2 byte alignment
0x7f2305dbd059: note: pointer points here
 20 20 20  20 20 20 20 20 20 20 20  20 20 20 20 20 20 4c 65  77 69 73 20 43 61 72 72  6f 6c 6c 0d 0a
              ^ 
  0.15 MB =>   0.05 MB 35.30%, 0.322 sec, speed=0.473 MB/sec
encode: 152089 => 53682, time=0.322 sec, speed=0.473 MB/sec
zling:
   light-weight lossless data compression utility
   by Zhang Li <zhangli10 at baidu.com>

/home/nemequ/local/src/libzling/src/libzling_lz.cpp:83:96: runtime error: load of misaligned address 0x7fb92920f019 for type 'unsigned char', which requires 4 byte alignment
0x7fb92920f019: note: pointer points here
 0a 0d 0a  20 20 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:83:96: runtime error: store to misaligned address 0x7fb92920f01a for type 'unsigned char', which requires 4 byte alignment
0x7fb92920f01a: note: pointer points here
 0d 0a  20 20 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:88:96: runtime error: load of misaligned address 0x7fb92920f019 for type 'unsigned char', which requires 4 byte alignment
0x7fb92920f019: note: pointer points here
 0a 0d 0a  20 20 20 20 20 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^ 
/home/nemequ/local/src/libzling/src/libzling_lz.cpp:88:96: runtime error: store to misaligned address 0x7fb92920f01d for type 'unsigned char', which requires 4 byte alignment
0x7fb92920f01d: note: pointer points here
 20 20 20 20 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
             ^ 
  0.15 MB <=   0.05 MB 35.30%, 0.098 sec, speed=1.553 MB/sec
decode: 152089 <= 53682, time=0.098 sec, speed=1.552 MB/sec
nemequ commented 9 years ago

Continuing the conversation from commit 9d2c74535a9f1b350b2cd75dd69ac034c5607f01, (which I completely forgot about until this morning, sorry):

@FrankHB wrote (talking about § 5.2.10 paragraph 7 of n3337)

The wording here is "unspecified" rather than "undefined". This means a C++ compiler has no right to assume misaligned pointer conversions never occur as a C11-conforming compiler.

I'm not sure where you're getting that. § 1.3.25 defines "unspecified behavior" as:

behavior, for a well-formed program construct and correct data, that depends on the implementation

[Note: The implementation is not required to document which behavior occurs. The range of possible behaviors is usually delineated by this International Standard. — end note]

Of course, AFAICT the standard doesn't actually delineate the range of possible behaviors in this case, but I don't know how you get from there to the only possible behavior being that it works. To me, this reads more as "it may or may work, depending on the implementation".

FrankHB commented 9 years ago

Both "undefined" and "unspecified" render the resulted behavior implementation-dependent. The most obvious difference is between "nonportable or erroneous" and "correct", i.e. if the guarantees given by rules of the standard would still work when such behavior occurred. So if a program has both unspecified behavior and undefined behavior, the latter wins.

For C++, the intent of the difference might be more explicit (than C):

(N3337 1.9)

5 A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

This can be simply read as, everything is implementation-dependent if a program contains undefined behavior. If the program contains no undefined behavior, the implementation has no rights to propagate the "unspecified" to the whole program.

nemequ commented 9 years ago

In that case,

This means a C++ compiler has no right to assume misaligned pointer conversions never occur as a C11-conforming compiler.

Seems completely wrong. If the behavior is implementation dependent, the implementation is free to assume misaligned pointer conversions never occur if that is the implementation they chose… which, AFAIK, is what they do (at least for C). Since the range of possible behaviors is not enumerated in the standard, this seems like a perfectly valid assumption to make.

My understanding is that the only reason misaligned accesses often still work is that most instructions on x86, as well as newer ARM, allow misaligned access. However certain instructions (like some vector instructions) don't, so you can easily run into trouble once the compiler gets better at optimizing your code (this recently happened to LZ4 with gcc ≥ 4.8). And, of course, other architectures aren't as forgiving—MIPS tends to have lots of trouble.

FrankHB commented 9 years ago

No. If there is no undefined behavior, the implementation should still do something sane to make other parts of the program work as long as the behavior does not depend on the result of misaligned pointer operations, whether there exists unspecified behavior/implementation-defined behavior, etc. Once there is undefined behavior, the implementation need not do so and can just simply crash even when compiling. That is what required and allowed by the standard.

Actual compilers cannot have too many options and crash during compiling sounds stupid. So there should (likely) be the output program, but this program would probably misbehave (or not) on running, depending on implementation (the compiler and/or the runtime) you use. However, they are still equally evil to the language spec. On the other hands, you have already made the problem unnecessarily complex. If you want to know what would it be exactly, you'd better reference some other credible specs of the implementations, or read the low level (assembly) code yourself, on each specific target platforms, as you have illustrated for MIPS.

richox commented 9 years ago

The bug happens when encoder produces overflow matches. now I stop encoder from matching last handred bytes and this bug should be fixed.