Utf8 decode error

childenProtos commented 1 year ago

I am comparing two ARM GCC ELF files. If I do not specify bin_dir, the report is generated successfully, but I get the "Unable to read assembly from binary" warning.

If I specify the correct bin_dir + bin_prefix, the warning disappears, but I get the following output instead (with a UTF-8 decode error):

py -m elf_diff --bin_dir tools\arm-gcc\bin --bin_prefix "arm-none-eabi-" --html_dir report2 [OLD].elf [NEW].elf
Tools:
   objdump: tools\arm-gcc\bin\arm-none-eabi-objdump.exe
   nm:      tools\arm-gcc\bin\arm-none-eabi-nm.exe
   readelf:      tools\arm-gcc\bin\arm-none-eabi-readelf.exe
   size:    tools\arm-gcc\bin\arm-none-eabi-size.exe
Verifying config keys...
Symbol selection regex:
   old binary: 'None'
   new binary: 'None'
Symbol exclusion regex:
   old binary: 'None'
   new binary: 'None'
Parsing symbols of old binary ([OLD].elf)
File format of binary [OLD].elf: elf32-littlearm
Extracting symbols
100% (5577 of 5577) |#####################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Gathering instructions
100% (223307 of 223307) |#################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Parsing symbols of new binary ([NEW].elf)
File format of binary [NEW].elf: elf32-littlearm
Extracting symbols
100% (5564 of 5564) |#####################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Gathering instructions
================================================================================

Traceback (most recent call last):
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\__main__.py", line 124, in main
    exportDocument(settings)
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\__main__.py", line 66, in exportDocument
    document: ValueTreeNode = generateDocument(settings)
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\pair_report_document.py", line 1167, in generateDocument
    meta_document.configureValueTree(value_tree, settings=settings)
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\pair_report_document.py", line 976, in configureValueTree
    self.binary_pair = BinaryPair(
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\binary_pair.py", line 103, in __init__
    self.new_binary = Binary(
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\binary.py", line 78, in __init__
    self._initSymbols()
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\binary.py", line 122, in _initSymbols
    self._gatherSymbolInstructions()
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\binary.py", line 108, in _gatherSymbolInstructions
    instruction_collector.gatherSymbolInstructions(
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\instruction_collector.py", line 136, in gatherSymbolInstructions
    objdump_output: str = runSystemCommand(
  File "C:\[...]\Python\Python310\lib\site-packages\elf_diff\system_command.py", line 33, in runSystemCommand
    output: str = o.decode("utf8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12242097: invalid start byte

================================================================================
 elf_diff is unconsolable :-( Something went wrong
================================================================================

 Error:  'utf-8' codec can't decode byte 0xfc in position 12242097: invalid start byte

================================================================================
 Don't let this take you down! Have a nice hot coffee and start over.
================================================================================

Is there any way I can debug the source of the error / find out what is causing the wrong utf-8 string?

noseglasses commented 10 months ago

Sorry for this answer coming pretty late. I am currently too busy to work on this project. You might want to try replacing the decode call in line 33 of system_command.py with output: str = o.decode("utf8", errors="ignore"). I am not sure, though, which character causes the decoding to fail.
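
For anyone weighing that workaround, here is a minimal sketch of how Python's decode error handlers differ. The sample bytes are made up for illustration (0xfc is "ü" in CP1252 but is not a valid UTF-8 start byte); they are not taken from the failing objdump output.

```python
# Illustrative bytes only: 0xfc is "ü" in CP1252 but not valid UTF-8.
raw = b"// Pr\xfcfung\n"

ignored = raw.decode("utf8", errors="ignore")    # drops the byte silently
replaced = raw.decode("utf8", errors="replace")  # inserts U+FFFD as a marker
as_cp1252 = raw.decode("cp1252")                 # decodes the byte as "ü"

print(repr(ignored))
print(repr(replaced))
print(repr(as_cp1252))
```

With errors="ignore" the run completes, but the offending characters vanish from the decoded text, so errors="replace" may be preferable if you want to see where they were.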

me21 commented 2 months ago

Here's my two cents on this issue:

I replaced that call with

    try:
        output: str = o.decode("utf8")
    except UnicodeDecodeError:
        # Dump the raw subprocess output so the offending bytes can be inspected.
        with open("subprocess_output.txt", "wb") as f:
            f.write(o)
        raise

and got a text file containing the problematic output. In my case it was a section sign (0xA7) in a line containing source code. It appears my sources are encoded as CP1252 rather than UTF-8. After replacing the codec in the decode call, elf_diff ran smoothly.
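
If the dump is large, the position reported in the exception can be used to pull out just the offending line. This is a rough sketch, not part of elf_diff: find_bad_line is a hypothetical helper and the sample bytes are invented.

```python
from typing import Optional


def find_bad_line(data: bytes, encoding: str = "utf8") -> Optional[bytes]:
    """Return the raw line containing the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte in the input.
        line_start = data.rfind(b"\n", 0, err.start) + 1
        line_end = data.find(b"\n", err.start)
        if line_end == -1:
            line_end = len(data)
        return data[line_start:line_end]
    return None


# Invented sample: a CP1252 section sign (0xA7) inside a source comment.
sample = b"mov r0, r1\n// see \xa7 4.2\nbx lr\n"
print(find_bad_line(sample))
```

The same function could be applied to the contents of subprocess_output.txt read in binary mode.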

It would be nice to add a source file encoding option to the elf_diff command. The encoding may also need to be set separately for the first and second ELF file.
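
A rough idea of what such an option could feed into, assuming a list of candidate encodings is passed in per binary. decode_output and its parameters are hypothetical and do not exist in elf_diff.

```python
from typing import Sequence


def decode_output(raw: bytes, encodings: Sequence[str] = ("utf8", "cp1252")) -> str:
    """Try each candidate encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf8", errors="replace")


# Each binary could get its own candidate list.
old_text = decode_output(b"\xc2\xa7 1", ("utf8",))
new_text = decode_output(b"\xa7 1", ("utf8", "cp1252"))
print(old_text, new_text)
```

Trying UTF-8 first is a deliberate choice here: text in a single-byte encoding such as CP1252 rarely happens to be valid UTF-8, while almost any byte sequence decodes under CP1252, so the reverse order would hide real UTF-8 input.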

noseglasses / elf_diff

Utf8 decode error #100