scottrogowski / code2flow

Pretty good call graphs for dynamic languages
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4895: character maps to <undefined> #28

Closed sgbaird closed 3 years ago

sgbaird commented 3 years ago
C:\Users\sterg\Documents\GitHub\sparks-baird\CrabNet\crabnet>code2flow model.py
Code2Flow: Found 1 files from sources argument.
Code2Flow: Implicitly detected language as 'py'.
Code2Flow: Processing 1 source file(s).
Code2Flow:   model.py
Traceback (most recent call last):
  File "C:\Users\sterg\AppData\Local\Programs\Python\Python39\Scripts\code2flow-script.py", line 33, in <module>
    sys.exit(load_entry_point('code2flow==2.2.0', 'console_scripts', 'code2flow')())
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\site-packages\code2flow\engine.py", line 625, in main
    code2flow(
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\site-packages\code2flow\engine.py", line 531, in code2flow
    file_groups, all_nodes, edges = map_it(sources, language, no_trimming,
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\site-packages\code2flow\engine.py", line 317, in map_it
    raise ex
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\site-packages\code2flow\engine.py", line 312, in map_it
    file_ast_trees.append((source, language.get_tree(source, lang_params)))
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\site-packages\code2flow\python.py", line 155, in get_tree
    tree = ast.parse(f.read())
  File "c:\users\sterg\appdata\local\programs\python\python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4895: character maps to <undefined>

scottrogowski commented 3 years ago

Interesting. Code2flow handles Python files encoded in standard Unicode fine, so my guess (I'm not sure) is that your file has a non-standard encoding.

Do you have any more details or, better yet, could you share the file?

sgbaird commented 3 years ago

CrabNet model.py. My guess is that it's probably the emoji 😅, e.g. ♻🗑️
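
One way to test the emoji guess in isolation is to encode the suspected characters as UTF-8 and decode the resulting bytes with cp1252, the codec named in the traceback. This is only a minimal sketch: the helper name `first_undecodable_byte` is made up for illustration, and it assumes the file on disk is stored as UTF-8.

```python
def first_undecodable_byte(text, codec="cp1252"):
    """Encode text as UTF-8, then try to decode the bytes with `codec`.

    Returns the first byte the codec cannot decode, or None if every
    byte maps to some character (possibly producing mojibake).
    """
    raw = text.encode("utf-8")
    try:
        raw.decode(codec)
    except UnicodeDecodeError as err:
        return raw[err.start]
    return None


recycle = "\u267b"                # the recycling symbol, no variation selector
wastebasket = "\U0001f5d1\ufe0f"  # U+1F5D1 followed by variation selector U+FE0F

for label, text in [("recycle", recycle), ("wastebasket", wastebasket)]:
    bad = first_undecodable_byte(text)
    print(label, "decodes" if bad is None else f"fails on byte {bad:#04x}")
```

If the same thing happened here, the byte 0x8f in the traceback would come from the variation selector U+FE0F (UTF-8 bytes EF B8 8F) rather than from the emoji code point itself. That is an inference from the byte values, not something the traceback alone confirms.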

scottrogowski commented 3 years ago

I downloaded your file and it actually processes for me. Looking closer at the traceback, it looks like on your machine, the file is encoded in Windows-1252.

I'm not familiar with Windows, so I don't immediately know how to address this. A simple temporary workaround might be to manually convert the file(s) to Unicode, but I don't know how you would do this on Windows.
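
If a file really were stored as Windows-1252, the conversion mentioned above could be done with a few lines of Python, which behave the same on Windows. This is a sketch under that assumption only: the function name `reencode_to_utf8` is hypothetical, and running it on a file that is already UTF-8 would corrupt any non-ASCII characters, so the file's actual encoding should be checked first.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite `path` in place as UTF-8, assuming it is currently
    stored in `source_encoding`.

    Raises UnicodeDecodeError if the bytes are not valid in that
    encoding, in which case the file is left untouched.
    """
    path = Path(path)
    # Work on bytes so that line endings are preserved exactly.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

A call such as `reencode_to_utf8("model.py")` would be the whole workaround; keeping a backup copy first is a sensible precaution.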

I'm swamped with all sorts of things right now, so I probably won't be able to investigate further for a week or two.

sgbaird commented 3 years ago

Thank you! That's perfect. Didn't consider that it might be a Windows-specific issue.

scottrogowski commented 3 years ago

Could you do me two favors? They would help me figure out what's wrong here:

  1. Run this and let me know what comes out: python3 -c "import locale; print(locale.getpreferredencoding())". This command determines how files are read by default with your Python.
  2. Install charset-normalizer (https://pypi.org/project/charset-normalizer/) with pip3 install charset-normalizer, then run normalizer path\to\crabnetmodel.py. This command should output a small JSON blurb that tells me that specific file's encoding.

sgbaird commented 3 years ago

How files are read by default in Python

python3 -c "import locale; print(locale.getpreferredencoding())"
cp1252
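
The value printed above matters because, when no `encoding` argument is given, Python's built-in `open()` in text mode falls back to the locale's preferred encoding (unless UTF-8 mode is enabled). The following is a small sketch for confirming that on a given machine; the scratch file exists only so that there is something to open, and the variable names are illustrative.

```python
import codecs
import locale
import os
import tempfile

# Create an empty scratch file so there is something to open.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    scratch_path = tmp.name

try:
    # Open it the way a tool would if it passed no encoding argument.
    with open(scratch_path) as f:
        default_used = codecs.lookup(f.encoding).name
finally:
    os.remove(scratch_path)

# Normalise the name through codecs.lookup so that spellings such as
# "UTF-8" and "utf-8" compare equal.
preferred = codecs.lookup(locale.getpreferredencoding(False)).name
print(default_used, preferred)
```

On a machine configured like the one in this thread, both names would be expected to report cp1252.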

Normalizer Outputs

model.py

normalizer .\CrabNet\crabnet\model.py
{
    "path": "C:\\Users\\sterg\\Documents\\GitHub\\sparks-baird\\ElM2D\\CrabNet\\crabnet\\model.py",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Emoticons range(Emoji)",
        "Miscellaneous Symbols",
        "Miscellaneous Symbols and Pictographs",
        "Transport and Map Symbols",
        "Variation Selectors"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

kingcrab.py

Note that code2flow crabnet/kingcrab.py produces out.png: [image: out]

normalizer .\CrabNet\crabnet\kingcrab.py
{
    "path": "C:\\Users\\sterg\\Documents\\GitHub\\sparks-baird\\ElM2D\\CrabNet\\crabnet\\kingcrab.py",
    "encoding": "ascii",
    "encoding_aliases": [
        "646",
        "ansi_x3.4_1968",
        "ansi_x3_4_1968",
        "ansi_x3.4_1986",
        "cp367",
        "csascii",
        "ibm367",
        "iso646_us",
        "iso_646.irv_1991",
        "iso_ir_6",
        "us",
        "us_ascii"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

train_crabnet.py

And finally, one more file from the same repository (train_crabnet.py):

normalizer .\CrabNet\train_crabnet.py
{
    "path": "C:\\Users\\sterg\\Documents\\GitHub\\sparks-baird\\ElM2D\\CrabNet\\train_crabnet.py",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 10.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

[image: out]

DiffChecker (https://www.diffchecker.com/)

kingcrab.py vs. train_crabnet.py

[image: diff screenshot]

train_crabnet.py vs. model.py

[image: diff screenshot]

sgbaird commented 3 years ago

Looks like it was the emoji. After getting rid of it:

code2flow crabnet/model.py

[image: out]

normalizer .\CrabNet\crabnet\model.py
{
    "path": "C:\\Users\\sterg\\Documents\\GitHub\\sparks-baird\\ElM2D\\CrabNet\\crabnet\\model.py",
    "encoding": "ascii",
    "encoding_aliases": [
        "646",
        "ansi_x3.4_1968",
        "ansi_x3_4_1968",
        "ansi_x3.4_1986",
        "cp367",
        "csascii",
        "ibm367",
        "iso646_us",
        "iso_646.irv_1991",
        "iso_ir_6",
        "us",
        "us_ascii"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 100.0,
    "unicode_path": null,
    "is_preferred": true
}

scottrogowski commented 3 years ago

@sgbaird I think I have a fix. Could you pull it down and verify? https://github.com/scottrogowski/code2flow/pull/31

scottrogowski commented 3 years ago

Addressed in the 2.3.0 release
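
For readers hitting the same error in their own tools, a common way to read Python source independently of the locale is `tokenize.open()` from the standard library, which detects the encoding from a BOM or PEP 263 coding cookie and otherwise uses UTF-8. The sketch below illustrates that general technique; it is not taken from the linked pull request, and `parse_python_file` is a hypothetical name.

```python
import ast
import tokenize


def parse_python_file(path):
    """Parse a Python source file into an AST without relying on the
    platform's default text encoding."""
    # tokenize.open() chooses the encoding the way the interpreter does:
    # BOM or coding cookie if present, UTF-8 otherwise.
    with tokenize.open(path) as f:
        return ast.parse(f.read(), filename=str(path))
```

On Python 3.7 or later, setting the environment variable PYTHONUTF8=1 before running a tool is another option, since it makes UTF-8 the default for `open()` without changing the tool's code.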