UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte

leosh64 commented 1 year ago

Getting this error during generation of embeddings:

Traceback (most recent call last):
  File "/home/user/.local/bin/sem", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 84, in main
    query_func(args)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 38, in query_func
    do_query(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/query.py", line 51, in do_query
    do_embed(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 82, in do_embed
    functions = _get_repo_functions(
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 71, in _get_repo_functions
    file_content = f.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte

after it already successfully processed quite a few files:

 27%|████████████████████████▎                                                                 | 35036/130013 [00:33<01:30, 1047.23it/s]

leosh64 commented 1 year ago

As a workaround, I wrapped the affected lines in a try/except:

def _get_repo_functions(root, supported_file_extensions, relevant_node_types):
    functions = []
    print('Extracting functions from {}'.format(root))
    for fp in tqdm([root + '/' + f for f in os.popen('git -C {} ls-files'.format(root)).read().split('\n')]):
        if not os.path.isfile(fp):
            continue
        with open(fp, 'r') as f:
            lang = supported_file_extensions.get(fp[fp.rfind('.'):])
            if lang:
                try:
                    parser = get_parser(lang)
                    file_content = f.read()
                    tree = parser.parse(bytes(file_content, 'utf8'))
                    all_nodes = list(_traverse_tree(tree.root_node))
                    functions.extend(_extract_functions(
                        all_nodes, fp, file_content, relevant_node_types))
                except Exception as e:
                    print(f"Hit error while parsing {fp}: {e}")
    return functions

With this change, the error is reported for quite a lot of third-party files in my repo. Since these are third-party, I cannot update or fix them. Should sem be made robust against such issues?

Maybe the requirement to have UTF-8 encoding for the files could be dropped. Ideas: https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s
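One way the idea above could look, as a minimal sketch and not what sem actually does: read the file as bytes and decode with errors="replace", so an invalid byte such as 0xb9 becomes U+FFFD instead of raising. The helper name read_source_lenient is hypothetical and is not part of semantic_code_search.

```python
def read_source_lenient(path, encoding="utf-8"):
    """Read a file as text without raising on undecodable bytes.

    Hypothetical helper for illustration; not part of semantic_code_search.
    """
    # Read raw bytes so that decoding is under our control.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # Fast path: the file is valid in the expected encoding.
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: each undecodable byte is replaced with U+FFFD.
        return raw.decode(encoding, errors="replace")
```

The trade-off is that the returned text no longer matches the on-disk bytes exactly wherever a replacement happened, so the extracted function text may contain U+FFFD characters for those files.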

nnWhisperer commented 1 year ago

Using your code, I found the non-UTF-8 files and converted them to UTF-8, then restarted sem. It now gets through the converted files without errors.
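For anyone who wants to list the offending files up front, a rough sketch follows. It assumes you already have a list of file paths (for example from git ls-files); the function name find_non_utf8_files is made up for illustration.

```python
def find_non_utf8_files(paths):
    """Return the paths whose contents are not valid UTF-8.

    Hypothetical helper for illustration; not part of semantic_code_search.
    """
    bad = []
    for path in paths:
        # Read raw bytes and attempt a strict UTF-8 decode.
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Converting a reported file to UTF-8 still requires knowing its original encoding, which this check cannot tell you.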

sturdy-dev / semantic-code-search

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte #28