Open leosh64 opened 1 year ago
As a workaround, I just wrapped the affected lines in try/except:
def _get_repo_functions(root, supported_file_extensions, relevant_node_types):
    functions = []
    print('Extracting functions from {}'.format(root))
    for fp in tqdm([root + '/' + f for f in os.popen('git -C {} ls-files'.format(root)).read().split('\n')]):
        if not os.path.isfile(fp):
            continue
        with open(fp, 'r') as f:
            lang = supported_file_extensions.get(fp[fp.rfind('.'):])
            if lang:
                try:
                    parser = get_parser(lang)
                    file_content = f.read()
                    tree = parser.parse(bytes(file_content, 'utf8'))
                    all_nodes = list(_traverse_tree(tree.root_node))
                    functions.extend(_extract_functions(
                        all_nodes, fp, file_content, relevant_node_types))
                except Exception as e:
                    print(f"Hit error while parsing {fp}: {e}")
    return functions
With this in place, errors show up for quite a lot of third-party files in my repo. Since these are third-party, I cannot update or fix them. Should sem be made robust against such issues?
Maybe the requirement to have UTF-8 encoding for the files could be dropped. Ideas: https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s
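To make the idea above concrete, here is a minimal sketch of a lenient reader that tries UTF-8 first and falls back to other encodings. This is not sem's code: the helper name `read_text_lenient` and the choice of `cp1252` as the fallback are my own assumptions, and the right fallback depends on where the files come from.

```python
from pathlib import Path


def read_text_lenient(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding_used) without raising
    UnicodeDecodeError, whatever bytes the file contains."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

The returned text could then replace `f.read()` in the snippet above. Since that snippet re-encodes the text with `bytes(file_content, 'utf8')` before parsing, the parser and the extraction step would still see the same string.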
Using your code, I found the non-UTF-8 files and changed their encodings, then restarted sem. It now gets through the files I fixed.
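For anyone who wants to list the offending files up front rather than wait for the parse errors, a rough sketch follows. The function name `find_non_utf8_files` is made up for illustration, and it assumes you pass it the same file paths that sem would visit.

```python
import os


def find_non_utf8_files(paths):
    """Hypothetical helper: return (path, byte_offset) for every file
    whose contents are not valid UTF-8."""
    bad = []
    for path in paths:
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            data = fh.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte.
            bad.append((path, err.start))
    return bad
```

The byte offset makes it easier to locate the offending character in an editor before converting the file.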
Getting this error during generation of embeddings:
after it already successfully processed quite a few files: