I hit an uncaught error while running ocw-data-parser on a course, during subtitle conversion. To reproduce, convert the course at PROD/physics in ocw-content-storage.
Note: This is a different error from #150. That issue deals with a VTT conversion error, which is caught and logged; this one is a UnicodeDecodeError, which is not caught.
Stacktrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/utils.py", line 266, in parse_all
    parser = ocw_data_parser.OCWParser(
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/ocw_data_parser.py", line 468, in __init__
    self.populate_vtt_files()
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/ocw_data_parser.py", line 943, in populate_vtt_files
    new_json = convert_to_vtt(loaded_json)
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/utils.py", line 317, in convert_to_vtt
    webvtt.from_srt(Path(temp_dir) / "data").save()
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/webvtt.py", line 48, in from_srt
    parser = SRTParser().read(file)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 24, in read
    content = self._get_content_from_file(file_path=file)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 40, in _get_content_from_file
    return self._read_content_lines(f)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 54, in _read_content_lines
    lines = [line.rstrip('\n\r') for line in file_obj.readlines()]
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 706: invalid start byte
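Possible workaround (a rough sketch, not a tested fix): the traceback shows the SRT file being read as UTF-8, and byte 0xbd is not a valid UTF-8 start byte, so the file is probably in some other encoding. I don't know which one; the sketch below assumes Windows-1252 as a fallback, where 0xbd is "½". The helper names `read_text_with_fallback` and `normalize_to_utf8` are hypothetical and are not part of ocw-data-parser or webvtt-py.

```python
from pathlib import Path
from typing import Sequence


def read_text_with_fallback(path, encodings: Sequence[str] = ("utf-8", "cp1252")) -> str:
    """Decode a file by trying each encoding in order.

    Hypothetical helper. The cp1252 fallback is an assumption about the
    source data, not something confirmed by the traceback.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes with U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


def normalize_to_utf8(path) -> None:
    """Rewrite the file in place as UTF-8 so a UTF-8-only reader can open it."""
    text = read_text_with_fallback(path)
    # Write bytes directly so line endings are left exactly as they were.
    Path(path).write_bytes(text.encode("utf-8"))
```

If this approach is acceptable, `normalize_to_utf8(Path(temp_dir) / "data")` could be called in `convert_to_vtt` just before the `webvtt.from_srt(...)` call. Alternatively, the `UnicodeDecodeError` could be caught and logged there, the same way the conversion error in #150 is handled.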