mitodl / ocw-data-parser

A parsing script for MIT OpenCourseWare course data

Error converting subtitle halts parse_all #151

Closed: noisecapella closed this issue 3 years ago

noisecapella commented 3 years ago

I hit an error while running ocw-data-parser on a course: the parser fails during subtitle conversion. To reproduce, convert the course at PROD/physics in ocw-content-storage.

Note: This is a different error from #150. That issue deals with a VTT conversion error, which is caught and logged. This one is a Unicode error, which is not caught, so it halts parse_all.
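
To illustrate the difference between a caught-and-logged error and one that halts the whole run, here is a minimal sketch of a loop that catches UnicodeDecodeError per item and keeps going. The names `convert_all`, `items`, and `convert` are hypothetical and are not part of ocw-data-parser; this is not the project's actual fix, only one possible shape for it.

```python
import logging

log = logging.getLogger(__name__)


def convert_all(items, convert):
    """Apply `convert` to each (name, payload) pair.

    Hypothetical helper: items whose payload cannot be decoded are
    logged and skipped instead of aborting the whole run.
    """
    converted = {}
    failed = []
    for name, payload in items:
        try:
            converted[name] = convert(payload)
        except UnicodeDecodeError as err:
            # Log and continue so one bad subtitle does not stop the rest
            log.error("Skipping %s: %s", name, err)
            failed.append(name)
    return converted, failed
```

Returning the list of failed names alongside the results lets the caller report which subtitles were skipped once the run finishes.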

Stacktrace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/utils.py", line 266, in parse_all
    parser = ocw_data_parser.OCWParser(
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/ocw_data_parser.py", line 468, in __init__
    self.populate_vtt_files()
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/ocw_data_parser.py", line 943, in populate_vtt_files
    new_json = convert_to_vtt(loaded_json)
  File "/home/george/Projects/ocw-data-parser/ocw_data_parser/utils.py", line 317, in convert_to_vtt
    webvtt.from_srt(Path(temp_dir) / "data").save()
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/webvtt.py", line 48, in from_srt
    parser = SRTParser().read(file)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 24, in read
    content = self._get_content_from_file(file_path=file)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 40, in _get_content_from_file
    return self._read_content_lines(f)
  File "/home/george/ocw-data-parser-venv/lib/python3.8/site-packages/webvtt_py-0.4.6-py3.8.egg/webvtt/parsers.py", line 54, in _read_content_lines
    lines = [line.rstrip('\n\r') for line in file_obj.readlines()]
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 706: invalid start byte
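
The last line of the trace can be reproduced in isolation: 0xbd is a UTF-8 continuation byte, so it cannot start a character. The sketch below tries UTF-8 first and falls back to Latin-1, where 0xbd is "½". That the subtitle file is actually Latin-1 encoded is an assumption, not something the trace confirms, and `decode_subtitle_bytes` is a hypothetical helper, not part of ocw-data-parser or webvtt-py.

```python
def decode_subtitle_bytes(raw: bytes) -> str:
    """Decode subtitle bytes, trying UTF-8 first.

    Hypothetical helper. The Latin-1 fallback is an assumption about
    the source encoding; Latin-1 maps every byte to a character, so
    the fallback itself never raises.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Same failure mode as in the traceback: 0xbd is not a valid start byte
try:
    b"1\xbd".decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # invalid start byte

print(decode_subtitle_bytes(b"1\xbd"))  # 1½
```

A fallback like this can silently produce wrong characters if the file is in some other encoding, so logging whenever the fallback is taken would be a sensible addition.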

pdpinch commented 3 years ago

Note that the courses in PROD/physics are "High School" courses from https://ocw.mit.edu/high-school/physics/