Error when loading gold file

valsworthen commented 5 years ago

Hello,

I am trying to generate predictions from the sample of the development set provided here. However, the script fails upon loading the gold data.

Input: python natural_questions/make_test_data.py --gold_path sample/v1.0_sample_nq-dev-sample.jsonl.gz --output_path whatever

Error:

I0201 17:10:07.177878 140447407568704 eval_utils.py:261] parsing sample/v1.0_sample_nq-dev-sample.jsonl.gz ..... 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "XXX/natural-questions/eval_utils.py", line 264, in read_annotation_from_one_split
    for line in input_file:
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "XXX/miniconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 406, in _read_gzip_header
    magic = self._fp.read(2)
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "XXX/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "natural-questions/make_test_data.py", line 120, in <module>
    app.run(main)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "natural-questions/make_test_data.py", line 53, in main
    n_threads=FLAGS.num_threads)
  File "XXX/natural-questions/eval_utils.py", line 303, in read_annotation
    dict_list = pool.map(read_annotation_from_one_split, input_paths)
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 290, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 683, in get
    raise self._value

Specifically, it fails at line 264 of eval_utils.py when iterating over the gold file.

Can you help me solve this issue?

Thanks a lot!

bozheng-hit commented 5 years ago

I ran into the same problem. The following changes may work:

Replace all the iteritems() with items() in nq_eval.py.
Change open(gzipped_input_file) to open(gzipped_input_file, "rb") in eval_utils.py.
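
For context on why the "rb" change helps: the byte 0x8b in the traceback is the second byte of the gzip magic number (0x1f 0x8b). On Python 3, a file opened in text mode is decoded before GzipFile ever sees the bytes, and the decode fails. Below is a minimal sketch of the binary-mode pattern. The helper names read_jsonl_gz and text_mode_fails are made up for illustration and are not part of eval_utils.py.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Read a gzipped JSON Lines file through a binary-mode file handle."""
    records = []
    # "rb" hands raw bytes to GzipFile, so the gzip magic number
    # (0x1f 0x8b) is never run through a text decoder.
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw) as gz:
        for line in gz:
            records.append(json.loads(line.decode("utf-8")))
    return records


def text_mode_fails(path):
    """Return True if reading via a UTF-8 text-mode handle raises UnicodeDecodeError."""
    try:
        with open(path, encoding="utf-8") as raw, gzip.GzipFile(fileobj=raw) as gz:
            gz.readline()
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # Build a tiny gzipped JSONL file (hypothetical content) and read it both ways.
    sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write(json.dumps({"example_id": 1}) + "\n")
    print(read_jsonl_gz(sample_path))
    print(text_mode_fails(sample_path))
```

The iteritems() change is a separate Python 3 issue: dict.iteritems() exists only in Python 2, and items() is the equivalent call on Python 3.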

valsworthen commented 5 years ago

Indeed this fixes the problem, thanks.

google-research-datasets / natural-questions
