google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

UnicodeDecodeError at the stage of fine-tuning a pre-trained SQA model #146

Closed Otax-kaz closed 2 years ago

Otax-kaz commented 2 years ago

I have been studying table question answering and became interested in your research, so I tried to reproduce the experiment.

I'm using an Ubuntu container in Docker. The package installation and the tox run have completed. I downloaded the SQA dataset and the tapas_sqa_inter_masklm_tiny_reset model and tried fine-tuning, but an error occurred.

Comparing against other people's runs that I found online, I don't see any particular difference in my setup. Also, the dataset's tsv files don't seem to contain any unusual characters.

This is the command I executed.↓

python3 tapas/run_task_main.py \
  --task="SQA" \
  --input_dir="data/SQA_Release_1" \
  --output_dir="output_dir" \
  --bert_vocab_file="tapas_sqa_inter_masklm_tiny_reset/vocab.txt" \
  --mode="create_data"

This is the error that occurred.↓

Instructions for updating:
non-resource variables are not supported in the long term
Creating interactions ...
I1110 12:53:56.770750 140229047478016 run_task_main.py:192] Creating interactions ...
Traceback (most recent call last):
  File "tapas/run_task_main.py", line 908, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tapas/run_task_main.py", line 861, in main
    task_utils.create_interactions(task, FLAGS.input_dir, output_dir,
  File "/workspace/tapas/utils/task_utils.py", line 171, in create_interactions
    sqa_utils.create_interactions(
  File "/workspace/tapas/utils/sqa_utils.py", line 182, in create_interactions
    interaction_dict = _read_interactions(input_dir)
  File "/workspace/tapas/utils/sqa_utils.py", line 46, in _read_interactions
    interactions = interaction_utils.read_from_tsv_file(file_handle)
  File "/workspace/tapas/utils/interaction_utils.py", line 86, in read_from_tsv_file
    for row in csv.DictReader(file_handle, delimiter='\t'):
  File "/opt/conda/lib/python3.8/csv.py", line 110, in __next__
    self.fieldnames
  File "/opt/conda/lib/python3.8/csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 211, in __next__
    return self.next()
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 205, in next
    retval = self.readline()
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 170, in readline
    return self._prepare_value(self._read_buf.readline())
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 93, in _prepare_value
    return compat.as_str_any(val)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/util/compat.py", line 139, in as_str_any
    return as_str(value)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/util/compat.py", line 118, in as_str
    return as_text(bytes_or_text, encoding)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/util/compat.py", line 109, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte

I'm sorry if this is hard to follow. I would appreciate any help.

SyrineKrichene commented 2 years ago

Hi,

I tried to reproduce the error but couldn't. Can you check whether your input files (the SQA data) were downloaded correctly? Also, can you check which file or files trigger this error, for example by putting them into the input directory one at a time?

Can you also find the character that cannot be read? When I downloaded some public datasets (not SQA) on a different machine type, the escape character ended up encoded incorrectly. I suspect that one character in your data is badly encoded. If you can identify it, you can replace it with the correct one.

Thanks, Syrine
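
For anyone hitting the same traceback, the check suggested above (which file fails, and at which byte) can be automated. The following is only a minimal sketch, not part of TAPAS: the function name `find_undecodable_files` is made up for illustration, and it assumes the input directory should be scanned recursively, hidden files included. It decodes each file as a whole, so the reported offset is relative to the start of the file, whereas the traceback reports a position within the line being read.

```python
from pathlib import Path


def find_undecodable_files(input_dir, encoding="utf-8"):
    """Return (path, offset, byte) for every file that fails to decode.

    Hypothetical helper for illustration; it is not a TAPAS function.
    """
    problems = []
    # Path.rglob("*") also yields dotfiles such as "._train.tsv".
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that could not be decoded.
            problems.append((path, err.start, data[err.start]))
    return problems


if __name__ == "__main__":
    # Directory taken from the command in the original report.
    for path, offset, byte in find_undecodable_files("data/SQA_Release_1"):
        print(f"{path}: byte 0x{byte:02x} at offset {offset}")
```

Any file listed by this script would raise a UnicodeDecodeError when read as UTF-8 text, so it is a reasonable first candidate to inspect or move out of the input directory.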


Otax-kaz commented 2 years ago

Thanks for your answer.

While checking how the input files are processed, I found that the program was also reading a hidden file. The hidden file is named "._{original filename}", and its encoding seems to be "Windows-1252". After I removed this file from "input_dir", the program ran without error.

I had overlooked the fact that I downloaded the SQA files on a macOS machine. I'm sorry for the confusion caused by my not checking this.

I appreciate it very much!!
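
Files named "._{original filename}" are most likely AppleDouble sidecar files, which macOS writes next to regular files when data is copied to a filesystem or archive that cannot store its extended metadata. They are binary, so reading them as UTF-8 text fails. Below is a minimal sketch of the cleanup described above, assuming every "._*" file under the input directory is such a sidecar and is safe to delete. The helper names are hypothetical and not part of TAPAS, and the default is a dry run that only lists the files.

```python
from pathlib import Path


def find_appledouble_files(input_dir):
    """List files whose names start with '._' anywhere under input_dir."""
    return sorted(p for p in Path(input_dir).rglob("._*") if p.is_file())


def remove_appledouble_files(input_dir, dry_run=True):
    """Delete '._*' files and return the paths that were found.

    With dry_run=True (the default) nothing is deleted.
    Hypothetical helper for illustration; it is not a TAPAS function.
    """
    found = find_appledouble_files(input_dir)
    for path in found:
        if not dry_run:
            path.unlink()
    return found


if __name__ == "__main__":
    # Directory taken from the command in the original report.
    for path in remove_appledouble_files("data/SQA_Release_1", dry_run=True):
        print(f"would remove: {path}")
```

Reviewing the dry-run output first, then calling the function again with `dry_run=False`, avoids deleting anything that merely happens to share the prefix.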