google-research-datasets / clang8

cLang-8 is a dataset for grammatical error correction.

pandas load clang8_source_target_en.spacy_tokenized.tsv file issue #16

Closed j1ajunzhu closed 10 months ago

j1ajunzhu commented 10 months ago

Issue Title: Difficulty Reading Large TSV File with pandas due to Tokenization Error

Description

I'm trying to read a large TSV file into a pandas DataFrame, but I'm encountering a tokenization error in the process. The file in question is clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv, which is confirmed to have 2,372,119 rows, yet not all rows are read successfully.

I checked the TSV with wc -l, and it reports the expected number of lines:

wc -l /workspace/clang8/targets/clang8_en.detokenized.tsv
2372119 /workspace/clang8/targets/clang8_en.detokenized.tsv
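
For a check that is independent of pandas, the sketch below (assuming the tokenized file has the two-column source/target layout) counts lines that do not contain exactly one tab, which shows whether the file itself or the parser is at fault:

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'

total = 0
bad = 0
with open(file_path, encoding='utf-8') as f:
    for line in f:
        total += 1
        # A well-formed row has exactly one tab separating source and target.
        if line.rstrip('\n').count('\t') != 1:
            bad += 1
print(total, 'lines,', bad, 'with an unexpected field count')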

Code to Reproduce

import pandas as pd

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
data = pd.read_csv(file_path, sep='\t', encoding='utf-8')
print(len(data))

Expected Behavior

The expected outcome is to have a DataFrame with 2,372,119 rows, each corresponding to a row in the TSV file.

Actual Behavior

The process is interrupted by a ParserError, and not all rows are loaded into the DataFrame. The error message is as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/root/miniconda3/envs/nanoT5/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 3955, saw 3
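
To see what the parser trips over, it helps to print the raw lines around the one named in the error. This is only a diagnostic sketch; because pandas consumes the first row as a header by default, the reported number can be off by one relative to the file, so a small window is printed rather than a single line:

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
reported = 3955  # line number taken from the ParserError above

with open(file_path, encoding='utf-8') as f:
    for i, line in enumerate(f, start=1):
        if reported - 2 <= i <= reported + 2:
            print(i, repr(line))  # repr() makes stray tabs and quotes visible
        elif i > reported + 2:
            break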

Environment

j1ajunzhu commented 10 months ago

Attempted Solutions

I was able to read the file by telling pandas there is no header row and disabling the quote character:

import pandas as pd

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# header=None keeps the first data row from being consumed as a header;
# quotechar="\0" makes the parser treat quote characters as ordinary text.
data = pd.read_csv(file_path, sep='\t', encoding='UTF-8', header=None, quotechar="\0")
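
The likely cause is that some sentences contain bare double quotes, which the default C parser interprets as field quoting and then mis-counts fields; a NUL quote character effectively turns that handling off (this is inferred from the error, not confirmed here). An equivalent, arguably more explicit, variant uses csv.QUOTE_NONE; the column names below are only illustrative:

import csv
import pandas as pd

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# quoting=csv.QUOTE_NONE disables quote handling entirely, so any "
# characters in the sentences are read as ordinary data.
data = pd.read_csv(file_path, sep='\t', encoding='utf-8', header=None,
                   names=['source', 'target'], quoting=csv.QUOTE_NONE)
print(len(data))  # should report 2,372,119 rows (the count mentioned above)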

The issue can now be closed.