Closed: j1ajunzhu closed this issue 1 year ago.
Initially, I attempted to read the file with the following code:
import pandas as pd
file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# quotechar="\0" points the parser at a character that should never occur in
# the text, which effectively disables quote handling for this file
data = pd.read_csv(file_path, sep='\t', encoding='UTF-8', header=None, quotechar="\0")
The issue can now be closed.
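For reference, an equivalent and more explicit fix (not from the original thread, but standard pandas/csv usage) is to disable quoting outright via the csv module's QUOTE_NONE constant, which read_csv accepts through its quoting parameter; a minimal sketch:

import csv
import pandas as pd

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# csv.QUOTE_NONE makes the parser treat every double quote as ordinary text,
# so a stray quote can no longer open a field and swallow tabs or newlines
data = pd.read_csv(file_path, sep='\t', encoding='UTF-8', header=None,
                   quoting=csv.QUOTE_NONE)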
Issue Title: Difficulty Reading Large TSV File with pandas due to Tokenization Error
Description
I'm trying to read a large TSV file into a pandas DataFrame, but the read is interrupted by a tokenization error. The file in question is clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv, and it is confirmed to have 2,372,119 rows. However, not all rows are read successfully, even though counting the lines of the TSV file directly reports the expected number.
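Such a line count can be confirmed without the CSV parser at all; a minimal sketch (reusing the path from the comment above):

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# a plain line count ignores quoting entirely, so it reflects the true row total
with open(file_path, encoding='UTF-8') as f:
    print(sum(1 for _ in f))  # expected to print 2372119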
Code to Reproduce
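The exact snippet is not shown here; a minimal sketch of the kind of call that triggers the error, assuming pandas' default quoting behavior:

import pandas as pd

file_path = '/workspace/clang8/output_data/clang8_source_target_en.spacy_tokenized.tsv'
# the default quotechar is '"', so bare double quotes in the spaCy-tokenized
# text are interpreted as field quoting and derail the tokenizer
data = pd.read_csv(file_path, sep='\t', encoding='UTF-8', header=None)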
Expected Behavior
The expected outcome is to have a DataFrame with 2,372,119 rows, each corresponding to a row in the TSV file.
Actual Behavior
The process is interrupted by a ParserError, so not all rows are loaded into the DataFrame.
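A plausible cause is that spaCy tokenization leaves bare double-quote tokens in the text, and under default quoting a quote that opens a field makes the parser consume everything up to the next quote as a single value. A minimal, self-contained sketch of the effect on synthetic data (not taken from the actual file):

import csv
import io
import pandas as pd

# three TSV rows; the first field of row one begins with a bare double quote
sample = '" He said hi\tok\nsecond\trow\nthird " quote\trow\n'
# default quoting: the opening quote swallows tabs and newlines up to the
# next quote, so only one mangled row comes back
print(len(pd.read_csv(io.StringIO(sample), sep='\t', header=None)))  # 1
# quoting disabled: all three rows parse as expected
print(len(pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                      quoting=csv.QUOTE_NONE)))  # 3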
Environment