utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0
1.87k stars, 341 forks

DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False #311

Open mathieuchateau opened 2 years ago

mathieuchateau commented 2 years ago

Hello, I am quite new to this topic, so apologies if this is not a real issue.

When loading with BertDataBunch, I got this warning:

lib/python3.9/site-packages/fast_bert/data_cls.py:231: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
  data_df = pd.read_csv(os.path.join(self.data_dir, filename))

I have already run into this kind of warning with pandas in my own code, but with BertDataBunch I can't find a way to set the dtype option. I installed fast-bert yesterday, so I assume it is the latest version.

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='camembert-base',
                          train_file='train_set.csv',
                          val_file='val_set.csv',
                          label_file='labels.txt',
                          text_col='source_clean',
                          label_col=['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=False,
                          multi_label=True,
                          model_type='camembert-base')

mathieuchateau commented 2 years ago

A second warning appears during the same run, at a different line (248):

lib/python3.9/site-packages/fast_bert/data_cls.py:248: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
  data_df = pd.read_csv(os.path.join(self.data_dir, filename))

lingdoc commented 2 years ago

This is related to the format of your data files: inconsistent rows can cause problems when a CSV is read into a pandas DataFrame. I might submit a pull request to allow xlsx files instead, since they handle rows and columns more robustly, but for now one workaround is to make sure every text field in the CSV is surrounded by double quotes (").
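
The quoting workaround can be applied before the files ever reach BertDataBunch. What follows is a minimal sketch, not part of fast-bert: the helper names `mixed_type_columns` and `rewrite_quoted` are hypothetical, and it assumes the text column is `source_clean` as in the call above. It reads the CSV with the text column forced to `str`, reports any columns that still hold more than one Python type, and writes the data back with every field quoted. Note that quoting mainly protects against commas and line breaks inside text fields shifting the columns; pandas can still infer a numeric type from quoted digits, so checking the columns afterwards is still worthwhile.

```python
import csv
import io

import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> list:
    """Return names of columns whose non-null values have more than one Python type."""
    return [
        col for col in df.columns
        if df[col].dropna().map(type).nunique() > 1
    ]


def rewrite_quoted(src, dst, text_col: str) -> pd.DataFrame:
    """Read a CSV with `text_col` forced to str, then write it back fully quoted.

    `src` and `dst` may be file paths or file-like objects.
    """
    # low_memory=False makes pandas infer each column's type from the whole
    # file instead of chunk by chunk, which is what triggers DtypeWarning.
    df = pd.read_csv(src, dtype={text_col: str}, low_memory=False)
    # QUOTE_ALL wraps every field in double quotes, so commas inside the
    # text can no longer be mistaken for column separators.
    df.to_csv(dst, index=False, quoting=csv.QUOTE_ALL)
    return df


# Small in-memory demonstration (the data here is made up for illustration).
raw = io.StringIO('source_clean,aaa\n"hello, world",0\n42,1\n')
out = io.StringIO()
cleaned = rewrite_quoted(raw, out, text_col="source_clean")
print(mixed_type_columns(cleaned))
print(out.getvalue())
```

In practice the same helper would be run once on `train_set.csv` and `val_set.csv`, writing the cleaned copies into the data directory, and BertDataBunch would then be pointed at those copies.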