huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

KeyError: 'cardinality' while running Trainer #26632

Closed jinzzasol closed 1 year ago

jinzzasol commented 1 year ago

System Info

Google Colab

Who can help?

No response

Reproduction

I encountered this issue while running the model below. The dataset is the IMDB movie review dataset from Kaggle.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in __getattr__(self, item)
    265         try:
--> 266             return self.data[item]
    267         except KeyError:

KeyError: 'cardinality'

During handling of the above exception, another exception occurred:

Below is the code:

Data Cleaning

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the reviews and encode the sentiment labels as integers
imdb = pd.read_csv("IMDB Dataset.csv")
df_imdb = imdb.copy()
df_imdb = df_imdb.replace({'sentiment': {'positive': 1, 'negative': 0}})
df_imdb.drop_duplicates(keep='first', inplace=True)

# 60/20/20 train/validation/test split
train, test = train_test_split(df_imdb, test_size=0.4, shuffle=False)
val, test = train_test_split(test, test_size=0.5, shuffle=True)

train = train.reset_index(drop=True)
val = val.reset_index(drop=True)
test = test.reset_index(drop=True)

train.shape, val.shape, test.shape

Preprocessing

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Convert the text and label columns to numpy arrays
x_train = train['review'].values
x_val = val['review'].values
x_test = test['review'].values

y_train = train['sentiment'].values
y_val = val['sentiment'].values
y_test = test['sentiment'].values

# Tokenize datasets
train_tokenized = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tokenized = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tokenized = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)  # No loss argument: the model computes its loss internally

from transformers import TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir="./sentiment_model",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=500,  # Adjust as needed
    save_total_limit=2,
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
)

trainer.train()  # **<- where the error occurred**

Expected behavior

trainer.train() should run.

ydshieh commented 1 year ago

Hi, thank you for opening the issue.

Could you please share your system info with us? You can run the command `transformers-cli env` and copy-paste its output below.

jinzzasol commented 1 year ago

@ydshieh Sorry, but I'm using Google Colab and I'm not able to run commands there. The terminal is a Pro feature.

ydshieh commented 1 year ago

Just type `!transformers-cli env`, no?

Otherwise share the colab notebook maybe?

jinzzasol commented 1 year ago

Ok, I just found out that I had to install transformers again before running it. Here is the output:

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.34.0
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): 2.13.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.4 (cpu)
- Jax version: 0.4.16
- JaxLib version: 0.4.16
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

I0000 00:00:1696604507.367394    1255 tfrt_cpu_pjrt_client.cc:352] TfrtCpuClient destroyed.

ydshieh commented 1 year ago

Thank you.

@Rocketknight1 could you take a look here?

jinzzasol commented 1 year ago

@ydshieh Just to let you know, I ran the same code on my local machine and encountered the same issue. Below is the system info.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.33.3
- Platform: Linux-5.15.90.2-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.3.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): 2.14.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Rocketknight1 commented 1 year ago

Woah, this is a blast from the past! TFTrainer is very old and completely deprecated now, and we don't support it anymore. That also explains your exact error: TFTrainer expected a tf.data.Dataset (it reads the dataset's cardinality internally), so passing the tokenizer's BatchEncoding output fails with KeyError: 'cardinality'. We generally advise people to just use the Keras API for TF.

You can keep most of your code the same up to the model.compile() line, and then on the next line I'd just do something like this:

model.fit(train_tokenized, y_train, validation_data=(val_tokenized, y_val), epochs=3)

For more info on training Hugging Face models with TF, please see our TensorFlow Philosophy post, or any of the Keras documentation, particularly the docs on supported dataset types and model.fit() - you can find them here.
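
If you'd rather feed model.fit() a tf.data.Dataset, a minimal sketch using the variable names from your snippet might look like this (the dict() call around the tokenizer output matters, as comes up below):

import tensorflow as tf

# Wrap features and labels in a tf.data.Dataset, one of the input types
# model.fit() supports; dict() unwraps the tokenizer's BatchEncoding.
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_tokenized), y_train)).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_tokenized), y_val)).batch(32)

model.fit(train_ds, validation_data=val_ds, epochs=3)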

ydshieh commented 1 year ago

I forgot it's TFTrainer. Sorry @Rocketknight1!

Rocketknight1 commented 1 year ago

No problem, it was a nice nostalgia moment!

jinzzasol commented 1 year ago

@Rocketknight1 @ydshieh Thank you all. I was following a post I found and did not know TFTrainer was deprecated.

jinzzasol commented 1 year ago

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-021f97abfd29> in <cell line: 1>()
----> 1 model.fit(train_tokenized, y_train, validation_data=(val_tokenized, y_val), epochs=3)

3 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/core/function/polymorphism/function_type.py in __hash__(self)
    144 
    145   def __hash__(self):
--> 146     return hash((self.name, self.kind, self.optional, self.type_constraint))
    147 
    148   def __repr__(self):

ValueError: Cannot generate a hashable key for IteratorSpec(({'input_ids': TensorSpec(shape=(None, 128), dtype=tf.int32, name=None), 'attention_mask': TensorSpec(shape=(None, 128), dtype=tf.int32, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None)),) because the _serialize() method returned an unsupproted value of type <class 'transformers.tokenization_utils_base.BatchEncoding'>

I ran model.fit() and got the error message above. I think this is a different topic, but any ideas?

Rocketknight1 commented 1 year ago

Yeah, that's a regular problem we have! Just do `train_tokenized = dict(train_tokenized)` before passing the data to model.fit() - the tokenizer returns a dict-like BatchEncoding that Keras doesn't quite understand.

One day I'll figure out a cleaner solution for it, but I'll probably have to slip a couple of shims into Keras's methods!
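
In full, the fixed call would look something like this (same variable names as before):

# Unwrap the BatchEncoding objects into plain dicts so Keras can trace them
train_features = dict(train_tokenized)
val_features = dict(val_tokenized)

model.fit(train_features, y_train, validation_data=(val_features, y_val), epochs=3)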

jinzzasol commented 1 year ago

Thank you!