bug in transformers notebook (training from scratch)? #13632

Closed randomgambit closed 2 years ago

randomgambit commented 3 years ago

Hello there!

First of all, I cannot thank @Rocketknight1 enough for the amazing work he has been doing to create tensorflow versions of the notebooks. On my side, I have spent some time and money (colab pro) trying to tie the notebooks together to create a full classifier from scratch with the following steps:

  1. train the tokenizer
  2. train the language model
  3. train the classification head.

Unfortunately, I run into two issues. You can use the fully working notebook pasted below.

First issue: by training my own tokenizer I actually get a perplexity (225) that is way worse than the example shown when using

model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

This is puzzling as the tokenizer should be fine-tuned to the data used in the original tf2 notebook!

Second, there seems to be a Python issue when I try to fine-tune the language model I obtained above with a text classification head.

Granted, the tokenizer and the underlying language model have been trained on another dataset (the wikipedia dataset from the previous two tf2 notebooks, that is). See . However, I should at least get some valid output! Here the model is complaining about some collate function.

Could you please have a look @sgugger @LysandreJik @Rocketknight1 when you can? I would be very happy to contribute this notebook to the Hugging Face community (although most of the credits go to @Rocketknight1). There is a great demand for building language models and NLP tasks from scratch.


Code below

get the most recent versions

!pip install git+
!pip install  transformers

train tokenizer from scratch

from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
batch_size = 1000

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")

from transformers import BertTokenizerFast
mytokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

causal language from scratch using my own tokenizer mytokenizer

model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_function(examples):
    return mytokenizer(examples["text"], truncation=True)

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

from transformers import create_optimizer, AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(lr=2e-5, weight_decay_rate=0.01)

def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

model.compile(optimizer=optimizer, loss={"loss": dummy_loss})

from transformers import DataCollatorForLanguageModeling

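data_collator = DataCollatorForLanguageModeling(
    tokenizer=mytokenizer, mlm_probability=0.15, return_tensors="tf"
)

train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True, batch_size=16, collate_fn=data_collator,  # assumed; these arguments were lost from the paste
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False, batch_size=16, collate_fn=data_collator,  # assumed; these arguments were lost from the paste
)

model.fit(train_set, validation_data=validation_set, epochs=1)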
import math

eval_results = model.evaluate(validation_set)[0]
print(f"Perplexity: {math.exp(eval_results):.2f}")

and fine-tune on a classification task


task = "sst2"
batch_size = 16

from datasets import load_dataset, load_metric

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)

and now try to classify text

from transformers import AutoTokenizer

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return mytokenizer(examples[sentence1_key], truncation=True)
    return mytokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)

validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)

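tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"], label_cols=["labels"],  # assumed from the official notebook call
    shuffle=True, batch_size=16, collate_fn=mytokenizer.pad,
)
tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"], label_cols=["labels"],  # assumed from the official notebook call
    shuffle=False, batch_size=16, collate_fn=mytokenizer.pad,
)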

from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2

if task == "stsb":
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
elif task.startswith("mnli"):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 2

model = TFAutoModelForSequenceClassification.from_pretrained(
    model, num_labels=num_labels
)

from transformers import create_optimizer

num_epochs = 5

batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)

metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)

import numpy as np

def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

predictions = model.predict(tf_validation_dataset)["logits"]
compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units 
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
VisibleDeprecationWarning                 Traceback (most recent call last)
<ipython-input-42-6eba4122302c> in <module>()
     44     shuffle=True,
     45     batch_size=16,
---> 46     collate_fn=mytokenizer.pad,
     47 )
     48 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(

9 frames
/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in _arrow_array_to_numpy(self, pa_array)
    165             # cast to list of arrays or we end up with a np.array with dtype object
    166             array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167         return np.array(array, copy=False, **self.np_array_kwargs)

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

What do you think? Happy to help if I can. Thanks!!

sgugger commented 3 years ago

For the first issue you are training from scratch a new model versus fine-tuning one that has been pretrained on way more data. It's completely normal that the latter wins. As for the second one, I'm not sure you can directly use the tokenizer.pad method as a collation function.

Note that since you are copying the error messages, you should expand the intermediate frames so we can see where the error comes from.
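For reference on the first issue: the perplexity printed by the notebook is the exponential of the mean per-token loss, so a perplexity of 225 corresponds to a loss of roughly 5.4. A minimal sketch of that relationship (the helper name `perplexity_from_loss` is illustrative and not part of transformers):

```python
import math


def perplexity_from_loss(mean_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


# A loss of ln(225) maps back to the perplexity reported in the first post.
loss = math.log(225)
print(f"loss={loss:.3f} -> perplexity={perplexity_from_loss(loss):.2f}")
```

Because the mapping is exponential, a small gap in loss between a model trained from scratch and a pretrained checkpoint shows up as a large gap in perplexity.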

randomgambit commented 3 years ago

thanks @sgugger could you please clarify what you mean by

As for the second one, I'm not sure you can directly use the tokenizer.pad method as a collation function.

The call

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(

comes directly from the official tf2 notebook

randomgambit commented 3 years ago

expanded error here, thanks!

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units 
{'input_ids': [[2, 11384, 1363, 3215, 1325, 1218, 1125, 10341, 1139, 3464, 3], [2, 4023, 1491, 15755, 16, 1520, 4610, 1128, 13221, 802, 3], [2, 1187, 13755, 1327, 2845, 1142, 18920, 802, 4245, 3168, 7806, 1542, 2569, 3796, 3], [2, 3419, 22353, 13782, 1145, 3802, 1125, 1913, 2493, 3], [2, 1161, 1125, 6802, 11823, 17, 1137, 17, 1125, 17, 1233, 3765, 802, 1305, 18029, 802, 1125, 21157, 1843, 14645, 1280, 1427, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None)
VisibleDeprecationWarning                 Traceback (most recent call last)
<ipython-input-56-ddb32272e3ba> in <module>()
     47     shuffle=True,
     48     batch_size=16,
---> 49     collate_fn=mytokenizer.pad,
     50 )
     51 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(

9 frames
/usr/local/lib/python3.7/dist-packages/datasets/ in to_tf_dataset(self, columns, batch_size, shuffle, drop_remainder, collate_fn, collate_fn_args, label_cols, dummy_labels, prefetch)
    349             return [tf.convert_to_tensor(arr) for arr in out_batch]
--> 351         test_batch = np_get_batch(np.arange(batch_size))
    353         @tf.function(input_signature=[tf.TensorSpec(None, tf.int64)])

/usr/local/lib/python3.7/dist-packages/datasets/ in np_get_batch(indices)
    324         def np_get_batch(indices):
--> 325             batch = dataset[indices]
    326             out_batch = []
    327             if collate_fn is not None:

/usr/local/lib/python3.7/dist-packages/datasets/ in __getitem__(self, key)
   1780             format_columns=self._format_columns,
   1781             output_all_columns=self._output_all_columns,
-> 1782             format_kwargs=self._format_kwargs,
   1783         )

/usr/local/lib/python3.7/dist-packages/datasets/ in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
   1769         pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
   1770         formatted_output = format_table(
-> 1771             pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   1772         )
   1773         return formatted_output

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in format_table(table, key, formatter, format_columns, output_all_columns)
    420     else:
    421         pa_table_to_format = pa_table.drop(col for col in pa_table.column_names if col not in format_columns)
--> 422         formatted_output = formatter(pa_table_to_format, query_type=query_type)
    423         if output_all_columns:
    424             if isinstance(formatted_output, MutableMapping):

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in __call__(self, pa_table, query_type)
    196             return self.format_column(pa_table)
    197         elif query_type == "batch":
--> 198             return self.format_batch(pa_table)
    200     def format_row(self, pa_table: pa.Table) -> RowFormat:

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in format_batch(self, pa_table)
    242     def format_batch(self, pa_table: pa.Table) -> dict:
--> 243         return self.numpy_arrow_extractor(**self.np_array_kwargs).extract_batch(pa_table)

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in extract_batch(self, pa_table)
    153     def extract_batch(self, pa_table: pa.Table) -> dict:
--> 154         return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}
    156     def _arrow_array_to_numpy(self, pa_array: pa.Array) -> np.ndarray:

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in <dictcomp>(.0)
    153     def extract_batch(self, pa_table: pa.Table) -> dict:
--> 154         return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}
    156     def _arrow_array_to_numpy(self, pa_array: pa.Array) -> np.ndarray:

/usr/local/lib/python3.7/dist-packages/datasets/formatting/ in _arrow_array_to_numpy(self, pa_array)
    165             # cast to list of arrays or we end up with a np.array with dtype object
    166             array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167         return np.array(array, copy=False, **self.np_array_kwargs)

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
sgugger commented 3 years ago

I'm sure @Rocketknight1 will know what's going on here :-)

randomgambit commented 3 years ago

waiting for @Rocketknight1 then! Thanks

randomgambit commented 3 years ago

@Rocketknight1 @sgugger interestingly running the same notebook today (with the new pip install that is) returns another error

Not sure what the issue is this time... Any ideas? Thanks!

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Sentence: hide new secretions from the parental units 
{'input_ids': [[2, 11384, 1363, 3215, 1325, 1218, 1125, 10341, 1139, 3464, 3], [2, 4023, 1491, 15755, 16, 1520, 4610, 1128, 13221, 798, 3], [2, 1187, 13755, 1327, 2845, 1142, 18920, 798, 4245, 3168, 7806, 1542, 2569, 3796, 3], [2, 3419, 22351, 13782, 1145, 3802, 1125, 1913, 2493, 3], [2, 1161, 1125, 6802, 11823, 17, 1137, 17, 1125, 17, 1233, 3765, 798, 1305, 18030, 798, 1125, 21156, 1843, 14645, 1280, 1427, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
68/68 [00:04<00:00, 20.16ba/s]
1/1 [00:00<00:00, 10.70ba/s]
2/2 [00:00<00:00, 13.42ba/s]
Columns added by tokenizer: ['token_type_ids', 'input_ids', 'attention_mask']
ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None)
/usr/local/lib/python3.7/dist-packages/datasets/formatting/ VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(array, copy=False, **self.np_array_kwargs)
404 Client Error: Not Found for url:
HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/ in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    553                 use_auth_token=use_auth_token,
--> 554                 user_agent=user_agent,
    555             )

6 frames
/usr/local/lib/python3.7/dist-packages/transformers/ in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1409             use_auth_token=use_auth_token,
-> 1410             local_files_only=local_files_only,
   1411         )

/usr/local/lib/python3.7/dist-packages/transformers/ in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   1573             r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
-> 1574             r.raise_for_status()
   1575             etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")

/usr/local/lib/python3.7/dist-packages/requests/ in raise_for_status(self)
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url:

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-6-ddb32272e3ba> in <module>()
     74 model = TFAutoModelForSequenceClassification.from_pretrained(
---> 75     model, num_labels=num_labels
     76 )

/usr/local/lib/python3.7/dist-packages/transformers/models/auto/ in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    395         if not isinstance(config, PretrainedConfig):
    396             config, kwargs = AutoConfig.from_pretrained(
--> 397                 pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
    398             )
    399         if hasattr(config, "auto_map") and cls.__name__ in config.auto_map:

/usr/local/lib/python3.7/dist-packages/transformers/models/auto/ in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    525         """
    526         kwargs["_from_auto"] = True
--> 527         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    528         if "model_type" in config_dict:
    529             config_class = CONFIG_MAPPING[config_dict["model_type"]]

/usr/local/lib/python3.7/dist-packages/transformers/ in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    568                 msg += f"- or '{revision}' is a valid git identifier (branch name, a tag name, or a commit id) that exists for this model name as listed on its model page on ''\n\n"
--> 570             raise EnvironmentError(msg)
    572         except json.JSONDecodeError:

OSError: Can't load config for '<transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM object at 0x7f1f29039850>'. Make sure that:

- '<transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM object at 0x7f1f29039850>' is a correct model identifier listed on ''

- or '<transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM object at 0x7f1f29039850>' is the correct path to a directory containing a config.json file
Rocketknight1 commented 3 years ago

Hi @randomgambit, sorry for the lengthy delay in replying again! I'm still making changes to some of the lower-level parts of the library, so these notebooks haven't been fully finalized yet.

The VisibleDeprecationWarning in your first post is something that will hopefully be fixed by upcoming changes to datasets, but for now you can just ignore it.
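The warning appears when rows of different lengths (such as the 10-, 15- and 23-token `input_ids` rows printed earlier in the thread) are stacked into a single array. A rough pure-Python sketch of what padding a batch to a common length involves; `pad_batch` is an illustrative helper, not the datasets or transformers implementation, and `pad_id=1` is an assumption based on the special-token order used when the tokenizer was trained:

```python
from typing import Dict, List


def pad_batch(rows: List[List[int]], pad_id: int = 1) -> Dict[str, List[List[int]]]:
    """Right-pad every row to the length of the longest row in the batch."""
    max_len = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (max_len - len(row)) for row in rows]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


ragged = [[2, 11384, 1363, 3], [2, 4023, 3], [2, 1187, 13755, 1327, 2845, 3]]
batch = pad_batch(ragged)
# Every row now has the same length, so the batch is rectangular.
print([len(row) for row in batch["input_ids"]])
```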

The error you're getting in your final post is, I think, caused by you overwriting the variable model in your code. The from_pretrained() method expects a string like bert-base-cased, but it seems like you've created an actual TF model with that variable name. If you pass an actual model object to from_pretrained() it'll get very confused - so make sure that whatever argument you're passing there is a string and not something else!
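One way to catch this kind of mix-up early is to validate the argument before handing it to from_pretrained(). A minimal sketch, assuming only that the argument should be a model identifier or a directory path; `checkpoint_or_raise` is a hypothetical helper, not a transformers function:

```python
import os


def checkpoint_or_raise(name_or_path: object) -> str:
    """Return the checkpoint as a path string, or fail loudly if it is not a name or path."""
    if isinstance(name_or_path, (str, os.PathLike)):
        return os.fspath(name_or_path)
    raise TypeError(
        f"expected a model identifier or directory path, got {type(name_or_path).__name__}"
    )


print(checkpoint_or_raise("bert-base-cased"))
```

Passing a model object instead of a string would then raise a short TypeError at the call site rather than the long 404 traceback shown above.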

randomgambit commented 3 years ago

thanks @Rocketknight1, super useful as usual. So what you are saying is that I should have saved my tokenizer mytokenizer and my language model model using save_pretrained(), and then I need to load the model with a classification head using TFAutoModelForSequenceClassification, right?


model = TFAutoModelForSequenceClassification.from_pretrained(
    'mymodel', num_labels=num_labels
)

This seems to work. I will try to adapt the code so that both the tokenizer training and the language model training are performed on the dataset actually used in the classification task (dataset = load_dataset("glue", "sst2")). Do you mind having a look when I'm done? This will be a super useful notebook for everyone.


randomgambit commented 3 years ago

@Rocketknight1 @sgugger I can confirm the new TF notebook works beautifully! Thanks! Just a follow-up though: I tried to fine-tune a longformer model and everything works smoothly until the model.fit call, where I get a cryptic error message.

This is the model I use:

task = "sst2"
model_checkpoint = "allenai/longformer-large-4096"
batch_size = 16

and then you can run the default notebook until you reach the end

Epoch 1/3
TypeError                                 Traceback (most recent call last)
<ipython-input-28-4075d9d9fb81> in <module>()
      3     tf_train_dataset,
      4     validation_data=tf_validation_dataset,
----> 5     epochs=3)

9 frames
/usr/local/lib/python3.7/dist-packages/keras/engine/ in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1182                 _r=1):
   1183               callbacks.on_train_batch_begin(step)
-> 1184               tmp_logs = self.train_function(iterator)
   1185               if data_handler.should_sync:
   1186                 context.async_wait()

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in __call__(self, *args, **kwds)
    884       with OptionalXlaContext(self._jit_compile):
--> 885         result = self._call(*args, **kwds)
    887       new_tracing_count = self.experimental_get_tracing_count()

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in _call(self, *args, **kwds)
    922       # In this case we have not created variables on the first call. So we can
    923       # run the first trace but we should fail if variables are created.
--> 924       results = self._stateful_fn(*args, **kwds)
    925       if self._created_variables and not ALLOW_DYNAMIC_VARIABLE_CREATION:
    926         raise ValueError("Creating variables on a non-first call to a function"

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in __call__(self, *args, **kwargs)
   3036     with self._lock:
   3037       (graph_function,
-> 3038        filtered_flat_args) = self._maybe_define_function(args, kwargs)
   3039     return graph_function._call_flat(
   3040         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in _maybe_define_function(self, args, kwargs)
   3458               call_context_key in self._function_cache.missed):
   3459             return self._define_function_with_shape_relaxation(
-> 3460                 args, kwargs, flat_args, filtered_flat_args, cache_key_context)
   3462           self._function_cache.missed.add(call_context_key)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in _define_function_with_shape_relaxation(self, args, kwargs, flat_args, filtered_flat_args, cache_key_context)
   3381     graph_function = self._create_graph_function(
-> 3382         args, kwargs, override_flat_arg_shapes=relaxed_arg_shapes)
   3383     self._function_cache.arg_relaxed[rank_only_cache_key] = graph_function

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   3306             arg_names=arg_names,
   3307             override_flat_arg_shapes=override_flat_arg_shapes,
-> 3308             capture_by_value=self._capture_by_value),
   3309         self._function_attributes,
   3310         function_spec=self.function_spec,

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes, acd_record_initial_resource_uses)
   1005         _, original_func = tf_decorator.unwrap(python_func)
-> 1007       func_outputs = python_func(*func_args, **func_kwargs)
   1009       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/ in wrapped_fn(*args, **kwds)
    666         # the function a weak reference to itself to avoid a reference cycle.
    667         with OptionalXlaContext(compile_with_xla):
--> 668           out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    669         return out

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ in wrapper(*args, **kwargs)
    992           except Exception as e:  # pylint:disable=broad-except
    993             if hasattr(e, "ag_error_metadata"):
--> 994               raise e.ag_error_metadata.to_exception(e)
    995             else:
    996               raise

TypeError: in user code:

    /usr/local/lib/python3.7/dist-packages/keras/engine/ train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/transformers/models/longformer/ call  *
        inputs["global_attention_mask"] = tf.tensor_scatter_nd_update(
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/ wrapper  **
        return target(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ tensor_scatter_nd_update
        tensor=tensor, indices=indices, updates=updates, name=name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ tensor_scatter_update
        updates=updates, name=name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ _apply_op_helper

    TypeError: Input 'updates' of 'TensorScatterUpdate' Op has type int32 that does not match type int64 of argument 'tensor'.

Maybe there is something specific to longformer that does not work well with the current notebook? What do you all think?


randomgambit commented 3 years ago

@Rocketknight1 I know you are busy (and I cannot thank you enough for the magnificent TF notebooks!) but I wanted to let you know that I also have tried with allenai/longformer-base-4096 and I am getting the same int64 error. Please let me know if I can do anything to help you out.


randomgambit commented 3 years ago

Hi @Rocketknight1 I hope all is well!

I now wonder if longformer can be trained at all with this notebook. Indeed, I read that "This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the Model Hub as long as that model has a version with a classification head."

If so, could you please tell me which TF notebook I need to adapt to make it work? Thanks!!

jmwoloso commented 2 years ago

Have you found any solution @randomgambit? Running into this myself.

jmwoloso commented 2 years ago

i'll try passing in zeros cast to int32 to the global_attention_mask param to fit and see if that helps. the tf.zeros_like used by transformers to generate the mask (when none are passed in by the user) must default to int64?

jmwoloso commented 2 years ago

@randomgambit try the opposite of what I said above. You need to cast your input_ids to tf.int32. something like this should work:

input_ids = tf.convert_to_tensor([tf.convert_to_tensor(row, dtype=tf.int32) 
                                  for row in input_ids], dtype=tf.int32)

it would probably work via equivalent numpy methods, but I haven't tried that yet. the default dtype for tf.zeros_like is tf.int32 (transformers makes global_attention_mask using tf.zeros_like for you if you don't pass it in).

you could probably also create the global_attention_mask yourself as dtype tf.int64. point being i think they all just need to be the same type.
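The underlying requirement is that the tensors combined in a single op share one integer dtype. As far as the TensorFlow API goes, tf.zeros_like keeps the dtype of its input unless a dtype is passed explicitly, so int64 `input_ids` would yield an int64 mask. A small NumPy sketch of forcing every integer feature in a batch to the same dtype; `unify_int_dtype` is an illustrative helper, NumPy stands in for TensorFlow here, and the sample token ids are made up:

```python
import numpy as np


def unify_int_dtype(features, dtype=np.int32):
    """Cast every integer (or bool) array in a feature dict to the same dtype."""
    out = {}
    for name, values in features.items():
        arr = np.asarray(values)
        if np.issubdtype(arr.dtype, np.integer) or arr.dtype == bool:
            arr = arr.astype(dtype)
        out[name] = arr  # non-integer features are passed through unchanged
    return out


batch = {
    "input_ids": np.array([[0, 713, 2]], dtype=np.int64),
    "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
}
casted = unify_int_dtype(batch)
print({name: arr.dtype for name, arr in casted.items()})
```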

we can probably close this @Rocketknight1

randomgambit commented 2 years ago

thanks @jmwoloso, I initially didn't see your message. I am hoping @Rocketknight1 can just confirm all is good before closing... Thanks!

ichenjia commented 2 years ago

Ran into the same problem. I am totally lost.

Here is what I did

import numpy as np
import tensorflow as tf

my_dict = {'text': ["random text 1", "random text 2", "random text 3"], 'label': [0, 0, 1]}

from datasets import Dataset

dataset = Dataset.from_dict(my_dict)

from transformers import LongformerTokenizer, TFLongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

def tokenize_function(examples):
    r = tokenizer(examples["text"], padding="max_length", truncation=True)
    r['input_ids'] = [tf.convert_to_tensor(row, dtype=tf.int32) for row in r['input_ids']]
    r['attention_mask'] = [tf.convert_to_tensor(row, dtype=tf.int32) for row in r['attention_mask']]
    return r

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets.shuffle(seed=42)

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

model.fit(tf_train_dataset, batch_size=1)


@randomgambit and @jmwoloso any ideas?

jmwoloso commented 2 years ago

@ichenjia There were a few errors mentioned throughout this thread. Which one are you seeing?

ichenjia commented 2 years ago

Thank you. It’s the last error related to int32 and int64

jmwoloso commented 2 years ago

@ichenjia Did you try my solution of casting your input_ids to tf.int32?

ichenjia commented 2 years ago

@ichenjia Did you try my solution of casting your input_ids to tf.int32?

Thank you. Here is what I did, per the earlier tip from this thread:

r['input_ids'] = [tf.convert_to_tensor(row, dtype=tf.int32) for row in r['input_ids']]
r['attention_mask'] = [tf.convert_to_tensor(row, dtype=tf.int32) for row in r['attention_mask']]

I did this in the tokenizer function mapped to the dataset and still got that int32 error. Did I do something wrong?

ichenjia commented 2 years ago


After reading the source code of Dataset, I think the problem is in the to_tf_dataset function, which calls _get_output_signature (lines 290-303):

            if np.issubdtype(np_arrays[0].dtype, np.integer) or np_arrays[0].dtype == bool:
                tf_dtype = tf.int64
                np_dtype = np.int64
            elif np.issubdtype(np_arrays[0].dtype, np.number):
                tf_dtype = tf.float32
                np_dtype = np.float32
            elif np_arrays[0].dtype.kind == "U":  # Unicode strings
                np_dtype = np.unicode_
                tf_dtype = tf.string
            else:
                raise RuntimeError(
                    f"Unrecognized array dtype {np_arrays[0].dtype}. \n"
                    "Nested types and image/audio types are not supported yet."
                )

It forces tf.int64 instead of tf.int32. It doesn't look like we have any control over it from outside the API.
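To make the effect of that branch concrete, here is a small sketch that reproduces only its NumPy side. The helper name inferred_np_dtype is hypothetical and not part of datasets, and np.str_ is used in place of np.unicode_, which newer NumPy releases no longer provide.

```python
import numpy as np


def inferred_np_dtype(array: np.ndarray):
    """Hypothetical helper mirroring the dtype branch quoted above."""
    if np.issubdtype(array.dtype, np.integer) or array.dtype == bool:
        return np.int64  # every integer (and bool) column is widened to int64
    elif np.issubdtype(array.dtype, np.number):
        return np.float32  # every other numeric column becomes float32
    elif array.dtype.kind == "U":  # Unicode strings
        return np.str_
    raise RuntimeError(f"Unrecognized array dtype {array.dtype}.")


# An int32 column still maps to int64, whatever cast was applied beforehand.
print(inferred_np_dtype(np.zeros(3, dtype=np.int32)))
```

Under this logic, an int32 cast applied in the tokenize function would not survive the trip through to_tf_dataset, which is consistent with the error persisting.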

jmwoloso commented 2 years ago

There are always more layers, it seems @ichenjia :) I think we definitely have some control, or at least a way to hack it to prove the theory (thanks Python!). Could you try something like the code below as a temporary workaround to see if it solves it?

I haven't looked at the source extensively, but maybe as a permanent fix we could add some dtype checking in _get_output_signature of the dataset in order to preserve what is passed in, but I'd defer to the HF crew on what, if anything, could/should be done assuming this hack works.

But until then, maybe this will help. We can try overriding that private method. (Also, to get the markdown formatting to show as a script, enclose your code with 3 backticks instead of 1).

*Edit was to fix formatting

import types
from random import sample
from typing import Callable, List, Optional

import numpy as np
import tensorflow as tf
from datasets import config


def _get_output_signature(
    dataset: "Dataset",
    collate_fn: Callable,
    collate_fn_args: dict,
    cols_to_retain: Optional[List[str]] = None,
    batch_size: Optional[int] = None,
    num_test_batches: int = 10,
):
    """Private method used by `to_tf_dataset()` to find the shapes and dtypes of samples from this dataset
       after being passed through the collate_fn. Tensorflow needs an exact signature for tf.numpy_function, so
       the only way to do this is to run test batches - the collator may add or rename columns, so we can't figure
       it out just by inspecting the dataset.

    Args:
        dataset (:obj:`Dataset`): Dataset to load samples from.
        collate_fn(:obj:`bool`): Shuffle the dataset order when loading. Recommended True for training, False for
            validation/evaluation.
        collate_fn(:obj:`Callable`): A function or callable object (such as a `DataCollator`) that will collate
            lists of samples into a batch.
        collate_fn_args (:obj:`Dict`): A `dict` of keyword arguments to be passed to the
            `collate_fn`.
        batch_size (:obj:`int`, optional): The size of batches loaded from the dataset. Used for shape inference.
            Can be None, which indicates that batch sizes can be variable.

    Returns:
        :obj:`dict`: Dict mapping column names to tf.Tensorspec objects
        :obj:`dict`: Dict mapping column names to np.dtype objects
    """
    if config.TF_AVAILABLE:
        import tensorflow as tf
    else:
        raise ImportError("Called a Tensorflow-specific function but Tensorflow is not installed.")

    if len(dataset) == 0:
        raise ValueError("Unable to get the output signature because the dataset is empty.")
    if batch_size is None:
        test_batch_size = min(len(dataset), 8)
    else:
        batch_size = min(len(dataset), batch_size)
        test_batch_size = batch_size

    # Run a few random test batches through the collator to observe its outputs
    test_batches = []
    for _ in range(num_test_batches):
        indices = sample(range(len(dataset)), test_batch_size)
        test_batch = dataset[indices]
        if cols_to_retain is not None:
            test_batch = {
                key: value
                for key, value in test_batch.items()
                if key in cols_to_retain or key in ("label_ids", "label")
            }
        test_batch = [{key: value[i] for key, value in test_batch.items()} for i in range(test_batch_size)]
        test_batch = collate_fn(test_batch, **collate_fn_args)
        test_batches.append(test_batch)

    tf_columns_to_signatures = {}
    np_columns_to_dtypes = {}
    for column in test_batches[0].keys():
        raw_arrays = [batch[column] for batch in test_batches]
        # In case the collate_fn returns something strange
        np_arrays = []
        for array in raw_arrays:
            if isinstance(array, np.ndarray):
                np_arrays.append(array)
            elif isinstance(array, tf.Tensor):
                np_arrays.append(array.numpy())
            else:
                np_arrays.append(np.array(array))

        if np.issubdtype(np_arrays[0].dtype, np.integer) or np_arrays[0].dtype == bool:
            tf_dtype = tf.int32  # formerly tf.int64
            np_dtype = np.int32  # formerly np.int64
        elif np.issubdtype(np_arrays[0].dtype, np.number):
            tf_dtype = tf.float32
            np_dtype = np.float32
        elif np_arrays[0].dtype.kind == "U":  # Unicode strings
            np_dtype = np.unicode_
            tf_dtype = tf.string
        else:
            raise RuntimeError(
                f"Unrecognized array dtype {np_arrays[0].dtype}. \n"
                "Nested types and image/audio types are not supported yet."
            )
        shapes = [array.shape for array in np_arrays]
        static_shape = []
        for dim in range(len(shapes[0])):
            sizes = set([shape[dim] for shape in shapes])
            if dim == 0:
                static_shape.append(batch_size)
                continue
            if len(sizes) == 1:  # This dimension looks constant
                static_shape.append(sizes.pop())
            else:  # Use None for variable dimensions
                static_shape.append(None)
        tf_columns_to_signatures[column] = tf.TensorSpec(shape=static_shape, dtype=tf_dtype)
        np_columns_to_dtypes[column] = np_dtype

    return tf_columns_to_signatures, np_columns_to_dtypes

my_dict = {'text': ["random text 1", "random text 2", "random text 3"],
           'label': [0, 0, 1]}

from datasets import Dataset

dataset = Dataset.from_dict(my_dict)

from transformers import LongformerTokenizer, TFLongformerForSequenceClassification
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

def tokenize_function(examples):
    r = tokenizer(examples["text"], padding="max_length", truncation=True)
    r['input_ids'] = [tf.convert_to_tensor(row, dtype=tf.int32)
                      for row in r['input_ids']]
    r['attention_mask'] = [tf.convert_to_tensor(row, dtype=tf.int32)
                           for row in r['attention_mask']]
    return r

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets.shuffle(seed=42)

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

# override the method on the dataset instance that to_tf_dataset is called on
small_train_dataset._get_output_signature = types.MethodType(_get_output_signature, small_train_dataset)

tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

model.fit(tf_train_dataset, batch_size=1)
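
The override trick used in the script relies on types.MethodType, which binds a plain function to one specific object. A toy sketch of the mechanism, with a made-up Pipeline class standing in for the dataset (none of these names come from datasets or transformers):

```python
import types


class Pipeline:
    """Hypothetical stand-in for an object whose private method we want to patch."""

    def _integer_dtype(self):
        return "int64"

    def describe(self):
        # Looks the method up through the instance, so an instance-level override wins.
        return f"integers are emitted as {self._integer_dtype()}"


def _integer_dtype_int32(self):
    return "int32"


patched = Pipeline()
untouched = Pipeline()

# Bind the replacement to this one instance only; the class is left alone.
patched._integer_dtype = types.MethodType(_integer_dtype_int32, patched)

print(patched.describe())
print(untouched.describe())
```

The override only takes effect on the exact object it was attached to, and only for calls that go through that object, so it has to be applied to the dataset on which to_tf_dataset is called, before the call is made.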

Rocketknight1 commented 2 years ago

Hi @jmwoloso @ichenjia, sorry for only seeing this now! Just to clarify, are you encountering difficulties passing tf.int64 values to TFLongformer? You're correct that the to_tf_dataset and prepare_tf_dataset methods cast all int outputs to tf.int64, but this is because our policy is that our models should always accept tf.int64 for any integer tensor inputs. If you're encountering issues with that, it's more likely a bug in Longformer than in to_tf_dataset!

jmwoloso commented 2 years ago

Hi @Rocketknight1 thanks for the reply. That all makes sense. This thread has kind of morphed, but I believe you solved the original issue which dealt with trying to pass ragged tensors to the model.

The next issue that came up from that was that the TensorScatterUpdate op in TF expects tf.int32 inputs (according to the traceback) but was getting tf.int64. That originates in the module when the global_attention_mask is created.

I can take a look and see if there is anything to be done in that longformer file, but this seems like a lower-level TF op issue to me. But you are the TF scape-GOAT around here, so I'll defer to your guidance/wisdom :)

Rocketknight1 commented 2 years ago

Hi @jmwoloso, the code for TFLongformer was indeed using lots of tf.int32, which it shouldn't. Our tests weren't picking that up for some reason - I'll have to investigate that later. For now, can you try the PR and let me know if it fixes your issues? You can install from the PR branch with pip install --upgrade git+

jmwoloso commented 2 years ago

Thanks @Rocketknight1! @ichenjia see if that solves your issue.

Hi @jmwoloso, the code for TFLongformer was indeed using lots of tf.int32, which it shouldn't. Our tests weren't picking that up for some reason - I'll have to investigate that later. For now, can you try the PR and let me know if it fixes your issues? You can install from the PR branch with pip install --upgrade git+

ichenjia commented 2 years ago

Thank you @Rocketknight1 and @jmwoloso for the clear explanation. Your check-in does solve the int32 issue. However, I think the check-in may have brought in another issue.

My understanding is that the global_attention_mask is calculated at run-time when it is not provided; it is also marked as Optional in the API.

So when I call model.fit with batch_size=1, the following line is called:

longformer/ call * global_attention_mask = tf.cast(global_attention_mask, tf.int64)

and the following error occurred

python3.8/site-packages/tensorflow/python/framework/ make_tensor_proto raise ValueError("None values not supported.")

ValueError: None values not supported.

I am guessing global_attention_mask was forcefully cast even though None was provided.

Is that a correct understanding?
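
The guess above amounts to an ordering problem: an optional argument is cast before the code checks whether it was supplied. A plain-Python sketch of the guarded ordering, assuming a list-based stand-in for the tensor cast and a default mask with global attention on the first token only (both function names are hypothetical, not transformers APIs):

```python
from typing import List, Optional


def cast_mask(mask: Optional[List[int]]) -> Optional[List[int]]:
    """Hypothetical stand-in for a tensor cast: coerce every entry to int."""
    if mask is None:  # guard first, so a missing mask is passed through untouched
        return None
    return [int(v) for v in mask]


def resolve_global_attention_mask(mask, input_ids):
    mask = cast_mask(mask)
    if mask is None and input_ids is not None:
        # Assumed default: attend globally to the first token only.
        mask = [1] + [0] * (len(input_ids) - 1)
    return mask


print(resolve_global_attention_mask(None, [0, 5, 2]))
print(resolve_global_attention_mask([True, False, False], [0, 5, 2]))
```

Without the guard in cast_mask, the None case would fail inside the cast before the default could ever be built, which matches the shape of the error reported here.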

jmwoloso commented 2 years ago

@ichenjia can you try explicitly passing in the global_attention_mask? I believe it ends up just being constructed on the fly with the tf.zeros_like method, so maybe you could try that to get you unstuck?

ichenjia commented 2 years ago

@ichenjia can you try explicitly passing in the global_attention_mask? I believe it ends up just being constructed on the fly with the tf.zeros_like method, so maybe you could try that to get you unstuck?

Thank you @jmwoloso

I manually created a global attention mask in the tokenizer function:

from transformers import LongformerTokenizer, TFLongformerForSequenceClassification
import tensorflow as tf
import pickle
import numpy as np
from transformers import DefaultDataCollator

my_dict = {'text': ["random text 1", "randome text 2", "beautiful randome text 3"],
           'label': [0, 0, 1]}

from datasets import Dataset

dataset = Dataset.from_dict(my_dict)

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

def tokenize_function(examples):
    r = tokenizer(examples["text"], padding="max_length", truncation=True)
    return r

tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = tokenized_datasets.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids", 'global_attention_mask'],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

model = TFLongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', num_labels=2)
model.fit(tf_train_dataset, batch_size=1)

It immediately produced an OOM error:

ResourceExhaustedError: OOM when allocating tensor with shape[12,16,196864] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:StridedSlice] name: tf_longformer_for_sequence_classification/longformer/encoder/layer_._5/attention/self/strided_slice/

I have a Titan RTX with 24GB of VRAM on that GPU. How much RAM does this need? Am I doing something wrong again?
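
For a rough sense of scale, the tensor named in the error can be sized by hand. The sketch below assumes that "type float" means 32-bit floats (4 bytes per element); the shape is copied from the message.

```python
import math

shape = (12, 16, 196864)  # copied from the ResourceExhaustedError message
bytes_per_float32 = 4     # assumption: "type float" means 32-bit floats

num_elements = math.prod(shape)
size_bytes = num_elements * bytes_per_float32
size_mib = size_bytes / 2**20

print(f"{num_elements:,} elements, {size_mib:.1f} MiB")
# prints "37,797,888 elements, 144.2 MiB"
```

If that assumption holds, the tensor that failed to allocate is only about 144 MiB, far below 24GB, so the error more likely reflects memory already consumed by earlier allocations than the size of this one tensor.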

jmwoloso commented 2 years ago

ahhh...Longformer is pretty chunky, that's for sure. Have you tried BigBird (google/bigbird-roberta-base) by chance @ichenjia?

jmwoloso commented 2 years ago

That doesn't solve this particular issue, but while we look into fixing it, I'm assuming your need is to handle longer sequence lengths than the typical Bert-like models are pre-trained on.

ichenjia commented 2 years ago

ahhh...Longformer is pretty chunky, that's for sure. Have you tried BigBird (google/bigbird-roberta-base) by chance @ichenjia?

Thanks! I have not tried it because it only supports Torch, not TF, right?

ichenjia commented 2 years ago

You are talking about


jmwoloso commented 2 years ago

Yeah, you're right... I assumed the TF flavor of BigBird would have been the easiest lift to implement, but maybe not. Can you revert back @Rocketknight1's PR and run it again, but post the entire output/traceback so I can take a look, @ichenjia?

EDIT: I mean use his PR again and try running your script again without explicitly making and passing in the global_attention_mask and post the output/traceback here and I can probably get you a fix.

ichenjia commented 2 years ago

Thank you for trying to get to the bottom of it. Here is the code I ran:

from transformers import LongformerTokenizer, TFLongformerForSequenceClassification
import tensorflow as tf
import pickle
import numpy as np
from transformers import DefaultDataCollator

my_dict = {'text': ["random text 1", "randome text 2", "beautiful randome text 3"],
           'label': [0, 0, 1]}

from datasets import Dataset

dataset = Dataset.from_dict(my_dict)

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

def tokenize_function(examples):
    r = tokenizer(examples["text"], padding="max_length", truncation=True)
    return r

tokenized_datasets = dataset.map(tokenize_function, batched=True)

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = tokenized_datasets.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids", 'global_attention_mask'],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

model = TFLongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', num_labels=2)
model.fit(tf_train_dataset, batch_size=1)

and here is the traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-4-fccdbd4c6c6d> in <module>
      5     metrics=tf.metrics.SparseCategoricalAccuracy(),
      6 )
----> 7, batch_size=1)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/engine/ in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1182                 _r=1):
   1183               callbacks.on_train_batch_begin(step)
-> 1184               tmp_logs = self.train_function(iterator)
   1185               if data_handler.should_sync:
   1186                 context.async_wait()

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/engine/ in train_function(iterator)
    851       def train_function(iterator):
    852         """Runs a training execution with one step."""
--> 853         return step_function(self, iterator)
    855     else:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/engine/ in step_function(model, iterator)
    841       data = next(iterator)
--> 842       outputs =, args=(data,))
    843       outputs = reduce_per_replica(
    844           outputs, self.distribute_strategy, reduction='first')

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/distribute/ in run(***failed resolving arguments***)
   1284       fn = autograph.tf_convert(
   1285           fn, autograph_ctx.control_status_ctx(), convert_by_default=False)
-> 1286       return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
   1288   def reduce(self, reduce_op, value, axis):

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/distribute/ in call_for_each_replica(self, fn, args, kwargs)
   2847       kwargs = {}
   2848     with self._container_strategy().scope():
-> 2849       return self._call_for_each_replica(fn, args, kwargs)
   2851   def _call_for_each_replica(self, fn, args, kwargs):

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/distribute/ in _call_for_each_replica(self, fn, args, kwargs)
   3630   def _call_for_each_replica(self, fn, args, kwargs):
   3631     with ReplicaContext(self._container_strategy(), replica_id_in_sync_group=0):
-> 3632       return fn(*args, **kwargs)
   3634   def _reduce_to(self, reduce_op, value, destinations, options):

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/autograph/impl/ in wrapper(*args, **kwargs)
    595   def wrapper(*args, **kwargs):
    596     with ag_ctx.ControlStatusCtx(status=ag_ctx.Status.UNSPECIFIED):
--> 597       return func(*args, **kwargs)
    599   if inspect.isfunction(func) or inspect.ismethod(func):

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/engine/ in run_step(data)
    834       def run_step(data):
--> 835         outputs = model.train_step(data)
    836         # Ensure counter is updated only if `train_step` succeeds.
    837         with tf.control_dependencies(_minimum_control_deps(outputs)):

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/transformers/ in train_step(self, data)
   1390         # Run forward pass.
   1391         with tf.GradientTape() as tape:
-> 1392             y_pred = self(x, training=True)
   1393             if self._using_dummy_loss:
   1394                 loss = self.compiled_loss(y_pred.loss, y_pred.loss, sample_weight, regularization_losses=self.losses)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/engine/ in __call__(self, *args, **kwargs)
   1035         with autocast_variable.enable_auto_cast_variables(
   1036             self._compute_dtype_object):
-> 1037           outputs = call_fn(inputs, *args, **kwargs)
   1039         if self._activity_regularizer:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/transformers/ in run_call_with_unpacked_inputs(self, *args, **kwargs)
    406         unpacked_inputs = input_processing(func, config, **fn_args_and_kwargs)
--> 407         return func(self, **unpacked_inputs)
    409     # Keras enforces the first layer argument to be passed, and checks it through `inspect.getfullargspec()`. This

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/transformers/models/longformer/ in call(self, input_ids, attention_mask, head_mask, token_type_ids, position_ids, global_attention_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict, labels, training)
   2389             global_attention_mask = tf.convert_to_tensor(global_attention_mask, dtype=tf.int64)
   2390         else:
-> 2391             global_attention_mask = tf.cast(global_attention_mask, tf.int64)
   2393         if global_attention_mask is None and input_ids is not None:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/util/ in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/ops/ in cast(x, dtype, name)
    986       # allows some conversions that cast() can't do, e.g. casting numbers to
    987       # strings.
--> 988       x = ops.convert_to_tensor(x, name="x")
    989       if x.dtype.base_dtype != base_type:
    990         x = gen_math_ops.cast(x, base_type, name=name)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/profiler/ in wrapped(*args, **kwargs)
    161         with Trace(trace_name, **trace_kwargs):
    162           return func(*args, **kwargs)
--> 163       return func(*args, **kwargs)
    165     return wrapped

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1565     if ret is None:
-> 1566       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1568     if ret is NotImplemented:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    344                                          as_ref=False):
    345   _ = as_ref
--> 346   return constant(v, dtype=dtype, name=name)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    281       with trace.Trace("tf.constant"):
    282         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 283     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285   g = ops.get_default_graph()

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    306 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    307   """Creates a constant on the current device."""
--> 308   t = convert_to_eager_tensor(value, ctx, dtype)
    309   if shape is None:
    310     return t

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in convert_to_eager_tensor(value, ctx, dtype)
    104       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    105   ctx.ensure_initialized()
--> 106   return ops.EagerTensor(value, ctx.device_name, dtype)

ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.

ichenjia commented 2 years ago

I don't think that's going to fix the OOM error, right?

jmwoloso commented 2 years ago

Yeah that won't fix that OOM error, but I wanted to see the full stack to help track down what we can do to adjust the base PR to get you unblocked. I'm not at my comp right now but will take a look tomorrow and see how we can adjust to make it work.

Rocketknight1 commented 2 years ago

Hi all, I made a bunch of edits and hopefully things should work more smoothly now! Let me know if the problems remain.

jmwoloso commented 2 years ago

Thanks @Rocketknight1, much appreciated!

jmwoloso commented 2 years ago

Can you try it again @ichenjia?

ichenjia commented 2 years ago

Sorry, I was busy yesterday. Here is what I did:

pip install --upgrade git+

Then ran the same code and still got the error. Did I install from the right branch?

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/transformers/models/longformer/ in call(self, input_ids, attention_mask, head_mask, token_type_ids, position_ids, global_attention_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict, labels, training)
   2389             global_attention_mask = tf.convert_to_tensor(global_attention_mask, dtype=tf.int64)
   2390         else:
-> 2391             global_attention_mask = tf.cast(global_attention_mask, tf.int64)
   2393         if global_attention_mask is None and input_ids is not None:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/util/ in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/ops/ in cast(x, dtype, name)
    986       # allows some conversions that cast() can't do, e.g. casting numbers to
    987       # strings.
--> 988       x = ops.convert_to_tensor(x, name="x")
    989       if x.dtype.base_dtype != base_type:
    990         x = gen_math_ops.cast(x, base_type, name=name)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/profiler/ in wrapped(*args, **kwargs)
    161         with Trace(trace_name, **trace_kwargs):
    162           return func(*args, **kwargs)
--> 163       return func(*args, **kwargs)
    165     return wrapped

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1565     if ret is None:
-> 1566       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1568     if ret is NotImplemented:

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    344                                          as_ref=False):
    345   _ = as_ref
--> 346   return constant(v, dtype=dtype, name=name)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    281       with trace.Trace("tf.constant"):
    282         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 283     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285   g = ops.get_default_graph()

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    306 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    307   """Creates a constant on the current device."""
--> 308   t = convert_to_eager_tensor(value, ctx, dtype)
    309   if shape is None:
    310     return t

~/anaconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/framework/ in convert_to_eager_tensor(value, ctx, dtype)
    104       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    105   ctx.ensure_initialized()
--> 106   return ops.EagerTensor(value, ctx.device_name, dtype)

ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.

Rocketknight1 commented 2 years ago

Hi @ichenjia, the command you ran looks correct but the traceback you pasted refers to an old version of the code. (global_attention_mask = tf.cast(global_attention_mask, tf.int64) is not on line 2391 anymore)

Can you try pip uninstall transformers and then rerunning the command above, and then restarting any Jupyter notebook servers you're running to make sure you're using the PR branch?
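
One way to check which copy of a package a notebook kernel would actually pick up is to ask the standard library for the installed version and the file location. This is a sketch only; the helper name describe_install is made up for illustration.

```python
import importlib.metadata
import importlib.util


def describe_install(package: str) -> dict:
    """Report the installed version and import location of a package, if any."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        version = None  # no installed distribution with this name
    spec = importlib.util.find_spec(package)
    location = spec.origin if spec is not None else None
    return {"version": version, "location": location}


# Run this inside the same kernel that produced the traceback.
print(describe_install("transformers"))
```

If the location printed here differs from the site-packages path shown in the traceback, or the version is not the one expected from the branch install, the kernel is still importing an older copy.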

Rocketknight1 commented 2 years ago

Hey all - I'm going to merge the PR with the fix so that it can be included in the next release of transformers this week. However, if you have further problems, please reopen the issue and let me know!