ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Train] Provide a list of models for people to choose from in the HF transformer example #36837

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

Description

I tried two transformer models from the HF Hub with the example script, and neither of them worked.

(base)  ray@g-784b96e5cffee0001:~/default$ /home/ray/anaconda3/bin/python /home/ray/default/test-9.py --model_name_or_path gpt2 --task_name cola
2023-06-26 15:12:52.363657: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 15:12:52.528325: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-26 15:12:53.352464: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-06-26 15:12:53.352612: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-06-26 15:12:53.352629: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 725.87it/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 3.61MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 0.99M/0.99M [00:00<00:00, 15.5MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 446k/446k [00:00<00:00, 9.27MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29M/1.29M [00:00<00:00, 19.4MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 523M/523M [00:04<00:00, 115MB/s]
loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /home/ray/.cache/huggingface/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
All model checkpoint weights were used when initializing GPT2ForSequenceClassification.

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 56.10ba/s]
Running tokenizer on dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 117.81ba/s]
Running tokenizer on dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 118.41ba/s]
[15:13:04] INFO     Sample 4314 of the training set: {'input_ids': [1026, 3088, 284, 6290, 13], 'attention_mask': [1, 1, 1, 1, 1], 'labels': 0}.                                         test-9.py:428
           INFO     Sample 5772 of the training set: {'input_ids': [23865, 15063, 48241, 351, 257, 24556, 290, 2269, 585, 494, 750, 523, 1165, 13], 'attention_mask': [1, 1, 1, 1, 1, 1, test-9.py:428
                    1, 1, 1, 1, 1, 1, 1, 1], 'labels': 1}.                                                                                                                                            
           INFO     Sample 5763 of the training set: {'input_ids': [42493, 33577, 2630, 257, 3734, 3348, 319, 36079, 313, 721, 13], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], test-9.py:428
                    'labels': 1}.                                                                                                                                                                     
/home/ray/anaconda3/lib/python3.8/site-packages/accelerate/accelerator.py:499: FutureWarning: The `use_fp16` property is deprecated and will be removed in version 1.0 of Accelerate use `Accelerator.mixed_precision == 'fp16'` instead.
  warnings.warn(
/home/ray/anaconda3/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
           INFO     ***** Running training *****                                                                                                                                         test-9.py:519
           INFO       Num examples = 8551                                                                                                                                                test-9.py:520
           INFO       Num Epochs = 3                                                                                                                                                     test-9.py:521
           INFO       Instantaneous batch size per device = 8                                                                                                                            test-9.py:522
           INFO       Total train batch size (w. parallel, distributed & accumulation) = 8                                                                                               test-9.py:526
           INFO       Gradient Accumulation steps = 1                                                                                                                                    test-9.py:530
           INFO       Total optimization steps = 3207                                                                                                                                    test-9.py:531
  0%|                                                                                                                                                                        | 0/3207 [00:00<?, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /home/ray/default/test-9.py:629 in <module>                                                      │
│                                                                                                  │
│   626                                                                                            │
│   627                                                                                            │
│   628 if __name__ == "__main__":                                                                 │
│ ❱ 629 │   main()                                                                                 │
│ /home/ray/default/test-9.py:625 in main                                                          │
│                                                                                                  │
│   622 │                                                                                          │
│   623 │   else:                                                                                  │
│   624 │   │   # Run training locally.                                                            │
│ ❱ 625 │   │   train_func(config)                                                                 │
│   626                                                                                            │
│   627                                                                                            │
│   628 if __name__ == "__main__":                                                                 │
│                                                                                                  │
│ /home/ray/default/test-9.py:540 in train_func                                                    │
│                                                                                                  │
│   537 │                                                                                          │
│   538 │   for epoch in range(args.num_train_epochs):                                             │
│   539 │   │   model.train()                                                                      │
│ ❱ 540 │   │   for step, batch in enumerate(train_dataloader):                                    │
│   541 │   │   │   outputs = model(**batch)                                                       │
│   542 │   │   │   loss = outputs.loss                                                            │
│   543 │   │   │   loss = loss / args.gradient_accumulation_steps                                 │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/accelerate/data_loader.py:377 in __iter__        │
│                                                                                                  │
│   374 │   │   dataloader_iter = super().__iter__()                                               │
│   375 │   │   # We iterate one batch ahead to check when we are at the end                       │
│   376 │   │   try:                                                                               │
│ ❱ 377 │   │   │   current_batch = next(dataloader_iter)                                          │
│   378 │   │   except StopIteration:                                                              │
│   379 │   │   │   yield                                                                          │
│   380                                                                                            │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py:628 in __next__   │
│                                                                                                  │
│    625 │   │   │   if self._sampler_iter is None:                                                │
│    626 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    627 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  628 │   │   │   data = self._next_data()                                                      │
│    629 │   │   │   self._num_yielded += 1                                                        │
│    630 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    631 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py:671 in _next_data │
│                                                                                                  │
│    668 │                                                                                         │
│    669 │   def _next_data(self):                                                                 │
│    670 │   │   index = self._next_index()  # may raise StopIteration                             │
│ ❱  671 │   │   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration              │
│    672 │   │   if self._pin_memory:                                                              │
│    673 │   │   │   data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)            │
│    674 │   │   return data                                                                       │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py:61 in fetch     │
│                                                                                                  │
│   58 │   │   │   │   data = [self.dataset[idx] for idx in possibly_batched_index]                │
│   59 │   │   else:                                                                               │
│   60 │   │   │   data = self.dataset[possibly_batched_index]                                     │
│ ❱ 61 │   │   return self.collate_fn(data)                                                        │
│   62                                                                                             │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/data/data_collator.py:247 in        │
│ __call__                                                                                         │
│                                                                                                  │
│    244 │   return_tensors: str = "pt"                                                            │
│    245 │                                                                                         │
│    246 │   def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:                 │
│ ❱  247 │   │   batch = self.tokenizer.pad(                                                       │
│    248 │   │   │   features,                                                                     │
│    249 │   │   │   padding=self.padding,                                                         │
│    250 │   │   │   max_length=self.max_length,                                                   │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:2836 in  │
│ pad                                                                                              │
│                                                                                                  │
│   2833 │   │   │   │   encoded_inputs[key] = to_py_obj(value)                                    │
│   2834 │   │                                                                                     │
│   2835 │   │   # Convert padding_strategy in PaddingStrategy                                     │
│ ❱ 2836 │   │   padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(     │
│   2837 │   │   │   padding=padding, max_length=max_length, verbose=verbose                       │
│   2838 │   │   )                                                                                 │
│   2839                                                                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:2372 in  │
│ _get_padding_truncation_strategies                                                               │
│                                                                                                  │
│   2369 │   │                                                                                     │
│   2370 │   │   # Test if we have a padding token                                                 │
│   2371 │   │   if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or sel  │
│ ❱ 2372 │   │   │   raise ValueError(                                                             │
│   2373 │   │   │   │   "Asking to pad but the tokenizer does not have a padding token. "         │
│   2374 │   │   │   │   "Please select a token to use as `pad_token` `(tokenizer.pad_token = tok  │
│   2375 │   │   │   │   "or add a new pad token via `tokenizer.add_special_tokens({'pad_token':   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via 
`tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
  0%|                                                                                                                                                                        | 0/3207 [00:00<?, ?it/s]
(base)  ray@g-784b96e5cffee0001:~/default$ /home/ray/anaconda3/bin/python /home/ray/default/test-9.py --model_name_or_path finbert --task_name cola
2023-06-26 15:16:59.542883: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 15:16:59.708840: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-26 15:17:00.545921: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-06-26 15:17:00.546040: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-06-26 15:17:00.546054: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 916.12it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py:601 in       │
│ _get_config_dict                                                                                 │
│                                                                                                  │
│   598 │   │                                                                                      │
│   599 │   │   try:                                                                               │
│   600 │   │   │   # Load from URL or cache if already cached                                     │
│ ❱ 601 │   │   │   resolved_config_file = cached_path(                                            │
│   602 │   │   │   │   config_file,                                                               │
│   603 │   │   │   │   cache_dir=cache_dir,                                                       │
│   604 │   │   │   │   force_download=force_download,                                             │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/utils/hub.py:282 in cached_path     │
│                                                                                                  │
│    279 │                                                                                         │
│    280 │   if is_remote_url(url_or_filename):                                                    │
│    281 │   │   # URL, so get it from the cache (downloading if necessary)                        │
│ ❱  282 │   │   output_path = get_from_cache(                                                     │
│    283 │   │   │   url_or_filename,                                                              │
│    284 │   │   │   cache_dir=cache_dir,                                                          │
│    285 │   │   │   force_download=force_download,                                                │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/utils/hub.py:545 in get_from_cache  │
│                                                                                                  │
│    542 │   │   │   │   │   │   " to False."                                                      │
│    543 │   │   │   │   │   )                                                                     │
│    544 │   │   │   │   else:                                                                     │
│ ❱  545 │   │   │   │   │   raise ValueError(                                                     │
│    546 │   │   │   │   │   │   "Connection error, and we cannot find the requested files in the  │
│    547 │   │   │   │   │   │   " Please try again or make sure your Internet connection is on."  │
│    548 │   │   │   │   │   )                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /home/ray/default/test-9.py:629 in <module>                                                      │
│                                                                                                  │
│   626                                                                                            │
│   627                                                                                            │
│   628 if __name__ == "__main__":                                                                 │
│ ❱ 629 │   main()                                                                                 │
│ /home/ray/default/test-9.py:625 in main                                                          │
│                                                                                                  │
│   622 │                                                                                          │
│   623 │   else:                                                                                  │
│   624 │   │   # Run training locally.                                                            │
│ ❱ 625 │   │   train_func(config)                                                                 │
│   626                                                                                            │
│   627                                                                                            │
│   628 if __name__ == "__main__":                                                                 │
│                                                                                                  │
│ /home/ray/default/test-9.py:322 in train_func                                                    │
│                                                                                                  │
│   319 │   #                                                                                      │
│   320 │   # In distributed training, the .from_pretrained methods guarantee that                 │
│   321 │   # only one local process can concurrently download model & vocab.                      │
│ ❱ 322 │   config = AutoConfig.from_pretrained(                                                   │
│   323 │   │   args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name     │
│   324 │   )                                                                                      │
│   325 │   tokenizer = AutoTokenizer.from_pretrained(                                             │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:6 │
│ 80 in from_pretrained                                                                            │
│                                                                                                  │
│   677 │   │   kwargs["_from_auto"] = True                                                        │
│   678 │   │   kwargs["name_or_path"] = pretrained_model_name_or_path                             │
│   679 │   │   trust_remote_code = kwargs.pop("trust_remote_code", False)                         │
│ ❱ 680 │   │   config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path,   │
│   681 │   │   if "auto_map" in config_dict and "AutoConfig" in config_dict["auto_map"]:          │
│   682 │   │   │   if not trust_remote_code:                                                      │
│   683 │   │   │   │   raise ValueError(                                                          │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py:553 in       │
│ get_config_dict                                                                                  │
│                                                                                                  │
│   550 │   │   """                                                                                │
│   551 │   │   original_kwargs = copy.deepcopy(kwargs)                                            │
│   552 │   │   # Get config dict associated with the base config file                             │
│ ❱ 553 │   │   config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwar   │
│   554 │   │                                                                                      │
│   555 │   │   # That config file may point us toward another config file to use.                 │
│   556 │   │   if "configuration_files" in config_dict:                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.8/site-packages/transformers/configuration_utils.py:634 in       │
│ _get_config_dict                                                                                 │
│                                                                                                  │
│   631 │   │   │   │   f"There was a specific connection error when trying to load {pretrained_   │
│   632 │   │   │   )                                                                              │
│   633 │   │   except ValueError:                                                                 │
│ ❱ 634 │   │   │   raise EnvironmentError(                                                        │
│   635 │   │   │   │   f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load thi   │
│   636 │   │   │   │   f"files and it looks like {pretrained_model_name_or_path} is not the pat   │
│   637 │   │   │   │   f"{configuration_file} file.\nCheckout your internet connection or see h   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like finbert is not the path to a directory 
containing a config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Link

No response

scottsun94 commented 1 year ago

cc: @matthewdeng @woshiyyya

matthewdeng commented 1 year ago

Do you have a repro?

woshiyyya commented 1 year ago

@matthewdeng If I am not wrong, Huaiwei randomly picked a model from the HF model hub (finbert here):

python examples/transformers/transformers_example.py --model_name_or_path=finbert --task_name=cola

The problem is that we test this example with one specific set of arguments (bert-base-cased + mrpc), which works perfectly. But when switching to other models, there are some unknown issues (e.g. the tokenizer's pad token is unset, a delta model cannot be downloaded, ...).
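For the gpt2 failure, a minimal sketch of the workaround that the `ValueError` itself suggests: reuse the EOS token as the pad token. `ensure_pad_token` is a hypothetical helper, not part of the example script, and the `SimpleNamespace` objects are stand-ins for a real tokenizer and model config so the snippet runs without downloading anything. Copying the id onto the model config is an assumption about what a sequence-classification head needs once batches are padded.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer, model_config):
    """Hypothetical helper: fall back to the EOS token when no pad token is set."""
    if tokenizer.pad_token is None:
        if tokenizer.eos_token is None:
            raise ValueError("tokenizer has neither a pad token nor an EOS token")
        # Same fix the error message proposes: tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token = tokenizer.eos_token
        # Assumption: the model config should know the pad id as well.
        model_config.pad_token_id = tokenizer.eos_token_id
    return tokenizer, model_config


# Stand-ins that mimic a GPT-2 style tokenizer/config (no pad token defined).
stub_tokenizer = SimpleNamespace(
    pad_token=None, eos_token="<|endoftext|>", eos_token_id=50256
)
stub_config = SimpleNamespace(pad_token_id=None)

ensure_pad_token(stub_tokenizer, stub_config)
print(stub_tokenizer.pad_token, stub_config.pad_token_id)
```

In the real script this would be called right after the tokenizer and config are loaded and before the data collator is built. Tokenizers that already define a pad token (such as bert-base-cased) pass through unchanged.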

We could probably make our testing configs the defaults, so that users can try out the example smoothly.
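A rough sketch of what that could look like, assuming the script parses its flags with `argparse`. The flag names come from the commands above. `TESTED_MODELS`, `build_parser`, and `parse_args` are hypothetical names, and the tested list contains only the one combination mentioned in this thread.

```python
import argparse
import warnings

# Only the combination mentioned in this thread; extend as more are verified.
TESTED_MODELS = ("bert-base-cased",)
TESTED_TASKS = ("mrpc",)


def build_parser():
    """Build a parser whose defaults match the tested configuration."""
    parser = argparse.ArgumentParser(description="HF Transformers example (sketch)")
    parser.add_argument(
        "--model_name_or_path",
        default=TESTED_MODELS[0],
        help="Tested models: " + ", ".join(TESTED_MODELS),
    )
    parser.add_argument(
        "--task_name",
        default=TESTED_TASKS[0],
        help="Tested tasks: " + ", ".join(TESTED_TASKS),
    )
    return parser


def parse_args(argv=None):
    """Parse arguments and warn (rather than fail) on untested models."""
    args = build_parser().parse_args(argv)
    if args.model_name_or_path not in TESTED_MODELS:
        warnings.warn(
            f"{args.model_name_or_path!r} is not in the tested list "
            f"{TESTED_MODELS}; the example may need changes to run."
        )
    return args


# With no flags, the tested configuration is used.
args = parse_args([])
print(args.model_name_or_path, args.task_name)
```

Warning instead of rejecting the value keeps the example open to other models while making it clear which ones are known to work.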

scottsun94 commented 1 year ago

@woshiyyya is right. I just ran the HF example in the doc with the arguments pasted above. https://docs.ray.io/en/latest/train/examples/transformers/transformers_example.html

matthewdeng commented 1 year ago

Oh, got it. There is a README here, but overall the example is outdated. We should revise it to bring it up to date with the original Transformers example and/or push the example down the list in the docs, since it's not as nice or informative as the notebook examples.