Train supervised model - GitHub Issues

liuleiBUAA commented 2 years ago

When I run run_sup_example.sh, the code gets stuck at the step below and uses only 2 GPUs (I have 4):

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading cached processed dataset at ./data/csv/default-5c8c01abeb2e7fe5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-45b85a8c6af7082e.arrow
[INFO|trainer.py:441] 2022-03-14 16:47:35,367 >> The following columns in the training set don't have a corresponding argument in BertForCL.forward and have been ignored: .
[INFO|trainer.py:358] 2022-03-14 16:47:35,368 >> Using amp fp16 backend

When I run run_unsup_example.sh, the error is:

RuntimeError: Input tensor at index 3 has invalid shape [14, 14], but expected [14, 17]

When I change the launch command in run_unsup_example.sh from python train.py \ to python -m torch.distributed.launch --nproc_per_node $NUM_GPU --master_port $PORT_ID train.py (the same as in run_sup_example.sh)

The error is:
Traceback (most recent call last):
  File "train.py", line 585, in <module>
    main()
  File "train.py", line 310, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/")
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/load.py", line 590, in load_dataset
    module_path, hash = prepare_module(
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/load.py", line 267, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 334, in cached_path
    output_path = get_from_cache(
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 617, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.2.1/datasets/text/text.py
Traceback (most recent call last):
  File "train.py", line 585, in <module>
    main()
  File "train.py", line 310, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir="./data/")
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/load.py", line 590, in load_dataset
    module_path, hash = prepare_module(
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/load.py", line 267, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 334, in cached_path
    output_path = get_from_cache(
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 617, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.2.1/datasets/text/text.py
Traceback (most recent call last):
  File "/home/liulei741/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/liulei741/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/liulei741/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode

and then it gets stuck at:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading cached processed dataset at ./data/text/default-f485599cbd14a27e/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-347bc3c0aa9f736e.arrow
Loading cached processed dataset at ./data/text/default-f485599cbd14a27e/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-347bc3c0aa9f736e.arrow
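
The ConnectionError in the traceback above is raised while fetching the text.py loader script, so one quick thing to rule out is whether that URL is reachable from the training machine at all. Below is a minimal sketch of such a check using only the standard library. The function name can_reach and the opener parameter are hypothetical helpers introduced here for illustration; they are not part of the datasets library or of train.py.

```python
import urllib.error
import urllib.request

# URL copied from the ConnectionError message in the traceback.
LOADER_URL = (
    "https://raw.githubusercontent.com/huggingface/datasets/"
    "1.2.1/datasets/text/text.py"
)


def can_reach(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if `url` answers with a 2xx status, False on any network error.

    `opener` is a hypothetical hook so the check can be exercised
    without real network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # URLError covers DNS failures, refused connections and HTTP errors;
        # OSError covers socket timeouts.
        return False
```

Calling can_reach(LOADER_URL) on the training machine before launching would show whether the failure is a plain connectivity problem (proxy, firewall, no outbound access) rather than something specific to the distributed launch.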

gaotianyu1350 commented 2 years ago

Hi,

It might be a caching issue with the dataset. Can you try (1) deleting the cache directory as suggested in the error message, or (2) trying a single GPU first?
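
For option (1), here is a rough sketch of clearing the cached dataset directories, assuming the layout visible in the logs above (./data/csv/default-<hash>/... and ./data/text/default-<hash>/...). The helper name clear_dataset_cache is made up for this example and is not something the repo or the datasets library provides.

```python
import shutil
from pathlib import Path


def clear_dataset_cache(cache_root, builder):
    """Delete every cached dataset directory for one builder ("csv" or "text").

    Hypothetical helper: `cache_root` is the directory given as cache_dir
    to load_dataset ("./data/" in the traceback), `builder` is the
    sub-directory seen in the log paths. Returns the removed directories.
    """
    builder_dir = Path(cache_root) / builder
    removed = []
    if not builder_dir.is_dir():
        # Nothing cached for this builder, so there is nothing to delete.
        return removed
    for entry in sorted(builder_dir.iterdir()):
        # Each entry is expected to look like default-<hash>/ and to hold
        # the cache-*.arrow files mentioned in the log.
        if entry.is_dir():
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

For the supervised script that would be clear_dataset_cache("./data", "csv"), and for the unsupervised one clear_dataset_cache("./data", "text"). The dataset is then re-processed on the next run, which takes longer but starts from a clean cache.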

liuleiBUAA commented 2 years ago

/data/csv/

Thank you, (2) has been fixed. But after deleting the cache directory (default-5c8c01abeb2e7fe5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/), the problem is still there.

gaotianyu1350 commented 2 years ago

Hi,

For

RuntimeError: Input tensor at index 3 has invalid shape [14, 14], but expected [14, 17]

Can you provide more information? Like the whole error stack?

Mrsun0 commented 2 years ago

In the file "model.py", a parameter (*model_args) of the BertForCL init function does not seem to be used. Is it a bug?

gaotianyu1350 commented 2 years ago

Did you use the original train.py file provided by the repo? There should be a model_args variable.
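
To illustrate how a positional *model_args can go unread while a model_args value is still used, here is a toy sketch. It assumes the training script passes its arguments object under the keyword model_args; the class ToyModelForCL and the dictionary values are invented for this example and are not the repo's code.

```python
class ToyModelForCL:
    """Stand-in for a model class; this is NOT the real BertForCL."""

    def __init__(self, config, *model_args, **model_kwargs):
        # Positional extras (*model_args) are accepted but never read,
        # which is what makes the parameter look unused.
        self.config = config
        # Assumption: the caller supplies the keyword "model_args",
        # so the value actually arrives through **model_kwargs.
        self.model_args = model_kwargs["model_args"]


# Hypothetical values, only to show the call shape.
toy = ToyModelForCL({"hidden_size": 768}, model_args={"temp": 0.05})
```

Under this assumption, a train.py that does not pass the model_args keyword would fail with a KeyError in the constructor, which would be a reason to check that the original train.py is in use.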

princeton-nlp / SimCSE

Train supervised model #148