google-research / pegasus

Apache License 2.0
evaluate.py not using GPU #12

Open sshleifer opened 4 years ago

sshleifer commented 4 years ago

I ran the setup instructions on a preexisting GCP machine with CUDA 10.1 and one modification:

mv ckpt/pegasus_ckpt ckpt2

(The instructions don't work as written: they don't mention the pegasus_ckpt subdirectory, or that you need to point --model_dir to a specific checkpoint file, which is the only way I got evaluate.py to run.)

Then, I ran

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt2/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 --model_dir ckpt2/aeslc/model.ckpt-32000 | tee -a pegasus_output.txt

and it runs on 8 CPU cores; nvidia-smi shows 0 GPU utilization.

How can I fix that?

Env:

mesh-tensorflow==0.1.13
tensor2tensor==1.15.0
tensorboard==1.15.0
tensorboardX==2.0
tensorflow==1.15.3
tensorflow-datasets==3.0.0
tensorflow-estimator==1.15.1
tensorflow-gan==2.0.0
tensorflow-gpu==1.15.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.21.2
tensorflow-probability==0.7.0
tensorflow-text==1.15.0rc0

JingqingZ commented 4 years ago

Hi, I experienced a similar issue before. This probably happens when tensorflow-gpu is installed (if you use pip install) before tensorflow-text (or another tensorflow-* package, sorry, I can't remember which), which depends on the CPU version of tensorflow. I fixed this by installing tensorflow-gpu last when creating the virtualenv. The requirements.txt also lists tensorflow-gpu last.

Hope this helps.

I will update the instructions regarding the ckpt path.
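
To check whether the TensorFlow build that Python actually imports can use a GPU, something like the following sketch may help. The function name `gpu_report` is made up for illustration; the TensorFlow calls it relies on (`tf.test.is_built_with_cuda` and `device_lib.list_local_devices`) are available in TF 1.15. The import is guarded so the script also runs where TensorFlow is missing.

```python
def gpu_report():
    """Summarise whether the importable TensorFlow build can see a GPU.

    `gpu_report` is a hypothetical helper, not part of pegasus.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable in this environment.
        return {"installed": False}

    from tensorflow.python.client import device_lib

    devices = device_lib.list_local_devices()
    return {
        "installed": True,
        "version": tf.__version__,
        # False means the imported package is a CPU-only build.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "devices": [d.name for d in devices],
        "gpu_visible": any(d.device_type == "GPU" for d in devices),
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is False, the CPU wheel is the one being imported. If it is True but `gpu_visible` is False, the cause is more likely on the driver or CUDA library side.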

sshleifer commented 4 years ago

Made a new venv, ran pip install -r requirements.txt, and unfortunately the behavior is identical. What does your pip freeze | grep tensor look like?

JingqingZ commented 4 years ago

From pip freeze | grep tensor:

mesh-tensorflow==0.1.13
tensor2tensor==1.15.0
tensorboard==1.15.0
tensorflow==1.15.2
tensorflow-datasets==3.0.0
tensorflow-estimator==1.15.1
tensorflow-gan==2.0.0
tensorflow-gpu==1.15.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.21.2
tensorflow-probability==0.7.0
tensorflow-text==1.15.0rc0

Since both tensorflow and tensorflow-gpu are installed, you probably need to make sure Python imports tensorflow-gpu instead of tensorflow.
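
A small sketch for spotting this situation programmatically is shown below. It assumes Python 3.8+ for `importlib.metadata` (on 3.7 the `importlib_metadata` backport offers the same interface), and both helper names are hypothetical. As far as I know, the two wheels install into the same `tensorflow` import package, which is why having both present is worth flagging.

```python
from importlib import metadata  # Python 3.8+; assumption, see the note above


def installed_tensor_packages():
    """Return {name: version} for installed distributions with 'tensor' in the name."""
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name and "tensor" in name.lower():
            found[name.lower()] = dist.version
    return found


def has_cpu_gpu_conflict(packages):
    """True when both the 'tensorflow' and 'tensorflow-gpu' distributions are present."""
    return "tensorflow" in packages and "tensorflow-gpu" in packages


if __name__ == "__main__":
    packages = installed_tensor_packages()
    for name in sorted(packages):
        print(f"{name}=={packages[name]}")
    if has_cpu_gpu_conflict(packages):
        print("Both tensorflow and tensorflow-gpu are installed.")
```

If the conflict is reported, one option to try is uninstalling both packages and reinstalling only tensorflow-gpu, so that a single build owns the `tensorflow` import.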

sshleifer commented 4 years ago

Switched machines, because I think tensorflow-gpu==1.15.0 requires CUDA 10.0. That got me to a new error (traceback below).

What does your pip freeze | grep tfds show? I'm at tfds-nightly==1.0.1.dev201903050105

Traceback

    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "pegasus/bin/evaluate.py", line 144, in main
    FLAGS.enable_logging)
  File "/home/shleifer/pegasus/pegasus/eval/text_eval.py", line 153, in text_eval
    for i, features in enumerate(features_iter):
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in predict
    rendezvous.raise_errors()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/opt/conda/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3072, in predict
    yield_single_examples=yield_single_examples):
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 620, in predict
    input_fn, ModeKeys.PREDICT)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 996, in _get_features_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2987, in _call_input_fn
    return input_fn(**kwargs)
  File "/home/shleifer/pegasus/pegasus/data/infeed.py", line 41, in input_fn
    dataset = all_datasets.get_dataset(input_pattern, training)
  File "/home/shleifer/pegasus/pegasus/data/all_datasets.py", line 52, in get_dataset
    dataset, _ = builder.build(input_pattern, shuffle_files)
  File "/home/shleifer/pegasus/pegasus/data/datasets.py", line 200, in build
    dataset, num_examples = self.load(build_name, split, shuffle_files)
  File "/home/shleifer/pegasus/pegasus/data/datasets.py", line 158, in load
    data_dir=self.data_dir)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
TypeError: load() got an unexpected keyword argument 'shuffle_files'

JingqingZ commented 4 years ago

My output: tfds-nightly==3.0.0.dev202004160105

BTW: the code was initially developed with Python 3.6, but I think 3.7 should be fine.
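
The TypeError in the traceback, together with the two tfds-nightly versions reported in this thread, suggests that the older build's `load()` does not take a `shuffle_files` argument; that is an inference, not something confirmed here. A minimal sketch for checking whether a function accepts a given keyword is below. `accepts_keyword`, `old_load`, and `new_load` are hypothetical names; the last two are stand-ins, not the real tensorflow_datasets functions.

```python
import inspect


def accepts_keyword(fn, name):
    """Return True if fn can be called with the keyword argument `name`."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


# Hypothetical stand-ins for an older and a newer load() signature.
def old_load(name, split=None, data_dir=None):
    return name


def new_load(name, split=None, data_dir=None, shuffle_files=False):
    return name


if __name__ == "__main__":
    print(accepts_keyword(old_load, "shuffle_files"))  # False
    print(accepts_keyword(new_load, "shuffle_files"))  # True
```

With tensorflow_datasets installed, the same check could be run as `accepts_keyword(tfds.load, "shuffle_files")`. If `load` is wrapped by a decorator that exposes only `**kwargs`, the helper reports True even when the underlying function would reject the argument, so treat the result as a hint.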