neccam / nslt

Neural Sign Language Translation (CVPR'18)
Apache License 2.0
149 stars, 41 forks

Problem during training #11

Closed: vbelissen closed this issue 5 years ago

vbelissen commented 5 years ago

@neccam Hi, I still have the same problem that @hec44 pointed out, even with your change os.environ["CUDA_VISIBLE_DEVICES"] = "0":

tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: bad argument type for built-in operation
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING, DT_BOOL], Tout=[DT_FLOAT], token="pyfunc_5"](arg0, PyFunc/input_1)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[300,227,227,3], [1]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"]]]
         [[Node: dynamic_seq2seq/decoder/decoder/while/TensorArrayWrite_1/TensorArrayWriteV3/_189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1669_dynamic_seq2seq/decoder/decoder/while/TensorArrayWrite_1/TensorArrayWriteV3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]]]

Here is the command I used:

    python -m nmt --src=sign --tgt=de --train_prefix=../Data/phoenix2014T.train --dev_prefix=../Data/phoenix2014T.dev --test_prefix=../Data/phoenix2014T.test --out_dir=../test_out/ --vocab_prefix=../Data/phoenix2014T.vocab --source_reverse=True --num_units=1000 --num_layers=4 --num_train_steps=150000 --residual=True --attention=luong --base_gpu=0 --unit_type=gru --batch_size=1 --num_gpus=1

Would you have any insight on how to solve this problem?

neccam commented 5 years ago

Quite strange. Which version of TensorFlow are you using? Could you give it a try with TensorFlow 1.2.1 or TensorFlow 1.3.0?
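When answering a version question like this, it helps to report what the training interpreter actually loads rather than what pip lists. A minimal sketch, not part of the nslt repository: environment_report is a hypothetical helper that uses only the standard library and records TensorFlow as None if it cannot be imported.

```python
import importlib
import platform
import sys


def environment_report():
    """Collect the Python and TensorFlow versions of the current interpreter."""
    report = {
        "python": platform.python_version(),
        "python_major": sys.version_info[0],
    }
    try:
        tf = importlib.import_module("tensorflow")
        report["tensorflow"] = getattr(tf, "__version__", "unknown")
    except ImportError:
        # TensorFlow is not importable from this interpreter.
        report["tensorflow"] = None
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

Running it with the same python binary used for python -m nmt shows both the interpreter version and the TensorFlow version in one place.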

vbelissen commented 5 years ago

Thanks for the quick answer! I tried tensorflow-gpu 1.3.0 and 1.4.0. I'll try 1.2.1 next.

vbelissen commented 5 years ago

Well, unfortunately the same error happens with TensorFlow 1.2.1 too...

neccam commented 5 years ago

That is strange. It seems to be complaining about receiving a bad argument in one of the py_func calls in iterator_utils. I will look into it as soon as possible and get back to you.
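Some background on why the interpreter version can matter inside a py_func: in TensorFlow 1.x, tf.py_func hands DT_STRING inputs to the wrapped Python function as bytes. On Python 2, bytes and str are the same type, so code written there can treat the value as an ordinary string; on Python 3, passing that bytes value to an API that expects str can raise a TypeError, which would be consistent with the error above. Whether iterator_utils fails for exactly this reason is an assumption. The helper below (to_native_str, a hypothetical name, not nslt code) is a minimal sketch of a defensive decode.

```python
def to_native_str(value, encoding="utf-8"):
    """Return value as the native str type on both Python 2 and Python 3.

    On Python 2, bytes is str, so the value is returned unchanged.
    On Python 3, bytes (including numpy.bytes_, a bytes subclass) is decoded.
    """
    if isinstance(value, bytes) and not isinstance(value, str):
        return value.decode(encoding)
    return value

# Hypothetical TF 1.x usage inside a py_func body with Tin=[DT_STRING, DT_BOOL]:
#   def load_sample(path, flag):
#       path = to_native_str(path)
#       ...
#   frames = tf.py_func(load_sample, [path, flag], tf.float32)
```

Decoding at the top of the py_func body keeps the rest of the function unchanged and is a no-op on Python 2.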

zjzhengyin commented 5 years ago

@neccam Hi, I also have this problem. What is wrong with the code, and what can I do?

tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: bad argument type for built-in operation
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING, DT_BOOL], Tout=[DT_FLOAT], token="pyfunc_5"](arg0, PyFunc/input_1)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[300,227,227,3], [1]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"]]]
         [[Node: dynamic_seq2seq/decoder/decoder/while/TensorArrayWrite_2/TensorArrayWriteV3/_169 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0

neccam commented 5 years ago

Hi, I think I found the cause of the issue: I believe it is the Python version.

I was able to train with this code using my old environment, which is based on TensorFlow 1.4, CUDA 8.0, cuDNN 6.0 and Python 2.7.

When I created a new environment with the same versions but Python 3.5 instead, I received the following error, which is the same as yours:

tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: bad argument type for built-in operation
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING, DT_BOOL], Tout=[DT_FLOAT], token="pyfunc_5"](arg0, PyFunc/input_1)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[300,227,227,3], [1]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]

I have updated the README to reflect that you need Python 2.7 to run the code.

Thanks for finding out and creating the issue :+1:
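Because the resolution here is an environment requirement rather than a code change, one option is to fail fast at startup with a readable message instead of the opaque py_func error. A minimal sketch, assuming it is called at the top of the training entry point; check_python_version and REQUIRED_PYTHON are hypothetical names, not part of nslt, and the (2, 7) value simply encodes the README requirement described above.

```python
import sys

# Hypothetical constant: the thread reports training works on 2.7 and fails on 3.5.
REQUIRED_PYTHON = (2, 7)


def check_python_version(version_info=None):
    """Raise RuntimeError unless running on the required major.minor Python."""
    if version_info is None:
        version_info = sys.version_info
    found = tuple(version_info[:2])
    if found != REQUIRED_PYTHON:
        raise RuntimeError(
            "This code expects Python %d.%d, but is running on Python %d.%d"
            % (REQUIRED_PYTHON + found))
    return found
```

Accepting version_info as a parameter keeps the check testable without switching interpreters.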

Claire874 commented 1 year ago

How should I configure the environment? It seems that Python 2.7 is not supported by the TensorFlow version I am trying to install.