Can you edit this to resolve the escape codes, remove the prologue and epilogue, and shorten it? Basically only the part from
NotFoundError: No registered 'Const' OpKernel ...
up to
[[ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter]]
is relevant.
Also, just say that you use TF 2.3.0.
And please shortly include the relevant code snippet from the code or config (I guess the part which uses get_shared_vocab).
So I finally found the solution for the problem:
.Input("out_str: string") .Input("scores: float32")
Simply adding .Input("labels: string")
below these two lines in the C++ code solves the problem (or at least the error is not thrown anymore and the config runs through). The function call then changes to:
with tf.device("/cpu:0"):
labels_t = TFUtil.get_shared_vocab(labels)
return get_filtered_score_op()(prev_str, scores, labels_t)
labels is defined by:
from GeneratingDataset import Vocabulary
bpe = {
'bpe_file': '/work/asr3/irie/data/switchboard/subword_clean/ready/swbd_clean.bpe_code_1k',
'vocab_file': '/work/asr3/irie/data/switchboard/subword_clean/ready/vocab.swbd_clean.bpe_code_1k',
}
vocab = Vocabulary.create_vocab(bpe)
labels = vocab.labels # bpe labels ("@@" at end, or not), excluding blank
labels = [(l + " ").replace("@@ ", "").encode("utf8") for l in labels] + [b""]
I am not sure why this is the problem, but it seems that the functions defined in https://github.com/rwth-i6/returnn_common/blob/main/models/transducer/recomb_recog.py don't work without adding the labels option.
I don't exactly understand what you mean. You cannot just add some non-used input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.
Maybe for some strange reason it leads to the effect that the error is gone, but then this is something totally different anyway.
You should first better understand what exactly the problem is here, and then fix that problem. Do not randomly try to change other things.
What about what I suggested, to try a newer TF version?
> I don't exactly understand what you mean. You cannot just add some non-used input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.

In the RETURNN configs here on git (e.g. https://github.com/rwth-i6/returnn-experiments/blob/master/2020-rnn-transducer/configs/rna3c-lm4a.convtrain.switchout6.l2a_1e_4.nohdf.encbottle256.attwb5_am.dec1la-n128.decdrop03.decwdrop03.pretrain_less2_rep6.mlr50.emit2.fl2.fixmask.rna-align-blank0-scratch-swap.encctc.devtrain.config), the labels argument is included. I don't know if there is a reason for that. But I will try to find out what causes the error.
> What about what I suggested, to try a newer TF version?

Yes, I tried with TF 2.4 but this also didn't work.
> I don't exactly understand what you mean. You cannot just add some non-used input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.

> In the RETURNN configs here on git (e.g.), the labels argument is included. I don't know if there is a reason for that.

It was because at some earlier point, I used it inside the op. Then I did not use it anymore and was too lazy to clean that up. So cleaning that up is not the question. And we cannot just leave it in merely because it happens to avoid some other unrelated bug.
Or first, we should understand the problem itself. It's possible that we maybe need some workaround. But this would not be it, and especially not without understanding it.
> I tried with TF 2.4 but this also didn't work.
What about a more recent version, like TF 2.6?
> Or first, we should understand the problem itself.

The problem seems to be related to a TF op with type "String" (DT_STRING) that does not exist for GPU. I think this is not directly related to the CPP code. I am not sure why it did not occur before; maybe for Andre it did place this op on CPU, or for some other reason the data type was already different...
The CPP code is explicitly built for CPU execution, which is why I doubt the error is there.
I'm not sure if the TF Const op is maybe not possible for strings (dtype=tf.string) on GPU. This is also what the error says.
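To make that suspicion concrete, here is a minimal sketch (not code from this issue, just an illustration) of a tf.string Const explicitly pinned to the GPU in graph mode. On a machine with a visible GPU this is expected to fail, since there seems to be no GPU kernel for a string Const; with the device left unspecified, TF would normally place such an op on the CPU automatically.

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode, as used by RETURNN

with tf.device("/device:GPU:0"):
    vocab_t = tf.constant(["<s>", "UNK", "i"])  # Const op with dtype=DT_STRING, pinned to GPU

with tf.compat.v1.Session() as session:
    # Expected to raise an error about no supported GPU kernel for the string Const
    # (assuming soft device placement is not enabled).
    print(session.run(vocab_t))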
I don't really find any documentation saying that, and I wonder why TF is not able to automatically handle this in some way. Or I assume it does already handle it automatically in other cases, as we and others are working fine with strings elsewhere. So maybe there is some TF bug why this automatic handling does not work here, although I don't understand it. Maybe it is related to XLA, or to graph optimizations (it mentions constant folding).
I also don't find many related errors, except maybe this, this, this, this.
But anyway, maybe in get_shared_vocab in RETURNN, we just should add this:

with tf.device("/cpu:0"):

Can you try this?
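Roughly, the idea would look like this. This is only a sketch of the concept; the cache handling and the exact signature of get_shared_vocab in RETURNN's TFUtil are simplified and hypothetical here, not the actual implementation:

import tensorflow as tf

_shared_vocab_cache = {}  # hypothetical stand-in for the global/shared tensor handling in TFUtil

def get_shared_vocab(vocab_strings):
    """Return a shared tf.string constant for the given vocab, reused across calls (sketch only)."""
    key = tuple(vocab_strings)
    if key not in _shared_vocab_cache:
        # Force the string Const onto CPU: there is no GPU kernel for string Consts.
        with tf.device("/cpu:0"):
            _shared_vocab_cache[key] = tf.constant(list(vocab_strings), dtype=tf.string)
    return _shared_vocab_cache[key]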
One idea about why the TF automatic handling does not work properly: maybe get_shared_vocab (or get_vocab_tf) gets called at some early stage, where TF is still in eager mode, or where the graph exists but not the session yet. Then the later get_shared_vocab call will share the same const op from before. The earlier call does not know about the session and possible constraints (what device, etc.), so maybe it registers the op in some strange way.
Changing the out_str function to:

def out_str(source, **kwargs):
    # ["prev:out_str", "output_emit", "output"]
    import tensorflow as tf
    from TFUtil import where_bc
    with tf.device("/cpu:0"):
        return source(0) + where_bc(source(1), get_vocab_sym(source(2)), tf.constant(""))

worked. I first tried only adding it around get_shared_vocab but this didn't work.
> I first tried only adding it around get_shared_vocab but this didn't work.

Not around. Inside it.
> I first tried only adding it around get_shared_vocab but this didn't work.
>
> Not around. Inside it.

I think the problem might be caused by the tf.constant("") in out_str; this would also explain the Const part of the error message. @JackTemaki mentioned this idea to me earlier.
> this would also explain the Const part of the error message

It is this op: ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter
So this clearly is from get_shared_vocab. And then via constant folding it got automatically folded into some other things.
Did you try it anyway inside get_shared_vocab? I think it's anyway needed inside get_shared_vocab. You are maybe just lucky that this was the first call to get_shared_vocab now and thus it worked.
But maybe you need both then: both inside get_shared_vocab (when it is called from other code) and in out_str.
> I first tried only adding it around get_shared_vocab but this didn't work.
>
> Not around. Inside it.

Only adding it inside the get_shared_vocab function does not work for me and throws the same error.
> Only adding it inside the get_shared_vocab function does not work for me and throws the same error.

Really the same, or is the op name different now?
But as said, we should do both then: both inside get_shared_vocab (when it is called from other code) and in out_str.
> Only adding it inside the get_shared_vocab function does not work for me and throws the same error.
>
> Really the same, or is the op name different now?

NotFoundError: No registered 'Const' OpKernel for 'GPU' devices compatible with node {{node ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x7938a8f5f52c9097/Const_enter}}
(OpKernel was found, but attributes didn't match) Requested Attributes: _XlaHasReferenceVars=false, dtype=DT_STRING, value=Tensor<type: string shape: [1031] values: <s> UNK i ...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"
...
[[ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x7938a8f5f52c9097/Const_enter]]
I was able to replicate the TF exception with a small demo. I reported it here: https://github.com/tensorflow/tensorflow/issues/52200
But anyway, it works with what we discussed?

> ... both inside get_shared_vocab (when it is called from other code) and in out_str.

Can you do a PR for that?
Note that the TF control flow behavior V2 does not seem to have the problem, as I tested in my small demo code (tensorflow/tensorflow#52200). However, enabling TF control flow behavior V2 is not ready yet: #700
But anyway, just use the workarounds as discussed, which solve this, right?
> But anyway, it works with what we discussed?
>
> ... both inside get_shared_vocab (when it is called from other code) and in out_str.
>
> Can you do a PR for that?
Yes, the error is solved for me now. I will do a PR tomorrow.
> Maybe for some strange reason it leads to the effect that the error is gone, but then this is something totally different anyway.
I have a theory why the labels variable seems to solve the error: calling get_shared_vocab creates a global tensor which is shared across the computation graph. And because labels_t = TFUtil.get_shared_vocab(labels) was called inside of with tf.device("/cpu:0"): and it was called after out_str, the shared vocab was moved to CPU in the whole graph. Therefore, when leaving out the labels, get_shared_vocab was only called in out_str, which didn't have the with tf.device("/cpu:0"): and therefore it was executed on GPU, which is not allowed for strings.
The problem is solved now anyway, but I thought it would be interesting to know the reason why the unused labels variable seemed to help.
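A toy sketch to illustrate that theory (hypothetical code, not the actual TFUtil implementation): a "shared" tensor that is cached on first creation keeps whatever device scope was active at that first call; any later call just gets the same op back, regardless of the device scope active at that point.

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode, as in RETURNN

_cache = {}  # hypothetical global cache standing in for the shared vocab tensor

def get_shared_const(values):
    key = tuple(values)
    if key not in _cache:
        # The Const op is created here, under whatever tf.device() scope is
        # currently active; that placement then sticks for all later users.
        _cache[key] = tf.constant(list(values), dtype=tf.string)
    return _cache[key]

# First call inside a CPU device scope: the Const is pinned to CPU.
with tf.device("/cpu:0"):
    a = get_shared_const(["<s>", "UNK", "i"])

# Second call without any device scope: the same (CPU-pinned) op is reused.
b = get_shared_const(["<s>", "UNK", "i"])
assert a is b
print(a.device)  # e.g. "/device:CPU:0"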
> I have a theory why the labels variable seems to solve the error: calling get_shared_vocab creates a global tensor which is shared across the computation graph.

Yes, this is what I mentioned before. This is what "shared" means.
> And because labels_t = TFUtil.get_shared_vocab(labels) was called inside of with tf.device("/cpu:0"): and it was called after out_str, the shared vocab was moved to CPU in the whole graph.

It depends where exactly this is called first. This is what I asked you before.
And then, it also depends on how TF handles this. If a const string can only be on CPU anyway, then in the usual cases TF puts it on CPU automatically anyway. By using tf.device("/cpu:0"), you just enforce this. So maybe tf.device("/cpu:0") has no effect normally, except in the cases where TF fails for some reason.
This is exactly what I reproduced and reported here: tensorflow/tensorflow#52200
> Therefore, when leaving out the labels, get_shared_vocab was only called in out_str, which didn't have the with tf.device("/cpu:0"): and therefore it was executed on GPU, which is not allowed for strings.

This does not fully explain it. Normally TF can handle that automatically anyway (as I also would expect). See also tensorflow/tensorflow#52200. When you try simpler variants of the same code, e.g. using control flow V2 (#700), or not having a while_loop at all, the same code works correctly, even when you don't specify the device for the const string.
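For context, the kind of minimal graph-mode setup being discussed looks roughly like this. It is only a sketch, not the exact demo from tensorflow/tensorflow#52200: a string constant created without an explicit device and then read inside a control-flow-V1 while_loop, which is roughly the pattern that constant folding can turn into the .../Const_enter node from the error above on a GPU machine.

import tensorflow as tf

tf.compat.v1.disable_eager_execution()   # graph mode
tf.compat.v1.disable_control_flow_v2()   # control flow V1, as RETURNN still uses (#700)

vocab = tf.constant(["<s>", "UNK", "i"])  # string Const, no explicit device

def body(i, acc):
    # Read the string constant inside the loop body; with control flow V1 this
    # goes through an Enter op (compare the .../Const_enter node name above).
    return i + 1, acc + vocab[i]

i0 = tf.constant(0)
s0 = tf.constant("")
_, out = tf.while_loop(lambda i, _: i < 3, body, [i0, s0])

with tf.compat.v1.Session() as session:
    # Runs fine on a CPU-only machine; on a GPU machine this kind of graph is where
    # the "No registered 'Const' OpKernel for 'GPU' devices" error can show up.
    print(session.run(out))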
On the RETURNN side, this should be fixed now via #702. The other part is just about the config, although you might want to do a PR for some relevant configs in returnn-experiments.
When using the code from https://github.com/rwth-i6/returnn_common/blob/main/models/transducer/recomb_recog.py inside of a transducer experiment (e.g. https://github.com/rwth-i6/returnn-experiments/blob/master/2020-rnn-transducer/configs/rna3c-lm4a.convtrain.switchout6.l2a_1e_4.nohdf.encbottle256.attwb5_am.dec1la-n128.decdrop03.decwdrop03.pretrain_less2_rep6.mlr50.emit2.fl2.fixmask.rna-align-blank0-scratch-swap.encctc.devtrain.config), there is the following error:
The C++ code is defined as follows:
and is called via:
The out_str variable comes from:

and this is also where the get_shared_vocab function is called (labels_t = TFUtil.get_shared_vocab(labels)), which seems to cause the error.
We currently think that this could be an issue with the TF version (2.3) I am currently using. @albertz