rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

NotFoundError: No registered 'Const' OpKernel for 'GPU' devices #694

Closed robin-p-schmitt closed 3 years ago

robin-p-schmitt commented 3 years ago

When using the code from https://github.com/rwth-i6/returnn_common/blob/main/models/transducer/recomb_recog.py inside a transducer experiment (e.g. https://github.com/rwth-i6/returnn-experiments/blob/master/2020-rnn-transducer/configs/rna3c-lm4a.convtrain.switchout6.l2a_1e_4.nohdf.encbottle256.attwb5_am.dec1la-n128.decdrop03.decwdrop03.pretrain_less2_rep6.mlr50.emit2.fl2.fixmask.rna-align-blank0-scratch-swap.encctc.devtrain.config), I get the following error:

NotFoundError: No registered 'Const' OpKernel for 'GPU' devices compatible with node {{node ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter}}
     (OpKernel was found, but attributes didn't match) Requested Attributes: _XlaHasReferenceVars=false, dtype=DT_STRING, value=Tensor<type: string shape: [1031] values: <s>  UNK  i ...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"
    .  Registered:  device='XLA_CPU_JIT'; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64, DT_STRING]
  device='XLA_GPU_JIT'; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64, DT_STRING]
  device='CPU'
  device='GPU'; dtype in [DT_VARIANT]
  device='GPU'; dtype in [DT_BOOL]
  device='GPU'; dtype in [DT_COMPLEX128]
  device='GPU'; dtype in [DT_COMPLEX64]
  device='GPU'; dtype in [DT_UINT64]
  device='GPU'; dtype in [DT_INT64]
  device='GPU'; dtype in [DT_QINT32]
  device='GPU'; dtype in [DT_UINT32]
  device='GPU'; dtype in [DT_QUINT16]
  device='GPU'; dtype in [DT_QINT16]
  device='GPU'; dtype in [DT_INT16]
  device='GPU'; dtype in [DT_UINT16]
  device='GPU'; dtype in [DT_QINT8]
  device='GPU'; dtype in [DT_INT8]
  device='GPU'; dtype in [DT_UINT8]
  device='GPU'; dtype in [DT_DOUBLE]
  device='GPU'; dtype in [DT_FLOAT]
  device='GPU'; dtype in [DT_BFLOAT16]
  device='GPU'; dtype in [DT_HALF]
  device='GPU'; dtype in [DT_INT32]
  device='XLA_CPU'; dtype in [DT_UINT8, DT_QUINT8, DT_UINT16, DT_INT8, DT_QINT8, ..., DT_DOUBLE, DT_COMPLEX64, DT_COMPLEX128, DT_BOOL, DT_BFLOAT16]
  device='XLA_GPU'; dtype in [DT_UINT8, DT_QUINT8, DT_UINT16, DT_INT8, DT_QINT8, ..., DT_DOUBLE, DT_COMPLEX64, DT_COMPLEX128, DT_BOOL, DT_BFLOAT16]

     [[ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter]]

The C++ code is defined as follows:

def get_filtered_score_op(verbose=False):
  """
  :return: TF op
  """
  cpp_code = """
    #include "tensorflow/core/framework/op.h"
    #include "tensorflow/core/framework/op_kernel.h"
    #include "tensorflow/core/framework/shape_inference.h"
    #include "tensorflow/core/framework/resource_mgr.h"
    #include "tensorflow/core/framework/resource_op_kernel.h"
    #include "tensorflow/core/framework/tensor.h"
    #include "tensorflow/core/platform/macros.h"
    #include "tensorflow/core/platform/mutex.h"
    #include "tensorflow/core/platform/types.h"
    #include "tensorflow/core/public/version.h"
    #include <cmath>
    #include <map>
    #include <set>
    #include <string>
    #include <tuple>
    using namespace tensorflow;
    REGISTER_OP("GetFilteredScore")
    .Input("out_str: string")
    .Input("scores: float32")
    .Output("new_scores: float32")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
        c->set_output(0, c->input(1));
        return Status::OK();
    });
    class GetFilteredScoreOp : public OpKernel {
    public:
    using OpKernel::OpKernel;
    void Compute(OpKernelContext* context) override {
        const Tensor* out_str = &context->input(0);
        const Tensor* scores = &context->input(1);
        int n_batch = out_str->shape().dim_size(0);
        int n_beam = out_str->shape().dim_size(1);
        Tensor* ret;
        OP_REQUIRES_OK(context, context->allocate_output(0, TensorShape({n_batch, n_beam}), &ret));
        for(int bat = 0; bat < n_batch; ++bat)
            for(int hyp = 0; hyp < n_beam; ++hyp)
                ret->tensor<float, 2>()(bat, hyp) = scores->tensor<float, 2>()(bat, hyp);
        for(int bat = 0; bat < n_batch; ++bat) {
            std::map<tstring, std::set<int> > new_hyps;  // seq -> set of hyp idx
            for(int hyp = 0; hyp < n_beam; ++hyp) {
                auto& seq_set = new_hyps[out_str->tensor<tstring, 2>()(bat, hyp)];
                seq_set.insert(hyp);
            }
            for(const auto& items : new_hyps) {
                if(std::get<1>(items).size() > 1) {
                    float best_score = 0.;
                    int best_idx = -1;
                    for(int idx : std::get<1>(items)) {
                        float score = scores->tensor<float, 2>()(bat, idx);
                        if(score > best_score || best_idx == -1) {
                            best_score = score;
                            best_idx = idx;
                        }
                    }
                    float sum_score = 0.;
                    for(int idx : std::get<1>(items)) {
                        float score = scores->tensor<float, 2>()(bat, idx);
                        sum_score += expf(score - best_score);
                    }
                    sum_score = logf(sum_score) + best_score;
                    for(int idx : std::get<1>(items)) {
                        if(idx != best_idx)
                            ret->tensor<float, 2>()(bat, idx) = -std::numeric_limits<float>::infinity();
                        else
                            ret->tensor<float, 2>()(bat, idx) = sum_score;
                    }
                }
            }
        }
    }
    };
    REGISTER_KERNEL_BUILDER(Name("GetFilteredScore").Device(DEVICE_CPU), GetFilteredScoreOp);
    """
  from returnn.tf.util.basic import OpCodeCompiler
  compiler = OpCodeCompiler(
    base_name="GetFilteredScore", code_version=1, code=cpp_code,
    is_cpp=True, use_cuda_if_available=False, verbose=verbose)
  tf_mod = compiler.load_tf_module()
  return tf_mod.get_filtered_score

and is called via:

with tf.device("/cpu:0"):
    return get_filtered_score_op()(out_str, scores)

The out_str variable comes from:

def get_vocab_tf():
    from GeneratingDataset import Vocabulary
    import TFUtil
    import tensorflow as tf
    vocab = Vocabulary.create_vocab(**sprint_interface_dataset_opts["bpe"])
    labels = vocab.labels  # bpe labels ("@@" at end, or not), excluding blank
    labels = [(l + " ").replace("@@ ", "") for l in labels] + [""]
    labels_t = TFUtil.get_shared_vocab(labels)
    return labels_t

def get_vocab_sym(i):
    """
    :param tf.Tensor i: e.g. [B], int32
    :return: same shape as input, string
    :rtype: tf.Tensor
    """
    import tensorflow as tf
    return tf.gather(params=get_vocab_tf(), indices=i)

def out_str(source, **kwargs):
    # ["prev:out_str", "output_emit", "output"]
    import tensorflow as tf
    from TFUtil import where_bc
    return source(0) + where_bc(source(1), get_vocab_sym(source(2)), tf.constant(""))

and this is also where the get_shared_vocab function is called (labels_t = TFUtil.get_shared_vocab(labels)), which seems to cause the error.

We currently suspect that this could be an issue with the TF version (2.3) that I am using. @albertz

albertz commented 3 years ago

Can you edit this to resolve the escape codes, remove prologue and epilogue, and shorten it. Basically only the NotFoundError: No registered 'Const' OpKernel ... up to [[ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter]] is relevant.

And you can just say that you use TF 2.3.0.

And also briefly include the relevant code snippet from the code or config (I guess the part which uses get_shared_vocab).

robin-p-schmitt commented 3 years ago

So I finally found the solution for the problem:

.Input("out_str: string")
.Input("scores: float32")

Simply adding .Input("labels: string") below these two lines in the C++ code solves the problem (or at least the error is not thrown anymore and the config runs through). The function call then changes to:

with tf.device("/cpu:0"):
    labels_t = TFUtil.get_shared_vocab(labels)
    return get_filtered_score_op()(prev_str, scores, labels_t)

labels is defined by:

from GeneratingDataset import Vocabulary
bpe = {
        'bpe_file': '/work/asr3/irie/data/switchboard/subword_clean/ready/swbd_clean.bpe_code_1k',
        'vocab_file': '/work/asr3/irie/data/switchboard/subword_clean/ready/vocab.swbd_clean.bpe_code_1k',
}
vocab = Vocabulary.create_vocab(**bpe)
labels = vocab.labels  # bpe labels ("@@" at end, or not), excluding blank
labels = [(l + " ").replace("@@ ", "").encode("utf8") for l in labels] + [b""]

I am not sure why this was the problem, but it seems that the functions defined in https://github.com/rwth-i6/returnn_common/blob/main/models/transducer/recomb_recog.py don't work without adding the labels input.

albertz commented 3 years ago

I don't exactly understand what you mean. You cannot just add some unused input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.

Maybe for some strange reason it has the effect that the error is gone, but then this is something totally different anyway.

You should better understand what exactly is the problem here. And then just fix the problem. Do not randomly try to change other things.

albertz commented 3 years ago

What about what I suggested, to try a new TF version?

robin-p-schmitt commented 3 years ago

I don't exactly understand what you mean. You cannot just add some unused input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.

In the RETURNN configs here on GitHub (e.g. https://github.com/rwth-i6/returnn-experiments/blob/master/2020-rnn-transducer/configs/rna3c-lm4a.convtrain.switchout6.l2a_1e_4.nohdf.encbottle256.attwb5_am.dec1la-n128.decdrop03.decwdrop03.pretrain_less2_rep6.mlr50.emit2.fl2.fixmask.rna-align-blank0-scratch-swap.encctc.devtrain.config), the labels argument is included. I don't know if there is a reason for that, but I will try to find out what causes the error.

robin-p-schmitt commented 3 years ago

What about what I suggested, to try a new TF version?

Yes, I tried with TF 2.4, but this also didn't work.

albertz commented 3 years ago

I don't exactly understand what you mean. You cannot just add some unused input to this unrelated op. That doesn't make sense. The GetFilteredScore TF op only has two inputs.

In the RETURNN configs here on GitHub (e.g.), the labels argument is included. I don't know if there is a reason for that.

It was because at some earlier point, I used it inside the op. Then I did not use it anymore and was too lazy to clean that up. So, cleaning that up is not the question. We cannot just leave it in because this avoids some other unrelated bug.

Or rather, first we should understand the problem itself. It's possible that we need some workaround, but this would not be it, and especially not without understanding it.

albertz commented 3 years ago

I tried with TF 2.4, but this also didn't work

What about a more recent version, like TF 2.6?

JackTemaki commented 3 years ago

Or first, we should understand the problem itself

The problem seems to be related to a TF op with type string (DT_STRING) for which no GPU kernel exists. I think this is not directly related to the C++ code. I am not sure why it did not occur before; maybe for Andre TF placed this op on the CPU, or for some other reason the data type was already different...

The C++ code is explicitly built for CPU execution, which is why I doubt the error is there.

albertz commented 3 years ago

I'm not sure if the TF Const op is maybe not possible for strings (dtype=tf.string) on GPU. This is also what the error says.

I don't really find any documentation saying that, and I wonder why TF is not able to handle this automatically in some way. Or I assume it already does handle it automatically in other cases, as we and others are working fine with strings elsewhere. So maybe there is some TF bug which makes this automatic handling fail here, although I don't understand it. Maybe it is related to XLA, or to graph optimizations (the error mentions constant folding).

I also don't find many related errors, except maybe this, this, this, this.
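
As a quick sanity check of that claim (not from the original discussion, just a hedged sketch): with soft placement disabled, a string constant explicitly pinned to the GPU should fail at placement time on a machine that actually has a GPU, which matches the kernel list in the error above (DT_STRING is missing from the GPU 'Const' kernels):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

graph = tf.Graph()
with graph.as_default():
    with tf.device("/device:GPU:0"):
        vocab = tf.constant(["<s>", "UNK", "i"])  # DT_STRING Const pinned to the GPU

config = tf.compat.v1.ConfigProto(allow_soft_placement=False)
with tf.compat.v1.Session(graph=graph, config=config) as sess:
    try:
        sess.run(vocab)
    except tf.errors.InvalidArgumentError as exc:
        # Expected on a GPU machine: no GPU kernel can be assigned for a string Const.
        print("placement failed:", exc.message.splitlines()[0])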

But anyway, maybe in get_shared_vocab in RETURNN, we just should add this:

with tf.device("/cpu:0"):

Can you try this?
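
For illustration, a rough sketch of what that suggestion could look like (this is not the actual RETURNN implementation; the real get_shared_vocab lives in TFUtil / returnn.tf.util.basic and caches the constant via a graph-level global tensor, which is where the global_tensor_shared_vocab_... op name in the error comes from; the dict cache below is only a stand-in for that mechanism):

import tensorflow as tf

_shared_vocab_cache = {}  # stand-in for RETURNN's graph-level global-tensor cache

def get_shared_vocab(vocab_strings):
    """
    Sketch only: returns one shared string constant per vocab.

    :param list[str] vocab_strings:
    :rtype: tf.Tensor
    """
    key = tuple(vocab_strings)
    if key not in _shared_vocab_cache:
        # The suggested fix: always create the string constant on the CPU,
        # since there is no GPU 'Const' kernel for DT_STRING.
        with tf.device("/cpu:0"):
            _shared_vocab_cache[key] = tf.constant(vocab_strings, name="shared_vocab")
    return _shared_vocab_cache[key]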

albertz commented 3 years ago

One idea about why the TF automatic handling does not work properly: Maybe get_shared_vocab (or get_vocab_tf) gets called at some early stage, where TF is still in eager mode, or where the graph exists but not the session yet. Then the later get_shared_vocab call will share the same const op from before. The earlier call does not know about the session and possible constraints (what device, etc), so maybe it registers the op in some strange way.

robin-p-schmitt commented 3 years ago

Changing the out_str function to:

def out_str(source, **kwargs):
    # ["prev:out_str", "output_emit", "output"]
    import tensorflow as tf
    from TFUtil import where_bc
    with tf.device("/cpu:0"):
        return source(0) + where_bc(source(1), get_vocab_sym(source(2)), tf.constant(""))

worked. I first tried only adding it around get_shared_vocab but this didn't work.

albertz commented 3 years ago

I first tried only adding it around get_shared_vocab but this didn't work.

Not around. Inside it.

robin-p-schmitt commented 3 years ago

I first tried only adding it around get_shared_vocab but this didn't work.

Not around. Inside it.

I think the problem might be caused by the tf.constant("") in out_str, which would also explain the Const part of the error message. @JackTemaki mentioned this idea to me earlier.

albertz commented 3 years ago

this would also explain the Const part of the error message

It is this op: ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x479c99809419f4b4/Const_enter

So this clearly is from get_shared_vocab, and then it was automatically folded into some other ops via constant folding.

Did you try it anyway inside get_shared_vocab? I think it's anyway needed inside get_shared_vocab. You are maybe just lucky that this was the first call to get_shared_vocab now and thus it worked.

But maybe you need both then. Both inside get_shared_vocab (when it is called from other code) and in out_str.

robin-p-schmitt commented 3 years ago

I first tried only adding it around get_shared_vocab but this didn't work.

Not around. Inside it.

Only adding it inside the get_shared_vocab function does not work for me and throws the same error.

albertz commented 3 years ago

Only adding it inside the get_shared_vocab function does not work for me and throws the same error.

Really the same, or is the op name different now?

albertz commented 3 years ago

But as said, we should do both then: Both inside get_shared_vocab (when it is called from other code) and in out_str.

robin-p-schmitt commented 3 years ago

Only adding it inside the get_shared_vocab function does not work for me and throws the same error.

Really the same, or is the op name different now?

NotFoundError: No registered 'Const' OpKernel for 'GPU' devices compatible with node {{node ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x7938a8f5f52c9097/Const_enter}}
         (OpKernel was found, but attributes didn't match) Requested Attributes: _XlaHasReferenceVars=false, dtype=DT_STRING, value=Tensor<type: string shape: [1031] values: <s>  UNK  i ...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"
...
         [[ConstantFolding/output/rec/out_str/global_tensor_shared_vocab_0x7938a8f5f52c9097/Const_enter]]

albertz commented 3 years ago

I was able to replicate the TF exception with a small demo. I reported it here: https://github.com/tensorflow/tensorflow/issues/52200
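
The actual demo is in the linked TF issue; the following is only a rough guess at the shape of such a reproduction, based on the op name in the error (a string constant entering a while_loop under control flow V1, which constant folding then tries to place on the GPU). Whether it actually triggers the NotFoundError depends on the TF version, on a visible GPU, and on the grappler constant-folding pass:

import tensorflow as tf

tf.compat.v1.disable_eager_execution()
tf.compat.v1.disable_control_flow_v2()  # the problem does not seem to occur with control flow V2

graph = tf.Graph()
with graph.as_default():
    vocab = tf.constant(["<s>", "UNK", "i", "you"])  # string const, device left unspecified

    def body(i, s):
        # Using the string constant inside the loop body creates the ".../Const_enter"
        # node that constant folding later tries to place on the GPU.
        return i + 1, s + tf.gather(vocab, i)

    _, out = tf.while_loop(
        cond=lambda i, s: i < 3,
        body=body,
        loop_vars=(tf.constant(0), tf.constant("")))

with tf.compat.v1.Session(graph=graph) as sess:
    print(sess.run(out))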

albertz commented 3 years ago

But anyway, it works with what we discussed?

... Both inside get_shared_vocab (when it is called from other code) and in out_str.

Can you do a PR for that?

albertz commented 3 years ago

Note that the TF control flow behavior V2 does not seem to have the problem, as I tested in my small demo code (tensorflow/tensorflow#52200). However, enabling TF control flow behavior V2 is not ready yet: #700

But anyway, just use the workarounds as discussed, which solve this, right?

robin-p-schmitt commented 3 years ago

But anyway, it works with what we discussed?

... Both inside get_shared_vocab (when it is called from other code) and in out_str.

Can you do a PR for that?

Yes, the error is solved for me now. I will do a PR tomorrow.

robin-p-schmitt commented 3 years ago

Maybe for some strange reason it leads to the effect that the error is gone, but then this is sth totally different anyway.

I have a theory why the labels variable seems to solve the error: calling get_shared_vocab creates a global tensor which is shared across the computation graph. And because labels_t = TFUtil.get_shared_vocab(labels) was called inside of with tf.device("/cpu:0"): and it was called after out_str, the shared vocab was moved onto the CPU in the whole graph. Therefore, when leaving out the labels, get_shared_vocab was only called in out_str which didn't have the with tf.device("/cpu:0"): and therefore it was executed on GPU, which is not allowed for strings.

The problem is solved now anyway, but I thought it would be interesting to know the reason why the unused labels variable seemed to help.
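
To make that reasoning concrete, a toy illustration (with a hypothetical shared_const helper, not RETURNN code): the cached tensor is created exactly once, so only the device scope active at the first call matters; later calls under a different tf.device scope just get the same op back:

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

_cache = {}  # hypothetical stand-in for the graph-global cache

def shared_const(values, name):
    # Creates the constant only on the first call; later calls reuse the cached op.
    if name not in _cache:
        _cache[name] = tf.constant(values, name=name)
    return _cache[name]

with tf.Graph().as_default():
    a = shared_const(["x", "y"], "vocab")  # first call, no device scope
    with tf.device("/cpu:0"):
        b = shared_const(["x", "y"], "vocab")  # same op; the device scope here has no effect
    assert a is b
    print(a.op.device)  # empty: the (lack of) device constraint from the first call sticks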

albertz commented 3 years ago

I have a theory why the labels variable seems to solve the error: calling get_shared_vocab creates a global tensor which is shared across the computation graph.

Yes, this is what I mentioned before. This is what "shared" means.

And because labels_t = TFUtil.get_shared_vocab(labels) was called inside of with tf.device("/cpu:0"): and it was called after out_str, the shared vocab was moved onto the CPU in the whole graph.

It depends where exactly this is called first. This is what I asked you before.

And then, it also depends on how TF handles this. If a const string can only be on CPU anyway, in the usual cases, TF anyway puts it on CPU automatically. By using tf.device("/cpu:0"), you just enforce this. So maybe tf.device("/cpu:0") has no effect normally, except in the cases where TF fails for some reason.

This is exactly what I reproduced and reported here: tensorflow/tensorflow#52200

Therefore, when leaving out the labels, get_shared_vocab was only called in out_str which didn't have the with tf.device("/cpu:0"): and therefore it was executed on GPU, which is not allowed for strings.

This does not fully explain it. Normally TF can handle that automatically anyway (as I would also expect). See also tensorflow/tensorflow#52200. When you try simpler variants of the same code, e.g. using control flow V2 (#700), or not having a while_loop at all, it works correctly, even without specifying the device for the const string.

albertz commented 3 years ago

On RETURNN side, this should be fixed now via #702. The other part is just about the config. Although you might want to do a PR for some relevant configs in returnn-experiments.