What is the error you get in each case?
tf.get_variable is not about the checkpoint. The model is constructed (and this code is executed) before the checkpoint is loaded.
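For context, the order is roughly the following (a plain TF1-style illustration of that point, not RETURNN internals; the scope, shape and checkpoint path are placeholders loosely taken from the log below):

import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # graph mode, as in RETURNN's TF backend

# 1. Graph construction: variables are created here; any custom getters /
#    reuse_params maps run at this point, with no checkpoint involved yet.
with tf.compat.v1.variable_scope("output/rec/s/rec"):
    w = tf.compat.v1.get_variable("W", shape=(2669, 4000))

# 2. Only afterwards is the checkpoint opened and the values restored.
saver = tf.compat.v1.train.Saver()
with tf.compat.v1.Session() as session:
    saver.restore(session, "/path/to/checkpoint")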
In the first case:
File "/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/returnn/returnn/tf/layers/base.py", line 2106, in ReuseParams.variable_custom_getter
line: assert param_name in self.param_map
locals:
param_name = <local> 'lstm_cell/kernel', len = 16
self = <local> <ReuseParams reuse_layer None, map {'kernel': <ReuseParams reuse_layer None, map None>, 'bias': <ReuseParams reuse_layer None, map None>}>
self.param_map = <local> {'kernel': <ReuseParams reuse_layer None, map None>, 'bias': <ReuseParams reuse_layer None, map None>}
In the second:
File "/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/returnn/returnn/tf/network.py", line 4020, in CustomCheckpointLoader.get_variable_value_map
line: raise tf.errors.NotFoundError(
node_def=None, op=None,
message="CustomCheckpointLoader. could_not_find_map_list: %r" % (could_not_find_map_list,))
locals:
tf = <global> <module 'tensorflow' from '/work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/__init__.py'>
tf.errors = <global> <module 'tensorflow._api.v2.errors' from '/work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/_api/v2/errors/__init__.py'>
tf.errors.NotFoundError = <global> <class 'tensorflow.python.framework.errors_impl.NotFoundError'>
node_def = <not found>
op = <not found>
message = <not found>
could_not_find_map_list = <local> ['output/rec/s/rec/W', 'output/rec/s/rec/W_re', 'output/rec/s/rec/b'], _[0]: {len = 18}
NotFoundError: CustomCheckpointLoader. could_not_find_map_list: ['output/rec/s/rec/W', 'output/rec/s/rec/W_re', 'output/rec/s/rec/b']
In the second case, I thought the function that maps LSTMBlock -> NativeLSTM would map kernel and bias into W, W_re, b, and that the variables would then be loaded during construction. What is also not clear is that the sharing itself seemed to work, since this appears in the log:
Reused variable: <tf.Variable 'output/rec/s/rec/W:0' shape=(2669, 4000) dtype=float32>
Reused variable: <tf.Variable 'output/rec/s/rec/b:0' shape=(4000,) dtype=float32>
Reused variable: <tf.Variable 'output/rec/s/rec/W_re:0' shape=(1000, 4000) dtype=float32>
layer root/output(rec-subnet-output)/'iLMT_readout_in' output: Data{'iLMT_readout_in_output',
[T|'time:var:extern_data:classes'[B],B,F|F'iLMT_s:feature'(1000)]}
Reused variable: <tf.Variable 'output/rec/readout_in/W:0' shape=(3669, 1000) dtype=float32>
Reused variable: <tf.Variable 'output/rec/readout_in/b:0' shape=(1000,) dtype=float32>
layer root/output(rec-subnet-output)/'iLMT_readout' output: Data{'iLMT_readout_output', [T|'time:var:extern_data:classes'[B],B,F|F'iLMT_s:feature//2'(500)]}
layer root/output(rec-subnet-output)/'iLMT_output_prob' output: Data{'iLMT_output_prob_output', [T|'time:var:extern_data:classes'[B],B,F|F'iLMT_output_prob:feature-dense'(10025)]}
Reused variable: <tf.Variable 'output/rec/output_prob/W:0' shape=(500, 10025) dtype=float32>
Reused variable: <tf.Variable 'output/rec/output_prob/b:0' shape=(10025,) dtype=float32>
The first case is obvious, or not? You map kernel and bias in your config, but you should map lstm_cell/kernel and lstm_cell/bias instead.
In the second case, this automatic conversion LSTMBlock -> NativeLSTM only works without such custom variable maps.
You could first use the original config and only replace LSTMBlock by NativeLSTM2, and then load the old checkpoint, and just save it directly. When it loads, this converts the params, and then you have a checkpoint with NativeLSTM2. Then you don't need the conversion later on and you can use custom variable maps.
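A hedged sketch of that one-time conversion as a config fragment (the layer name, dimension and paths here are placeholders, not taken from the actual setup):

# In the otherwise unchanged original config, only swap the LSTM unit:
network['s'] = {'class': 'rec', 'unit': 'NativeLSTM2', 'n_out': 1000}  # was 'unit': 'LSTMBlock'
# Point RETURNN at the old checkpoint; when it loads, the LSTMBlock params get converted.
load = '/path/to/old_lstmblock_checkpoint'
model = '/path/to/new_model_prefix'  # the next saved checkpoint is then already NativeLSTM2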
In general, avoid LSTMBlock, and just always use NativeLSTM2.
I guess this issue can be closed as there is no real issue on RETURNN side. But feel free to post any follow-up questions. Or also ask on Slack in the returnn channel.
If I map it like you said:
'map': {
'lstm_cell/kernel' : {'custom': lambda **_kwargs: get_var('output/rec/s/rec/rnn/lstm_cell/kernel', _kwargs['shape'])},
'lstm_cell/bias' : {'custom': lambda **_kwargs: get_var('output/rec/s/rec/rnn/lstm_cell/bias', _kwargs['shape'])},
}
Then it doesn't see the params: params = []
What do you mean by that? Is there some error? What is the error?
If you want the params to be in the params dict of the layer, you could do the following: in your lambda where you call the get_var function, just pass all the kwargs, like get_var(..., **kwargs). Then you also get base_layer. Extend the code to:
var = tf.get_variable(name, shape)
base_layer.params[name] = var
return var
However, I don't think you actually need that.
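For completeness, a hedged sketch of how those pieces could fit together (this get_var is an illustrative stand-in for the helper in the config, and any kwarg besides shape and base_layer is an assumption):

import tensorflow as tf

def get_var(ckpt_name, shape=None, base_layer=None, **_other_kwargs):
    # Create (or reuse) the variable; this runs at graph-construction time.
    var = tf.compat.v1.get_variable(ckpt_name, shape)
    if base_layer is not None:
        # Register it so it shows up in the layer's params dict.
        base_layer.params[ckpt_name] = var
    return var

# In the reuse_params map, forward all kwargs to get_var:
reuse_params = {
    'map': {
        'lstm_cell/kernel': {'custom': lambda **kwargs: get_var('output/rec/s/rec/rnn/lstm_cell/kernel', **kwargs)},
        'lstm_cell/bias': {'custom': lambda **kwargs: get_var('output/rec/s/rec/rnn/lstm_cell/bias', **kwargs)},
    },
}
# ... which would then go as 'reuse_params': reuse_params into the 'iLMT_s' layer dict.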
Yes, there is an error: /u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/debug/no_params.log
File "/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/returnn/returnn/tf/network.py", line 1101, in TFNetwork.add_layer
line: layer = self._create_layer(name=name, layer_class=layer_class, **layer_desc)
locals:
layer = <not found>
self = <local> <TFNetwork 'root/output(rec-subnet-output)' parent_layer=<RecLayer 'output' out_type=Data{[T|'time:var:extern_data:classes'[B],B], dtype='int32', sparse_dim=Dim{F'classes:sparse-dim'(10025)}}> train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
self._create_layer = <local> <bound method TFNetwork._create_layer of <TFNetwork 'root/output(rec-subnet-output)' parent_layer=<RecLayer 'output' out_type=Data{[T|'time:var:extern_data:classes'[B],B], dtype='int32', sparse_dim=Dim{F'classes:sparse-dim'(10025)}}> train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
name = <local> 'iLMT_s', len = 6
layer_class = <local> <class 'returnn.tf.layers.rec.RnnCellLayer'>
layer_desc = <local> {'L2': 0.0001, 'n_out': 1000, 'reuse_params': <ReuseParams reuse_layer None, map {'lstm_cell/kernel': <ReuseParams reuse_layer None, map None>, 'lstm_cell/bias': <ReuseParams reuse_layer None, map None>}>, 'unit': 'LSTMBlock', '_network': <TFNetwork 'root/output(rec-subnet-output)' parent_layer=<..., len = 7
File "/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/returnn/returnn/tf/network.py", line 1016, in TFNetwork._create_layer
line: layer = layer_class(**layer_desc)
locals:
layer = <not found>
layer_class = <local> <class 'returnn.tf.layers.rec.RnnCellLayer'>
layer_desc = <local> {'L2': 0.0001, 'n_out': 1000, 'reuse_params': <ReuseParams reuse_layer None, map {'lstm_cell/kernel': <ReuseParams reuse_layer None, map None>, 'lstm_cell/bias': <ReuseParams reuse_layer None, map None>}>, 'unit': 'LSTMBlock', '_network': <TFNetwork 'root/output(rec-subnet-output)' parent_layer=<..., len = 10
File "/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/returnn/returnn/tf/layers/rec.py", line 4387, in RnnCellLayer.__init__
line: assert params
locals:
params = <local> []
AssertionError
Ah, this assert is not necessary (or rather wrong) there. I pushed a fix for this. Can you try again?
It is working now, thank you!
Problem
I can't share weights for the LSTM ('iLMT_s') layer during training with optimization of the loop. My suspicion is that RETURNN can't find the proper name scopes for the variables of the layers that are moved out of the loop by the optimization.
Weights are shared with a custom function. The mapping for the LSTMBlock follows the weight names in the checkpoint we restore from.
Case 1: using the same layer kind, i.e. 'class': 'rnn_cell' and 'unit': 'LSTMBlock'.
Log path:
/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/debug/LSTMBlock_sharing.log
Case 2: using a NativeLSTM unit instead of LSTMBlock. But the problem remains, since it can't find the proper names in the checkpoint.
Log path:
/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/debug/nativelstm_sharing.log
Network
Config path:
/u/glushko/setups/librispeech/2021-13-08--ilmt-att-sis/debug/ilmt_debug.config