albertz opened this issue 2 years ago
The current goal is that you can run `python3 rnn.py demos/demo-torch.config` and it starts the training.
Note that I created the config option `backend`, so you do `backend = "torch"` in the config to enable the PyTorch backend.
I also created an initial `returnn.torch` package/directory, with a dummy `Engine` in `engine.py`.
Feel free to create a `data_pipeline.py` file when you start working on implementing the dataset and related code.
Btw, as we work in master, please make sure the tests are passing. There is no PyTorch test at all yet, so the only test relevant for us currently is about code inspections, like PEP8 etc. Please double check that there are no warnings on the code before you commit.
For commit messages, maybe prefix them with "PyTorch" or "Torch" or so.
For the main train loop, see the PyTorch quickstart tutorial (slightly simplified, adapted):
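Roughly, the tutorial's loop looks like this (paraphrased and simplified here; `device`, the model and the loss function are assumed to be defined elsewhere):

```python
def train(dataloader, model, loss_fn, optimizer):
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```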
So, in the RETURNN `Engine`, we would also have the same loop over the batches:

```python
for batch, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
```
We would also have such a `model` instance. This is somewhat like the root `TFNetwork`, or the root module in returnn-common. It is necessary to have such a root module to have unique param names. See the corresponding discussion in returnn-common.
As I understand the PyTorch code, I think we need to have one single loss in the end, such that we have a single `loss.backward()` call. I think multiple `backward()` calls would lead to backprop multiple times, and this would be stupid. But this is not really a problem: we just need to do the summation of all losses you potentially have. We also do this for TF/Theano.
Maybe, like in returnn-common, we can have the same `mark_as_loss` API. So, some initial config API suggestions:

```python
def get_model(*, epoch: int, **_unused_kwargs) -> torch.nn.Module:
    ...
```

This would create the `model` instance. I'm not sure if we really need to or should pass `extern_data` here, as you should have all the relevant information anyway in the config. But maybe it makes it simpler to write it? Originally I also thought about just providing `class Model(torch.nn.Module)`, with the API that `Model()` should work.

```python
def train_step(*, extern_data: ExternData, model: Model, ctx: TrainCtx):
    ...
```

Here you would do:

```python
pred = model(extern_data.data["inputs"].placeholder)
loss = F.cross_entropy(pred, y)
ctx.mark_as_loss(loss)
```
The `TrainCtx` provides this `mark_as_loss` method, which would just collect the losses, maybe together with a name. Then in the engine, the train loop could look sth like this:

```python
model.train()
ctx = TrainCtx()

for batch, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    extern_data.set(X, y)  # ...
    pred = train_step(extern_data=extern_data, model=model, ctx=ctx)
    loss = ctx.total_loss()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
I'm implementing a dataset wrapper. The code is still too ugly for a pull request, so first a comment:
The general interface is clear:

```python
from torch.utils.data import IterableDataset


class DatasetWrapper(IterableDataset):
    def __init__(self, returnn_dataset):
        ...

    def __iter__(self):
        ...
```
And then in `engine.train()` something like:

```python
from torch.utils.data import DataLoader

train_data = DatasetWrapper(self.train_dataset)
data_loader = DataLoader(train_data)

for batch_index, batch_data in enumerate(data_loader):
    ...
```
Most intuitive would be to let `DatasetWrapper.__iter__()` return a generator over single sequences and then do the batching via the arguments to `DataLoader`. However, for `IterableDataset` the `batch_sampler` and `collate_fn` arguments are not available. You can set `DataLoader(train_data, batch_size=42)`, but this would really just put a constant number of sequences into a batch, not a constant number of time frames or the other, more sophisticated logic we want to have.
So I think `DatasetWrapper.__iter__()` already has to provide the batches. I'm currently trying to figure out how to best reuse the existing batching code, i.e. `Dataset.generate_batches()`, `BatchSetGenerator`, `FeedDictDataProvider.get_next_batch()` etc. Will continue tomorrow...
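For illustration, a rough sketch of this direction, i.e. the wrapper itself yielding batches (the batching criterion here is just a placeholder, not the real batching logic):

```python
from torch.utils.data import IterableDataset


class DatasetWrapper(IterableDataset):
    """Wraps a RETURNN dataset and yields already-batched data (sketch only)."""

    def __init__(self, returnn_dataset, max_seqs=8):
        self._dataset = returnn_dataset
        self._max_seqs = max_seqs  # placeholder criterion: constant number of seqs per batch

    def __iter__(self):
        batch = []
        for seq in self._iter_seqs():
            batch.append(seq)
            if len(batch) >= self._max_seqs:
                yield batch
                batch = []
        if batch:
            yield batch

    def _iter_seqs(self):
        # Would pull single sequences out of the wrapped RETURNN dataset
        # (init_seq_order, load_seqs, get_data, ...); omitted in this sketch.
        raise NotImplementedError
```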
@albertz, it looks to me that having a separate loss function in the training loop is not strictly necessary. Those available loss functions output normal Tensors. So you could have the loss as part of the Module if you want "all calculations to be equal". But that would rather be suitable when defining a network dict, not so much when loading an existing Module, because people normally don't do it that way...
No, the `IterableDataset` is supposed to return individual sequences, not batches.
I think the `DataLoader` is supposed to do that. Yes, I have read somewhere that it is a bit limited in functionality, in that it only supports fixed-size batches. But I'm sure there are other solutions, other `DataLoaderExt` implementations or whatever. Maybe just check how other frameworks like Fairseq have done this.
> it looks to me that having a separate loss function in the training loop is not strictly necessary.

I never said that? See my suggestion on how to do it, i.e. using the `TrainCtx`.

> So you could have the loss as part of the Module

This is now totally up to the user. You could either totally decouple it, i.e. your model just is the model, without losses, and then you have separate code to calculate the losses. Or you could mix it together and define the losses already within the model (when you pass `TrainCtx` to it). In returnn-common, we have the same situation now.
Btw, I don't like having PRs in this early phase of development. PRs just slow it down. I think we can all directly work in the master. PRs are there to avoid breaking anything, and to discuss and review the code. We cannot really break anything, as it is not functional anyway. And for discussion, we can also discuss right here. And for review, you should feel responsible to do that anyway.
If sth is unclear, better discuss here before starting to implement anyway. If we have some rough understanding on how to implement it, we should just start, and then iterate on the code, in master directly.
I checked the implementation of datasets for Fairseq and k2 (lhotse repo). Actually, both inherit from `torch.utils.data.Dataset` as opposed to `torch.utils.data.IterableDataset`, so both are map-style datasets. I've seen some tendency from the community to choose map-style datasets as opposed to iterable-style datasets even for big datasets, for instance here and here, the arguments being that the `DataLoader` can handle the batches and the iteration over the dataset (or one can do it in a custom way, see k2 below), that memory management can still be efficient with a `Dataset`, and that additional things have to be taken care of when using `IterableDataset` (see here, example 2).
**k2**

The `__getitem__()` method of the `K2SpeechRecognitionDataset` class directly returns a batch, which could allow for "custom" batches. Because of this, the `DataLoader` doesn't actually compute any batches, so when declaring it, `batch_size` must be equal to `None`. They also use a custom sampler defined here. An example usage as shown in the last link is:

```python
dataset = K2SpeechRecognitionDataset(cuts)
sampler = SimpleCutSampler(cuts, shuffle=True)  # cuts == input data + text + potentially useful data
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for epoch in range(start_epoch, n_epochs):
    sampler.set_epoch(epoch)  # Only for shuffling purposes
    train(loader)
```

I wasn't able to find the definition of `__len__()` for that class, though. Maybe if the data is never indexed, its definition can be avoided?
**Fairseq**

For Fairseq I focused on the `SpeechToTextDataset` subclass of `FairseqDataset`. The `__getitem__()` method returns an object of the class `SpeechToTextDatasetItem`, which also has the source audio and the target text. The `collater()` method is in charge of creating the batches. I assume it's the `collate_fn` passed to the `DataLoader` (maybe through the `*EpochBatchIterator` classes), but I couldn't find any evidence of it.
The `__len__()` method simply returns the number of audios, calculated when the dataset is initialized.
In the LibriSpeech data preparation example, the data is processed and saved to a `.tsv` file, which is then loaded in the `from_tsv()` method when executing `fairseq-train`.
Fairseq also has a `FairseqIterableDataset` class which inherits from `torch.utils.data.IterableDataset`, but it doesn't seem to be used anywhere.
In general, I would trust the Fairseq developers more, esp w.r.t. how to use PyTorch in the best way. So, let's follow how they do it.
So this means that we should implement `DatasetWrapper` as inheriting from `torch.utils.data.Dataset` instead of `torch.utils.data.IterableDataset`?
I have found two short tutorials on how to handle a big enough dataset: this short tutorial from Stanford University on efficiently loading data from disk into the dataset by using a `torch.utils.data.Dataset`, and this Medium post whose comments I think have good insight. I leave them here in case they're useful in any way.
When wrapping the RETURNN dataset, I don't think any of these tutorials can be applied. You can just use the RETURNN Dataset API, nothing else.
The Torch map-based dataset unfortunately does not fit too well to RETURNN, as most RETURNN datasets do not really allow for efficient random access. However, I think `HDFDataset` actually should be fine. I think if it works with that, this is ok for now.
We can also do both `torch.utils.data.Dataset` and `torch.utils.data.IterableDataset`, and then the user can decide. I'm still sure that even for `torch.utils.data.IterableDataset`, you can do everything we need.
Remember that for the beginning, we just want to have some proof-of-concept. It's totally ok if we have static batch sizes for that.
The interface of RETURNN datasets is iterable-style. I would first wrap that, and `Dataset` instead of `IterableDataset` wouldn't really fit there. Maybe later a map-style wrapper for `HDFDataset` or something like that would be nice, though.
As Nahuel said, Fairseq does not use `IterableDataset`. I found one usage in huggingface/transformers; there they use one instance to provide sequences and then another instance which wraps the first one and does the batching: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L851
That sounds like a good idea to me.
I moved the `init_seq_order` into `DatasetWrapper.__iter__`. I think this is cleaner.
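Roughly like this (just a sketch; the exact RETURNN dataset calls and data keys may differ from the actual code):

```python
from torch.utils.data import IterableDataset


class DatasetWrapper(IterableDataset):
    def __init__(self, returnn_dataset, epoch=1):
        self._dataset = returnn_dataset
        self._epoch = epoch

    def __iter__(self):
        # Init the seq order lazily, only when iteration actually starts.
        self._dataset.init_seq_order(epoch=self._epoch)
        seq_idx = 0
        while self._dataset.is_less_than_num_seqs(seq_idx):
            self._dataset.load_seqs(seq_idx, seq_idx + 1)
            yield {key: self._dataset.get_data(seq_idx, key) for key in self._dataset.get_data_keys()}
            seq_idx += 1
```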
`python3 rnn.py demos/demo-torch.config` runs now. I.e. it constructs the net, calculates and optimizes the loss.
Many things are missing. The model is always random in each epoch. No storing/saving is implemented yet.
In this current state, I don't use `ExternData`. Probably we want to introduce this. Not sure.
But for getting a first proof-of-concept, we maybe also don't need it yet.
Current state:

- ... (`Data`/`ExternData` or whatever), this must be via PR.
- ... (`torch.nn.LSTM`) or a Conformer, just using code from ESPnet.
- For inference, we want to export to ONNX. It's a bit unclear how that looks like for the user. We might use `extern_data` to define the inputs, and have a new `model_outputs` just like `extern_data` to define the model outputs (as `Data` templates), and a `model_forward` function, which gets those inputs and returns such outputs.

> For inference, we want to export to ONNX. It's a bit unclear how that looks like for the user. We might use `extern_data` to define the inputs, and have a new `model_outputs` just like `extern_data` to define the model outputs (as `Data` templates), and a `model_forward` function, which gets those inputs and returns such outputs.
This basically means that we should also abstract `ExternData`, `Data` and `Dim`, and remove TF-specific code there and move it to a common location. -> #1165 (Done now)
Such `model_outputs` can also be useful for the TF backends, e.g. for the compile-TF-graph script, to clearly define what outputs are expected. -> #1166
I'm working on replicating a BLSTM from an already existing TensorFlow BLSTM. To replicate that experiment, in order of priority I think we'd need:
I think that with the basics that we have I should be able to kickstart a basic training, though.
A list of things that would be handy for future trainings:

- `extern_data`
- As far as I know, L2 is implemented as the `weight_decay` parameter in the optimizers. However, it looks like it also affects layers like BatchNorm: https://discuss.pytorch.org/t/weight-decay-in-the-optimizers-is-a-bad-idea-especially-with-batchnorm/16994. It doesn't need to be applied to some layers, and one can filter them out, like so: https://discuss.pytorch.org/t/weight-decay-only-for-weights-of-nn-linear-and-nn-conv/114348

Note, I was thinking a bit about how to go forward later with the PyTorch backend of RETURNN, and whether to support the net dict, or how else we should do it. I think what we have in RETURNN-common, the `nn` API, could be also a more direct API for it. See https://github.com/rwth-i6/returnn/issues/1264.
I heard good things about PyTorch-lightning. One important feature is that it easily supports multi-GPU training, where you don't really need to change anything in the user code.
As we already discussed, PyTorch for RETURNN should be modular, such that the user can easily combine it with PyTorch-lightning or other things.
However, when the user just uses the standard RETURNN training loop, we should maybe copy some of the aspects of PyTorch-lightning which allow for this easy multi-GPU training support. Someone should do some research on how this is done in PyTorch-lightning.
I implemented and pushed dev set evaluation and learning rate scheduling. I just reused the existing `LearningRateControl` code, so all different scheduling options should be supported. For now I only calculate the "score" and not the "error" of the eval sets, as that's what the `train_step` function calculates at the moment. (We could rename it if we keep using it for eval too...) And normalization of the loss could be improved: right now I just average over the steps, which is what is done in the PyTorch "Quickstart" example. Better would be to weight the steps by the total number of time steps; that's what the TF backend does...
I just noticed that another thing still missing is keeping the optimizer state. Both between epochs - currently we recreate it in every epoch - as well as saving the state to file to continue training. Will look into that on Wednesday. In general, do we want to make the kind of optimizer configurable, or just always use `AdamW` for now?
> For now I only calculate the "score" and not the "error" of the eval sets as that's what the `train_step` function calculates at the moment.

This is wrong. `train_step` only returns the total loss, only for the gradient computation. We must change the interface to support individual losses (scores, and maybe also errors), and also to be able to properly do normalization and accumulation.
> I just noticed that another thing still missing is keeping the optimizer state. Both between epochs - currently we recreate it in every epoch

Yes, this should be created only once, after the model is created.
> as well as saving the state to file to continue training.

We don't even do that for TF. (Intentionally, to keep the behavior the same as Theano...)
> In general, do we want to make the kind of optimizer configurable or just always use `AdamW` for now?

At some point it should be configurable (obviously). But if your question is whether this is now very urgent and important for this proof-of-concept, then I guess no?
So then we need to move the optimizer creation to `init_train_from_config()`. And the learning rate would be controlled by a `torch.optim.lr_scheduler.LRScheduler` instance, at least that's the standard approach. There's no built-in `LRScheduler` which is designed to set an adaptive learning rate from outside. It should be doable though with `LambdaLR`. Of course we could also use `ReduceLROnPlateau` and let PyTorch do the scheduling, but then we would lose all the fine-grained config options that we have for LR scheduling.
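For illustration, a minimal sketch of that `LambdaLR` idea, i.e. letting external code (e.g. the existing `LearningRateControl`) decide the actual value (the mutable holder here is just an assumption for the sketch):

```python
import torch

model = torch.nn.Linear(10, 2)  # dummy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)  # base lr 1.0, so the lambda value is the effective lr

current_lr = {"value": 1e-3}  # externally controlled learning rate
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: current_lr["value"])

for step in range(3):
    # ... forward/backward ...
    optimizer.step()
    current_lr["value"] = 1e-3 * 0.9 ** step  # whatever the external scheduling logic decides
    scheduler.step()  # applies base_lr * lr_lambda(step) to all param groups
```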
Writing a custom `LRScheduler` also seems simple, but I would rather stick to the standard API. Then we could for example also make the `LRScheduler` modular, i.e. user-configurable, at some point.
I think we should use just our own LR scheduler, just exactly following what we have in the TF engine.
Next steps:

- ... (`TrainCtx.mark_as_loss(name, loss, scale=1)`), save it properly in the learning rate control scores (@patrick-wilken)
- `ChunkingDataset` (PyTorch dataset) (@patrick-wilken)
- `optimizer` being a string or dict. For `AdamW`, have a variant `KarpathyAdamW` or so, see Karpathy MinGPT here, exclude params for weight decay. Also allow `optimizer` as callable to allow the user to create any custom optimizer with any custom settings.
- `dynamic_learning_rate` at some point (not so important now though).

Update (still TODO):

- ... (`TrainCtx.mark_as_loss(name, loss, scale=1)`), save it properly in the learning rate control scores (@patrick-wilken)
- `optimizer` being a string or dict. For `AdamW`, have a variant `KarpathyAdamW` or so, see Karpathy MinGPT here, exclude params for weight decay. Also allow `optimizer` as callable to allow the user to create any custom optimizer with any custom settings (@Icemole)
- `dynamic_learning_rate` at some point (not so important now though) @albertz can you clarify what this is?
- `batch_size` as defined for the TensorFlow backend also for the PyTorch data loader (@patrick-wilken)

`dynamic_learning_rate` is a function which returns a learning rate per each minibatch/step.
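For context, such a function in the config could look roughly like this (the exact signature here is an assumption, just to illustrate the idea of a per-step learning rate):

```python
def dynamic_learning_rate(*, global_train_step: int, learning_rate: float, **_kwargs) -> float:
    """Return the learning rate for this specific minibatch/step, e.g. with a linear warmup."""
    warmup_steps = 10000
    if global_train_step < warmup_steps:
        return learning_rate * (global_train_step + 1) / warmup_steps
    return learning_rate
```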
Update (still TODO):

- ... (`TrainCtx.mark_as_loss(name, loss, scale=1)`), save it properly in the learning rate control scores (@patrick-wilken)
- `optimizer` being a string or dict. For `AdamW`, have a variant `KarpathyAdamW` or so, see Karpathy MinGPT here, exclude params for weight decay. Also allow `optimizer` as callable to allow the user to create any custom optimizer with any custom settings. Mostly done, parameter exclusion needs discussion (@Icemole)
- `dynamic_learning_rate` at some point. Dummy implemented, still WIP (@Icemole)
- `batch_size` as defined for the TensorFlow backend also for the PyTorch data loader (@patrick-wilken)

I've pushed a change (https://github.com/rwth-i6/returnn/commit/4c4687fc3f0cc1ea8442c7f4225d855ee057cfc8) that allows customizing the optimizer object through the config. I tried to make it exactly the same as the TF one. I tested it quite a bit, but feel free to test it yourselves or give any feedback on the code.
This version still doesn't exclude parameters from weight decay; I will get to it now. I think it could be beneficial to consider not only `AdamW` for the parameter exclusion, but also all optimizers that have `weight_decay` as a parameter. What do you think?
@Icemole Please always follow PEP8 and other Python conventions.

```
returnn/torch/updater.py:17: WEAK WARNING Pep8CodeStyle: E302: expected 2 blank lines, found 1
returnn/torch/updater.py:33: WEAK WARNING Pep8CodeStyle: E302: expected 2 blank lines, found 1
returnn/torch/updater.py:47: WEAK WARNING Pep8CodeStyle: E127: continuation line over-indented for visual indent
returnn/torch/updater.py:66: WEAK WARNING Pep8CodeStyle: E124: closing bracket does not match visual indentation
returnn/torch/updater.py:71: WEAK WARNING Pep8CodeStyle: E302: expected 2 blank lines, found 1
returnn/torch/updater.py:83: WEAK WARNING Pep8CodeStyle: E302: expected 2 blank lines, found 1
returnn/torch/updater.py:108: WEAK WARNING PyMethodMayBeStaticInspection: Method get_current_step_learning_rate may be 'static'
returnn/torch/updater.py:188: WEAK WARNING Pep8CodeStyle: E501: line too long (124 > 120 characters)
```
For the parameter exclusion for weight decay, I had a couple of questions:

1. ... the `weight_decay` parameter, but according to the docs this parameter is actually L2 regularization, which is applied to the gradients and not to the weights. `Adadelta`, `Adagrad`, `NAdam`, even base `Adam`, etc. have this form of "weight decay", which is actually documented as L2 regularization. So, as you said, the restriction should be applied only to `AdamW`.
2. ... `Linear`, `LayerNorm` and `Embedding` modules. I thought that we could create some sort of blacklist with specific modules (`LayerNorm` and `Embedding`, according to the minGPT implementation), while accepting everything else (`Linear`, `Conv*`, etc.) (biases wouldn't be accepted though). What do you think?

If you think 2. is a good idea, is there some paper that specifies which modules not to pair with weight decay? Is it only `LayerNorm` and `Embedding`?
Karpathy's implementation can be a starting point. I think it is already generic, although it was just made complete enough to work well with GPT, but I think it's easy to extend. You don't need to reuse the exact same code, but probably it will end up being similar. Of course, any such implementation is never really complete, or perfect, but I think this is a reasonable start. He has both a blacklist and a whitelist. You can extend based e.g. on here, here. Also add LSTM.
As I see it, he actually does not have a default fallback. I'm also not really sure what the default fallback should look like. You can check if there is a `bias` attribute, and if that is a parameter, exclude it as well. Otherwise add all params.
L2 is not exactly the same as weight decay, at least not in general. We should always use weight decay. Actually no-one really does that in RETURNN yet, even though we also have that in RETURNN (called decoupled constraints, decoupled weight decay). See also here.
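To make the direction concrete, a rough sketch of such a parameter split into optimizer param groups (the whitelist/blacklist choices and the fallback here are assumptions for illustration, following the minGPT idea discussed above, not a final decision):

```python
from torch import nn


def split_params_for_weight_decay(model: nn.Module, weight_decay: float):
    """Build two param groups: weights that get (decoupled) weight decay, and params that don't."""
    whitelist = (nn.Linear, nn.Conv1d, nn.Conv2d, nn.LSTM)    # assumed: weights of these get decay
    blacklist = (nn.LayerNorm, nn.Embedding, nn.BatchNorm1d)  # assumed: excluded from decay

    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if "bias" in name or isinstance(module, blacklist):
                no_decay.append(param)  # biases and norm/embedding params: no decay
            elif isinstance(module, whitelist):
                decay.append(param)
            else:
                no_decay.append(param)  # default fallback, open for discussion
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Usage sketch:
# optimizer = torch.optim.AdamW(split_params_for_weight_decay(model, weight_decay=0.01), lr=1e-3)
```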
I have just pushed changes to support multiple losses. @Icemole, this requires changing e.g. `train_ctx.mark_as_loss(loss)` to `train_ctx.mark_as_loss(name="cross_entropy", loss=loss)`.
Let me know if something doesn't work or looks bad.
Also the chunking is now on the master. Regarding `parse_chunking()`: I did not reuse the existing function because I did not implement (yet?) all valid TF backend settings, in part because there are just so many ways to define chunking that I wanted to clean up. Most controversial is probably that I dropped the "size:step" string syntax in favour of always requiring a tuple. I can add that back if people are used to it... Also, I have not yet added support for a callable `custom_chunking_func`; not sure if this is needed. Finally, the existing function allows the chunk size to be 0 for some keys. I don't really understand why; instead you can just not specify chunking for that key. Maybe I just don't understand, but for now I changed the check so that chunk sizes must be greater than zero for all keys instead of just for one.
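For illustration, the tuple form discussed here could look like this in a config (the exact accepted shapes are my reading of this thread, not a fixed interface):

```python
# chunk size and chunk step, optionally per data key:
chunking = (
    {"data": 100, "classes": 25},  # chunk sizes (e.g. audio frames vs. label positions)
    {"data": 50, "classes": 12},   # chunk steps
)
# instead of the TF-backend string shorthand like chunking = "100:50"
```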
That is actually a more general question, as discussed in the meeting today: do we want to use the opportunity of a new backend to improve the configuration or is compatibility between the different backends more important?
Like for the optimizer options, right now Nahuel even added this, inspired by similar things for the TF backend:

```python
for opt_name, optim_class in _OptimizerClassesDict.items():
    if self.config.is_of_type(opt_name.lower(), float) or self.config.bool(opt_name.lower(), False):
```

So not only do we add strange things like `SGD = 0.5` as valid config options, but also the set of valid config options depends on inspection of the PyTorch module...
If you ask me, I would just support `optimizer = {"class": "Adam", **kwargs}` and probably also allow `class` to be a class type (I don't use that, but it seems to be needed...).
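Handling that dict form could then be as simple as something like this (a sketch, not the actual RETURNN code):

```python
import torch


def create_optimizer(model_params, learning_rate, optimizer_config):
    # optimizer_config as it could appear in the config, e.g.:
    #   optimizer = {"class": "AdamW", "weight_decay": 0.01}
    opt_kwargs = dict(optimizer_config)
    opt_class = opt_kwargs.pop("class")
    if isinstance(opt_class, str):
        opt_class = getattr(torch.optim, opt_class)  # e.g. "AdamW" -> torch.optim.AdamW
    assert issubclass(opt_class, torch.optim.Optimizer)
    return opt_class(model_params, lr=learning_rate, **opt_kwargs)
```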
> ... in part because there are just so many ways to define chunking ...

I don't understand this argument. This is exactly what `parse_chunking` would handle for you. By just calling that function, you don't need to care about it.

> Finally, the existing function allows the chunk size to be 0 for some keys. I don't really understand why, ...

I don't exactly remember the behavior. But e.g. consider the case where you have speaker IDs, or some other global info per sequence, where you don't want to apply chunking, while you want to apply chunking on the other keys. Maybe 0 does exactly that, i.e. it would always put the unchunked data for such a key.
> do we want to use the opportunity of a new backend to improve the configuration or is compatibility between the different backends more important?

This is obvious, or not? I also said that a couple of times already: Of course we should not reimplement any deprecated behavior for new code. Setting `adam = True` or `SGD = 0.5` or things like that are not even allowed anymore with the latest behavior version. Why would you want to add such code to PyTorch? That does not make sense to me. Keep the PyTorch code simple. Never add any deprecated features, and also no features which we don't really need.
@Icemole Please always check your commits.

```
returnn/torch/updater.py:171: WEAK WARNING PyUnusedLocalInspection: Local variable 'decay' value is not used
returnn/torch/updater.py:219: ERROR SyntaxError: Statement expected, found Py:COLON #loc
returnn/torch/updater.py:219: ERROR SyntaxError: Statement expected, found Py:DEDENT #loc
returnn/torch/updater.py:219: ERROR SyntaxError: Statement expected, found Py:ELSE_KEYWORD #loc
returnn/torch/updater.py:219: ERROR SyntaxError: Unexpected indent #loc
returnn/torch/updater.py:219: WEAK WARNING Pep8CodeStyle: E113: unexpected indentation
returnn/torch/updater.py:220: WEAK WARNING Pep8CodeStyle: W391: blank line at end of file
```

Also, SyntaxError means this file could not even be loaded. We probably should also add some simple test cases for PyTorch, like `test_TFEngine`, e.g. `test_PTEngine`, for higher-level tests. E.g. one test which just starts training.

Edit: Ah, I see you already fixed it.
Just implemented and committed "batch_size as defined for the TensorFlow backend".
Hope you like the design: I added another dataset wrapper that groups the iterator of sequences into batches. So the output of the dataset is now already batched, but not yet merged into a single padded array per data key. This still happens in `collate_batch()`.
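For illustration, such a collate step could look roughly like this (the function name follows the discussion above, but the exact signature and key naming here are assumptions):

```python
import torch


def collate_batch(batch):
    """Merge a list of per-sequence dicts into one padded tensor per data key (sketch)."""
    # batch: e.g. [{"data": FloatTensor[T1, F], "classes": LongTensor[T1]}, ...]
    out = {}
    for key in batch[0].keys():
        seqs = [seq[key] for seq in batch]
        out[key] = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)  # pad along time axis
        out[f"{key}:seq_len"] = torch.tensor([s.shape[0] for s in seqs], dtype=torch.int32)
    return out
```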
I've pushed the parameter splitting for `weight_decay` in the optimizer (https://github.com/rwth-i6/returnn/commit/e2c10d11c0a49b5964dfc24e94e6acc455e125aa). For it to have any effect, `decouple_constraints = True` needs to be set in the configuration file.
I'll be looking at dynamic learning rate scheduling next.
> For it to have any effect, `decouple_constraints = True` needs to be set in the configuration file.

Why? This does not really make sense to me. `decouple_constraints` is just about having the constraints decoupled from the loss and instead directly in the optimizer. This is what we anyway always have in PyTorch. So setting this should have no effect. Or maybe you can throw an error when the user explicitly sets `decouple_constraints = False` in the config.

> I'll be looking at dynamic learning rate scheduling next.

You mean `dynamic_learning_rate`? Is this really so important? Is this needed for any of the setups we want to reproduce in the upcoming time? Otherwise I would suggest ignoring this for now. We did not have this in the TF backend for a long time, because no-one ever needed it. Only recently some people started to use it.
The normal LR scheduling logic (Newbob or whatever) already works?
> Or maybe you can throw an error when the user explicitly sets `decouple_constraints = False` in the config.

Ok, I'll modify this as required.

> The normal LR scheduling logic (Newbob or whatever) already works?

Yes, LR scheduling already works.
> > do we want to use the opportunity of a new backend to improve the configuration or is compatibility between the different backends more important?
>
> This is obvious, or not? I also said that a couple of times already: Of course we should not reimplement any deprecated behavior for new code. Setting `adam = True` or `SGD = 0.5` or things like that are not even allowed anymore with the latest behavior version. Why would you want to add such code to PyTorch? That does not make sense to me. Keep the PyTorch code simple. Never add any deprecated features, and also no features which we don't really need.

I agree with this, but it is not easy to spot the difference between deprecated and intended features if you are not extremely familiar with the history of RETURNN (most configs that are around use deprecated parameters).
As I am now setting up a PyTorch demo setup for Rescale, I will help out with this when available.
@patrick-wilken I now looked at the chunking/batching implementation. I saw you implemented those transformations as datasets, deriving from `torch.utils.data.IterableDataset` (btw, `torch.IterableDataset` is wrong; please always check for correct typing). But is this the canonical way for PyTorch? This does not seem very PyTorch-like to me.
> As Nahuel said, Fairseq does not use `IterableDataset`. I found one usage in huggingface/transformers, there they use one instance to provide sequences and then another instance which wraps the first one and does the batching: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L851

I'm not sure what you mean. I don't see that there. Actually, it looks like they do it via the `DataLoader`, as I also suggested. They have a `DataCollatorWithPadding` to do the batching.
We should document a bit why we did things the way we did, and also reflect whether we did it the right way or whether we should maybe change things. In general, we should try to follow common PyTorch guidelines or other common PyTorch code and not deviate too much.
Some documentation:
Relevant for us, there are transforms and `collate_fn`. From a quick glance, transforms are transformations per sequence, so maybe not useful for chunking and batching. But both chunking and batching seem to me as if they should be implemented via `collate_fn`?
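To sketch that idea (only an illustration of where the hooks would live, not how it is actually implemented): the dataset would yield full sequences, and a custom `collate_fn` passed to the `DataLoader` would do the chunking and then merge/pad the chunks:

```python
from torch.utils.data import DataLoader


def chunk_and_collate(seqs, chunk_size=100, chunk_step=50):
    # seqs: list of full sequences as delivered by the dataset
    chunks = [s[i:i + chunk_size] for s in seqs for i in range(0, len(s), chunk_step)]
    # In practice: pad the chunks and stack them into tensors per data key.
    return chunks


# loader = DataLoader(dataset, batch_size=8, collate_fn=chunk_and_collate)
```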
**Edit:** Originally, this issue was about a proof-of-concept for a new PyTorch backend in RETURNN. This has somehow evolved into a whole new generic frontend API (original idea here: https://github.com/rwth-i6/returnn/issues/1264), which very much follows the API from RETURNN-common `nn` and which covers multiple backends. This API is accessible for the user via `returnn.frontend`, and the convention for the user to use it would be like:
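Presumably with the usual alias (as used throughout below):

```python
import returnn.frontend as rf
```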
We also made sure that our `Tensor` class (earlier called `Data`) supports any raw tensor type (the type of `Tensor.placeholder`, or `Tensor.raw_tensor` now) and is backend independent. This is all in `returnn.tensor` now.

Currently, the following backends are relevant:
- RETURNN layers net dict, internally based on `NameCtx`.
- PyTorch, where the raw tensor is `torch.Tensor`.
- TensorFlow directly, where the raw tensor is `tf.Tensor`.

The terminology on frontend/backend is sometimes used a bit inconsistently. We mean frontend or frontend API basically to describe what the user sees, what you have in the `returnn.frontend` namespace, i.e. functions like `reduce` and `dot`, or also modules like `Linear`. This API is basically very much the same as RETURNN-common `nn`.
We have an abstract base class `Backend`, which defines some core functions of the API and allows reimplementing it for different backends. The derived classes are e.g. `TorchBackend`, `ReturnnLayersBackend`, `TFBackend`. The earlier terminology was maybe a bit confusing: they implement some frontend functions for the specific backend. So sometimes we referred to this as "different frontend (implementations)".

The `Backend` class and its different backend implementations are supposed to be internal to RETURNN and not directly exposed to the user. The user has some helper functions to switch the backends.

There is also a lot of code which builds on top of the backend functions. E.g. the `rf.Module` class, modules like `rf.Linear`, would all be independent from the backend. Or also functions like `cross_entropy`.

The user in the end needs to define the following functions in the config:
- `train_step`: would only be used in training. Here the user should call `mark_as_loss` on some tensors.
- `forward_step`: would be used for recognition. Here the user should call `mark_as_output` on some tensors. See #1336 for more discussion on this, how it would be used then for beam search, forwarding, or whatever you want to do with the outputs.

To also support PyTorch modules more directly, `get_model()` can also return a `torch.nn.Module`. See https://github.com/rwth-i6/returnn/issues/1120#issuecomment-1482924842.

To add losses when using a raw `torch.nn.Module`, the API inside `train_step` would look like the sketch below. But this is intended only to be used for external code, and for our own code, we recommend the use of the RF. But in any case, it will be easy to mix pure PT and RF code together (with the PT backend).
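A sketch of how that could look (assuming the run-context API; the exact data keys and shapes here are made up):

```python
import torch
import returnn.frontend as rf


def train_step(*, model: torch.nn.Module, extern_data, **_kwargs):
    data = extern_data["data"].raw_tensor        # raw torch.Tensor
    targets = extern_data["classes"].raw_tensor
    logits = model(data)                         # e.g. [batch, time, classes]
    loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), targets.long())
    rf.get_run_ctx().mark_as_loss(name="ce", loss=loss)
```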
To get access to the step inside `train_step`, there will be sth like `rf.global_train_step()`.

Related:
Some summary of the current open discussions, items, or status updates:

- `rf.control_flow_ctx` or not? (https://github.com/rwth-i6/returnn/issues/1288)
- `check_matched_dataset` for PT? (https://github.com/rwth-i6/returnn/issues/1120#issuecomment-1476556666)

Done:

- `seq_tag` in `extern_data` (also see #1330)
- `rf.scan` for beam search (#1327)
- `rf.top_k`
- `rf.cond`, `rf.while_loop`, `rf.scan` (https://github.com/rwth-i6/returnn/issues/1282)
- `Conformer` (https://github.com/rwth-i6/returnn/issues/1120#issuecomment-1479156121)
- `rf.Conv1d`, `rf.max_pool1d`, etc.
- `rf.BatchNorm`, `rf.LayerNorm` and other normalizations
- `rf.SelfAttention`, `rf.dot_attention`
- `rf.State` as base (https://github.com/rwth-i6/returnn/issues/1120#issuecomment-1478108726)
- `rf.dropout`
- `rf.cond` (minimal initial support) (see #1282)
- `rf.get_run_ctx().train_flag`
- `rf.cross_entropy`
- `rf.Linear`, some activation functions
- ... (`get_model` etc) for PT and TF (@albertz), also #1290
- `Tensor`/`Dim`? (https://github.com/rwth-i6/returnn/issues/1120#issuecomment-1483250726) -> for now, we continue and focus on RF development, and the RF itself can be used as PyTorch extensions
- `rf.random` in tests (https://github.com/rwth-i6/returnn/issues/1283)
- (`torch.jit.script`/`torch.jit.trace`) for ONNX? Tracing (`torch.jit.trace`) should anyway always work, except it would not cover dynamic control flow. `torch.compile` for ONNX will be the future but does not work yet? Maybe also OpenXLA instead of ONNX. Any other reasonable way to get PT models running within RASR? We can also simply use the RASR-Python bridge. (https://github.com/rwth-i6/returnn/issues/1289)
- `preload_from_files` (https://github.com/rwth-i6/returnn/pull/1292)
- ... (`ReturnnDatasetIterDataPipe`), chunking, batching, using `DataLoader2`.
This issue here is also to discuss and report on implementation details of the new PyTorch backend.
The initial step would be to just get a proof-of-concept, meaning the goals currently are:
Getting it compatible to the TF backend is maybe the ultimate goal, but this is on a completely different level than the proof-of-concept discussed here.
For the dataset, we would implement a PyTorch `IterableDataset` to wrap any RETURNN dataset.