rwth-i6 / i6_experiments


How to integrate `returnn_common.nn` into Sisyphus pipelines #63

Closed JackTemaki closed 1 year ago

JackTemaki commented 2 years ago

When trying to create Returnn networks with returnn_common.nn, the following issues or questions appeared:

If returnn_common.nn were used for construction within the manager:

  • A specific RETURNN and TensorFlow dependency is added to the manager environment
  • Loading TensorFlow slows down manager startup time significantly (especially when there is a slow fs somewhere)
  • The network construction logic is quite complex, so with a rising number of network constructions this will become slow (let's say 100 experiments with 20 pre-training stages each = 2000 construction calls for a manager/console startup).
  • The returnn_common version can be "pinned" before updating by using the git-hash-based Python import call from RETURNN, but the RETURNN version itself cannot be pinned. Nevertheless the construction might be influenced by the RETURNN version.

As we will not solve the hashing issue without spending way too much time not doing experiments, I would propose going with a completely different approach than we did before, which is:

-> Create a job which takes an executable and some pre-defined parameters and does the network construction as a task, instead of in the manager.

This of course would mean that the construction result is not hashed, but only the constructor location and the parameters. To be able to find potentially unwanted changes and to be able to reconstruct the networks exactly, the job should create a local copy of the used construction code at runtime (returnn and returnn_common can of course correctly be identified by their git hash).

While this would require some manual management of the network construction to avoid logic-changing mistakes, I currently do not see any suitable alternative to this.
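To make the proposal a bit more concrete, here is a minimal sketch using the standard Sisyphus Job API; the class name NetConstructionJob, the construct entry point and the output names are hypothetical, not existing code.

# Hypothetical sketch, not existing code: only the constructor location (the
# script path) and the parameters enter the default Sisyphus hash, not the
# construction result; the job also keeps a local copy of the used code.
import importlib.util
import os
import shutil

from sisyphus import Job, Task, tk


class NetConstructionJob(Job):
    def __init__(self, construction_script: tk.Path, parameters: dict, returnn_common_root: tk.Path):
        self.construction_script = construction_script  # the "executable" with the construction code
        self.parameters = parameters                    # pre-defined construction parameters
        self.returnn_common_root = returnn_common_root  # identified via its git hash
        self.out_net_dict = self.output_path("network.py")
        self.out_code_copy = self.output_path("construction_code", directory=True)

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        # keep a local copy of the code that was actually used, for later reference
        os.makedirs(self.out_code_copy.get_path(), exist_ok=True)
        shutil.copy(self.construction_script.get_path(), self.out_code_copy.get_path())
        # load and run the construction code
        spec = importlib.util.spec_from_file_location("construction", self.construction_script.get_path())
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        net_dict_str = module.construct(**self.parameters)  # hypothetical entry point
        with open(self.out_net_dict.get_path(), "wt") as f:
            f.write(net_dict_str)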

Atticus1806 commented 2 years ago

I like this approach. A mistake in building the config which was noticed too late can then be fixed by calling tk.remove_job_and_descendants on the construction job.

albertz commented 2 years ago

A specific RETURNN and TensorFlow dependency is added to the manager environment

I don't see how this is relevant.

The RETURNN version should not have any influence on returnn_common.nn at all (as long as it is new enough; otherwise it would simply crash with import errors, e.g. when Dim is not available).

Also the TensorFlow version should not have any influence on returnn_common.nn.

Loading TensorFlow slows down manager startup time significantly (especially when there is a slow fs somewhere)

We already discussed that here: https://github.com/rwth-i6/returnn_common/issues/27

returnn_common.nn itself really uses the nesting logic from TF, which we could also replace.

returnn_common.nn also uses some data structures from RETURNN like Data and Dim. In principle, they could also be independent from TF. But this should be done on RETURNN side.

So, it's possible, although this would require some work.

But also, it sounds a bit like an artificial problem, which we could just directly solve (just make the TF import fast), instead of putting lots of effort into working around it.

We also already discussed how to solve this directly, i.e. how to make TF import fast. There are many solutions to that.
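For illustration only (this is not what was decided in rwth-i6/returnn_common#27): one possible direction would be to defer the actual TensorFlow import until first use, e.g. with the lazy-loader recipe from the importlib documentation, so that the manager does not pay the full import cost at startup.

# Illustration of one possible direction: lazy module loading, following the
# importlib documentation recipe. The real fix discussed for returnn_common
# may look different.
import importlib.util
import sys


def lazy_import(name: str):
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # execution is deferred until the first attribute access
    return module


tf = lazy_import("tensorflow")  # manager startup no longer blocks on the full TF import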

The network construction logic is quite complex, so with a rising number of network constructions this will become slow (let's say 100 experiments with 20 pre-training stages each = 2000 construction calls for a manager/console startup).

I don't think the net construction really adds any noticeable overhead in runtime. This should not be the case. It should be just as fast as before. If this is not the case, we can easily fix that. There is no reason why this should not be fast.

I also disagree that it is complex but this is maybe subjective. Anyway, complexity does not imply slowness.

Do you have any numbers on that? It really should not be slow. Not even with 2000 calls, or even with 1M calls. If you think this is slow, again, we can fix this. Please open an issue on this with some numbers.

The returnn_common version can be "pinned" before updating by using the git-hash-based Python import call from RETURNN, but the RETURNN version itself cannot be pinned. Nevertheless the construction might be influenced by the RETURNN version.

No, it cannot be influenced by the RETURNN version.


Anyway, to your main proposal here: What I don't understand is: When you now have the construction code somewhere else (where exactly?), why do you need it at all as a separate job or task? Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

To be able to see the net dict more directly? So you make this single thing a bit easier (it would also be easy to dump it on-the-fly if you want to see it) while making the whole setup overall much more complicated?

JackTemaki commented 2 years ago

Setting the other comments aside (because we do not need to fix this if we can not work like this anyway).

I don't think the net construction really adds any noticeable overhead in runtime. This should not be the case. It should be just as fast as before. If this is not the case, we can easily fix that. There is no reason why this should not be fast.

I also disagree that it is complex but this is maybe subjective. Anyway, complexity does not imply slowness.

What "before"? There is no "before". We constructed the network dicts, which is merging some pre-defined dictionaries together and of course much faster than the new construction logic.

I just did the timing on one Transformer construction and it is definitely too slow (not in general, but to do it in the manager). Just one single construction was 2.3 seconds.

When you now have the construction code somewhere else

I am not sure how you read this out of my proposal, the location does not change at all.

Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

This is one way of doing it, using it with the `ReturnnTrainFromFileJob`, which of course would make all of the mentioned issues non-existent. But this means you have limited interaction with Sisyphus, and your own defined modules always have to be copy&pasted in there (unless they are part of returnn_common) and you can have no additional helpers. But this is in a way saying "do not use Sisyphus for anything Returnn config related". Which is a valid approach, but not what should be discussed. So referring to the title of this issue your answer is "don't do it".

albertz commented 2 years ago

What "before"? There is no "before". We constructed the network dicts, which is merging some pre-defined dictionaries together and of course much faster than the new construction logic.

Before = before you used returnn_common.nn. You have used Sisyphus before or not? You have created dictionaries before or not?

No, this is what I'm saying: It should not make any noticeable difference. Why should it? If there is really any noticeable slowdown, we can fix this. There is really no reason why there should be any noticeable slowdown due to the new construction. Even if you think it is complex.

I just did the timing on one Transformer construction and it is definitely too slow (not in general, but to do it in the manager). Just one single construction was 2.3 seconds.

What exactly did you measure? Did you maybe count the TensorFlow import? The construction itself really should not be slow.

When you now have the construction code somewhere else

I am not sure how you read this out of my proposal, the location does not change at all.

Then I don't understand your proposal. You proposed to do the construction in a job? So that means the location is now somewhere different, i.e. in the job. If you don't mean that, what do you mean then?

Why not directly have this code in the RETURNN config and run it on-the-fly when RETURNN starts? What is the benefit to separate this?

This is one way of doing it, using it with the `ReturnnTrainFromFileJob`, which of course would make all of the mentioned issues non-existent. But this means you have limited interaction with Sisyphus,

I don't see the difference to having the construction in a job, what you proposed here. What exactly is the difference?

and your own defined modules always have to be copy&pasted in there (unless they are part of returnn_common)

No, of course you don't need to copy&paste anything. Why? I don't understand.

Also, again, I don't understand how this is different to your proposal here.

and you can have no additional helpers.

Sure you can have. Why not?

And again I don't understand the difference to your proposal here.

But this is in a way saying "do not use Sisyphus for anything Returnn config related". Which is a valid approach, but not what should be discussed. So referring to the title of this issue your answer is "don't do it".

No. At the moment I just try to understand how your proposal really has any benefit, what you are actually proposing, and how that is different to what I'm saying.

JackTemaki commented 2 years ago

It seems this topic is too complicated to be discussed in an issue here, as we are lacking some common ground, we should take this offline.

In short:

Why should it?

Because executing hundreds of lines of code is certainly slower than merging dicts.

What exactly did you measure?

    with nn.NameCtx.new_root() as name_ctx_network:
        net = BLSTMDownsamplingTransformerASR(...)
        out = net(...)
        out.mark_as_default_output()
        for param in net.parameters():
            param.weight_decay = 0.1
        serializer = nn.ReturnnConfigSerializer(name_ctx_network)
        base_string = serializer.get_base_extern_data_py_code_str()
        network_string = serializer.get_ext_net_dict_py_code_str(net, ref_base_dims_via_global_config=True)

and you can have no additional helpers.

Sure you can have. Why not?

You can not import from the recipes in a returnn config file. A job (as in: a generated job folder in work dir) has to be independent of any outside influence that is not part of the inputs.

So that means the location is now somewhere different, i.e. in the job

Just because the construction is called in a job does not mean that the code which does the construction is in a different location (i.e. physically, as a file).

albertz commented 2 years ago

It seems this topic is too complicated to be discussed in an issue here,

I don't see why this issue here is complicated.

as we are lacking some common ground, we should take this offline.

Sure, we can do that, but it would be good to have it in written form anyway here such that others can also join the discussion or understand decisions later on. So it would be anyway good if you could answer my questions.

I probably will not be able to come to the office today in person. But probably tomorrow.

In short:

Why should it?

Because executing hundreds of lines of code is certainly slower than merging dicts.

But this is what I mean about noticeable. Hundreds of lines of code should still execute in the nano-seconds range. This is still very fast. Unless sth is wrong.

What exactly did you measure?

    with nn.NameCtx.new_root() as name_ctx_network:
        net = BLSTMDownsamplingTransformerASR(...)
        out = net(...)
        out.mark_as_default_output()
        for param in net.parameters():
            param.weight_decay = 0.1
        serializer = nn.ReturnnConfigSerializer(name_ctx_network)
        base_string = serializer.get_base_extern_data_py_code_str()
        network_string = serializer.get_ext_net_dict_py_code_str(net, ref_base_dims_via_global_config=True)

This might involve imports.

Otherwise, I don't see why this should not be fast. If there is anything slow, we can certainly fix this. Some profiling would be good.
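For reference, a rough sketch of how the construction alone could be timed and profiled, keeping the one-time imports out of the measurement; construct_network is a placeholder for the user's construction code, not an existing function.

# Rough timing/profiling sketch; construct_network(...) stands in for the
# actual construction code from the snippet above.
import cProfile
import time

from returnn_common import nn  # imported once, outside the measurement


def build():
    with nn.NameCtx.new_root() as name_ctx:
        ...  # construct_network(...), mark outputs, serialize, as in the snippet above


build()  # warm-up call, so module imports and caches do not distort the numbers

t0 = time.perf_counter()
for _ in range(10):
    build()
print(f"avg construction time: {(time.perf_counter() - t0) / 10:.3f}s")

cProfile.run("build()", sort="cumtime")  # shows where the time is actually spent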

and you can have no additional helpers.

Sure you can have. Why not?

You can not import from the recipes in a returnn config file. A job (as in: a generated job folder in work dir) has to be independent of any outside influence that is not part of the inputs.

I don't understand. Sure you can import from the recipes, or from anywhere else. What stops you from doing that?

Sure, a job is influenced by many other things, like the content of the filesystem, Sisyphus, Python, the env, the Python env, etc. The recipes can just be part of the Python env.

I don't really understand your argument here. You say you cannot have helpers because of some artificial constraint you are making up that there should be no shared common code, i.e. there should be no helpers? So you say you cannot have helpers because there must not be helpers. But why? This does not make sense to me.

I also still don't understand how the situation is different to what you propose. You propose to do the net construction inside a job, or not? So where does the code for the net construction come from?

So that means the location is now somewhere different, i.e. in the job

Just because the construction is called in a job does not mean that the code which does the construction is in a different location (i.e. physically, as a file).

Yes, exactly as I'm saying. You can do it as you want. It does not matter. Although I'm not really sure if you propose to change that or not, or where exactly you would want the code to be. What do you actually mean when you write "constructor location"?

What changes is where the code is executed, e.g. inside a job or in Sisyphus. E.g. inside the net construction job or in the ReturnnTrainJob.

And again, how does the situation change when this is inside a new custom net construction job, or just directly in the ReturnnTrainJob? I don't see how there is any difference.

JackTemaki commented 2 years ago

This might involve imports.

I called this repeatedly to make sure this is not the case.

Sure, a job is influenced by many other things, like the content of the filesystem, Sisyphus, Python, the env, the Python env, etc.

All accessed content should be local to the job folder or come from the input folders. Sure, if you manually edit those it will break, but this is not something commonly done, so the first point can be excluded. Functions from Sisyphus are usually not (and should not be) used inside job tasks, except for path resolving. Python: yes, Python can have an influence, but people usually do not change their major Python version while working with the same setup folder. The env: yes, the env can change, but optimally (this is unfortunately not the case yet) you have a job for the env, and then this is not an issue.

So all the things you listed are not something you alter on a daily basis. Your personal recipe code is edited on a daily basis, as this is what you work with. Importing that externally is a very bad idea.

that there should be no shared common code,

I did not say that. This is what returnn_common is for, or not? And this can be safely imported, as it would come from a git-clone job. But you should not access your own recipe code. Of course you could include that in your Python path; I am just saying this is not a good idea, as if you change something, running the same job again will result in something different.

I don't see how there is any difference.

The difference is that this job would need a different kind of parameter passing, it would need fixed parameters for returnn and returnn_common, and it would need to make a local copy of the files that were used to build the network.

albertz commented 2 years ago

I think I don't really understand what code you are talking about when you say net construction code, and where you would have this code. I was not talking about returnn_common itself. I was referring to the code where you actually put it together to construct the network. Or other helper code you use. Anything basically by the user. And when you do research on modeling aspects, this is what you would constantly change.

In any case, where-ever this code is located (it does not matter), what difference does it make whether it is executed in NetConstructionJob or directly in ReturnnTrainJob? I don't see how there is any difference. What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

JackTemaki commented 2 years ago

What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

I do not want to alter the ReturnnTrainingJob that much (better not at all). This job can not:

  • specify the import paths for where the construction code is located
  • manage the returnn_common dependency
  • create a local copy of the construction code for later reference

But also I do not fully know how a NetConstructionJob would do that.

Anyway, this is less about if this is a separate NetConstructionJob or somehow integrated in ReturnnTrainingJob or ReturnnTrainingFromFileJob, as the problems that need to be solved are similar. But more about if we go for "within" Sisyphus net construction (which we originally wanted to have) or external net construction. And from what I see a complete "within" Sisyphus net construction is not realistically feasible, so the question is how to get as close as possible to that.

albertz commented 2 years ago

What exactly is the advantage to separate them? Why not directly do the net construction in the ReturnnTrainJob? I don't understand this.

I do not want to alter the ReturnnTrainingJob that much (better not at all).

Why not? This could simply be an extension (if any change/extension is really needed, but see below).

So, just because you don't want to alter ReturnnTrainingJob, you want to have a new separate job? This is the reason here? This does not make any sense to me. Why?

But regardless, I don't think any change or extension is needed (see below).

This job can not:

  • specify the import paths for where the construction code is located
  • manage the returnn_common dependency
  • create a local copy of the construction code for later reference

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.

Also, I still don't know where you actually would have your construction code located. Inside recipes? Outside recipes? Where exactly?

But also I do not fully know how a NetConstructionJob would do that.

I don't understand. In this issue here, you wrote:

Create a job which takes an executable and some pre-defined parameters and does the network construction as a task, instead of in the manager.

So, I just gave this job a name, and called it NetConstructionJob.

So, what do you mean? Do you propose to add such job or not?

Or what is the problem? Or you say you don't know yet how to implement it?

Anyway, this is less about if this is a separate NetConstructionJob or somehow integrated in ReturnnTrainingJob or ReturnnTrainingFromFileJob, as the problems that need to be solved are similar.

Right. That is what I am saying all the time. Actually not just similar but exactly the same. That is why I don't understand why you want to have a separate job. I don't understand the advantage. I keep asking about this but so far you have not answered why it should be or must be a separate job, or why it can't be together in the ReturnnTrainingJob.

But more about if we go for "within" Sisyphus net construction (which we originally wanted to have) or external net construction. And from what I see a complete "within" Sisyphus net construction is not realistically feasible, so the question is how to get as close as possible to that.

Ah ok. I thought this issue is not about discussing that, but about proposing actually to have it external.

On the aspect whether the complete within Sisyphus net construction is feasible: Again, I don't see why not. Whatever you think is slow can (and should) be fixed. There is no reason why it should be slow.

On the question of how to do it externally (no matter if inside ReturnnTrainingJob or in a separate job): I still don't understand how you actually propose it. Where exactly would you have the code of the net construction? How does this code get into the job?

JackTemaki commented 2 years ago

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.

So you suggest extending the ReturnnConfig code to cover that logic? This is also a possibility, but then do not say "wrong" as in "the job can not do that". It can not do that right now in a somewhat understandable, user-friendly way. Of course we can change any code to perform any logic we want. If you are talking about just passing this in prolog or epilog, this surely is not a good idea, because then you need to work with DelayedFormat, which adds unnecessary complexity, and also it is not obvious from the interface how you should work with it.

I keep asking about this but so far you have not answered why it should be or must be a separate job, or why it can't be together in the ReturnnTrainingJob

It is adding code and parameters to the job that from my point of view do not belong in there. If there is an extra job the interface for the Job and the ReturnnConfig needs nearly no changes (only to accept one more type, which is Path).

So, what do you mean? Do you propose to add such job or not? Or what is the problem? Or you say you don't know yet how to implement it?

I am raising the possibility this could be a solution (I am not sure myself). There is no problem, I just do not see any reason yet to start implementing this as long as there is no full concept to have in mind.

There is no reason why it should be slow.

I think there is, but let's exclude that here. This is also only one aspect.

Where exactly would you have the code of the net construction?

Somewhere under i6_experiments.users.rossenbach

How does this code get into the job?

One possibility of many would be providing a Path object to the package of all construction modules, and another Path object pointing to a Python executable which accepts all dynamic parameters and contains the actual construction.

albertz commented 2 years ago

Wrong. Sure you can do all of that. You can simply put it into the ReturnnConfig object, in whatever way you like.

So you suggest to extend the ReturnnConfig code to cover that logic?

No, you also don't need an extension to ReturnnConfig. You can put in what you want.

This is also a possibility, but then do not say wrong as in "The job can not do that".

I don't understand. I did not say the job can not do that. You said that, and I said this is wrong, because the job can do that, as I described.

It can not do that right now in a somewhat understandable, user-friendly way. Of course we can change any code to perform any logic we want. If you are talking about just passing this in prolog or epilog, this surely is not a good idea, because then you need to work with DelayedFormat, which adds unnecessary complexity, and also it is not obvious from the interface how you should work with it.

About understandable/user-friendly: Sure, but this can easily be solved.

It is adding code and parameters to the job that from my point of view do not belong in there. If there is an extra job the interface for the Job and the ReturnnConfig needs nearly no changes (only to accept one more type, which is Path).

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

So, what do you mean? Do you propose to add such job or not? Or what is the problem? Or you say you don't know yet how to implement it?

I am raising the possibility this could be a solution (I am not sure myself). There is no problem, I just do not see any reason yet to start implementing this as long as there is no full concept to have in mind.

Ok. As said, this was not clear from the issue description to me. It sounds like you were proposing this specific solution.

There is no reason why it should be slow.

I think there is, but let's exclude that here. This is also only one aspect.

I don't understand. I thought this is the one and only reason why you need this at all? Assume that this can be fixed. Then the whole discussion here in this issue is obsolete or not? As I understand you, you would anyway even prefer this?

Where exactly would you have the code of the net construction?

Somewhere under i6_experiments.users.rossenbach

So, inside the recipes. So it means the job accesses the recipes. But this contradicts what you wrote earlier that you don't want that?

How does this code get into the job?

One possibility of many would be providing a Path object to the package of all construction modules, and another Path object pointing to a Python executable which accepts all dynamic parameters and contains the actual construction.

I don't exactly understand. Can you be more specific? Maybe give an example? Python executable, you mean /usr/bin/python3? I don't really understand how the net construction code gets in there. A Path to i6_experiments.users.rossenbach?

JackTemaki commented 2 years ago

We can leave the rest for now, it does not make sense to discuss, but:

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

If there is a solution that would need no code changes at all this would be a good starting point to do any experiments. So please tell me how this should work.

albertz commented 2 years ago

We can leave the rest for now, it does not make sense to discuss

I don't understand. Surely all the rest is very relevant here as well, and very important to discuss? Esp, most importantly, maybe this whole issue here is actually obsolete, as I explained? And I thought this is the main question this issue is actually about, as you explained?

There is no change needed to ReturnnTrainingJob nor ReturnnConfig.

If there is a solution that would need no code changes at all this would be a good starting point to do any experiments. So please tell me how this should work.

But I thought first the question is if this is needed at all, and a within Sisyphus construction is not possible? Now you seem to assume again that it is needed.

There are many trivial ways how you can get whatever data/code with whatever dependencies into a ReturnnConfig. You should know that. I wonder a bit what the problem is.

But the more relevant question is, which I keep asking, which is still not clear to me: Where exactly is the net construction code? And what do you want to pass exactly to the job? The code itself (a copy of it or so?), or a Path to it?

Maybe you can just give an example to that? Ignore the part how exactly it ends up in the ReturnnTrainingJob. Just show the part where you have the net construction code, and what exactly would be the input of the job (code itself, or Path, or whatever).

tbscode commented 2 years ago

Haven't read the whole discussion, but I did try to use returnn_common.nn with my setup and experienced similar problems to the ones @JackTemaki mentioned here.

What I did was:

1. Test the network construction via tests.returnn_helpers.config_net_dict_via_serialized and engine.init_train_from_config
2. Then obtain the construction code via inspect.getsource(...)
3. Add the code to a ReturnnConfig and use that with a ReturnnTrainingJob

Steps 1 & 2 could be done by some sort of net construction job. This could output a ReturnnConfig that contains the construction code (which is already tested).
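A hedged sketch of what steps 2 and 3 could look like; the construct_network function, its module, and the epilog line that calls it are assumptions for illustration, not an existing interface.

# Sketch of steps 2 & 3; the names below (construct_network and its module)
# are hypothetical and only serve as illustration.
import inspect

from i6_core.returnn import ReturnnConfig, ReturnnTrainingJob  # import path assumed
from my_recipes.networks import construct_network  # hypothetical, already tested in step 1

# step 2: obtain the construction code as a plain string
construction_code = inspect.getsource(construct_network)

# step 3: put it into a ReturnnConfig and train with it
config = ReturnnConfig(..., python_epilog=[construction_code, "network = construct_network()"])
train_job = ReturnnTrainingJob(config, ...)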

I like the idea of a net construction job; in my view this would solve the following problems:

In this case the output of the net construction job could either be a RETURNN config file (so a Path), or a ReturnnConfig (so a Variable, which might be preferable to use with ReturnnTrainingJob). The input would then be the construction code.

I am not using returnn_common in my setup currently (due to some time constraints and not being too familiar with it). Just leaving this here as a comment.

albertz commented 2 years ago

Note from personal discussion with @JackTemaki:

We came to the conclusion that such net construction job is indeed not needed and we can directly do it in the ReturnnTrainingJob. Running the returnn-common (RC) code would not happen in the Sisyphus manager but via Delayed directly in the ReturnnTrainingJob. (I still don't fully understand why it was not clear that this is possible, or can be made possible, so why we had the suggestion to make a separate job for this.)

The RC code with the model definition would be inside the users recipe dir, i.e. somewhere in i6_experiments.users.

There was the suggestion to automatically copy the model definition file into the job work dir locally, such that any modifications on it afterwards do not interfere with the job. But it's not clear whether this is a good idea, or whether this is even so simple, because the model definition might not be a single file but might also access other files from the recipes.

What remains are many small details. E.g. as said, we could pass a Delayed to the ReturnnTrainingJob. More specifically, to the ReturnnConfig, to python_epilog. The Delayed would contain another object, sth like ReturnnCommonModelSerializer or so. This could define its own custom hash, or use the default Sisyphus hashing. The ReturnnTrainingJob hash depends also on this object. It would roughly look like:

from typing import Any, Dict, Optional


class ReturnnCommonModelSerializer:
  def __init__(self, module_name: str, *, func_name: str = "main", func_kwargs: Optional[Dict[str, Any]] = None):
    self.module_name = module_name
    self.func_name = func_name
    self.func_kwargs = func_kwargs or {}

  def get(self) -> str:
    """
    This is called by Delayed.get.
    Imports the module, calls the given function and serializes the resulting net.
    """
    from returnn_common import nn
    import importlib
    module = importlib.import_module(self.module_name)
    func = getattr(module, self.func_name)
    root_module = func(**self.func_kwargs)
    return nn.get_returnn_config().get_complete_py_code_str(root_module)

  def py_code_direct(self) -> str:
    # Emits code which does the construction on-the-fly inside the RETURNN config.
    return "\n".join([
      "from returnn_common import nn",
      f"import {self.module_name}",
      f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})",
      "py_code = nn.get_returnn_config().get_complete_py_code_str(root_module)",
      "exec(py_code, globals())",  # the generated code contains statements, so exec rather than eval
      ""
    ])

This is a very simplistic draft. It ignores pretraining for now. Also, maybe we should separate extern data from model definition logic more anyway? Extern data might also partially be auto-generated from the dataset?

In this example, the hash would be defined via module_name, func_name, func_kwargs.

The usage in Sis would maybe look like this:

model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[Delayed(model_def), ...])
train_job = ReturnnTrainingJob(config, ...)

Or if we don't want Delayed but really on-the-fly execution, it could look like:

model_def = ReturnnCommonModelSerializer("i6_experiments.users.zeyer.model.my_best_model_123")
config = ReturnnConfig(..., python_epilog=[model_def.py_code_direct(), ...])
train_job = ReturnnTrainingJob(config, ...)

This example would not copy the file. In the first case with Delayed, it would still explicitly create and write the net dict to the config. This is run in the create_files task, before the run task. In the second case with py_code_direct, it would directly run the code on-the-fly in the run task.

Again, as said, many details. It also depends on what your work-flow should look like, e.g. during debugging. Currently (before RC) you could just go to the work dir and run rnn.sh manually, after maybe editing some of the files. I think we still want this easy work-flow. But with RC, you likely would not edit the net dict. So having this separate task (or even a separate job) to create the net dict makes this work-flow more complicated. In the second case with py_code_direct, you could directly edit the RC model def code.

@tbscode:

2 ) Then obtain the construction code via inspect.getsource(...)

I don't understand what you need this for. You already constructed the code, so you already have it, so why do you need to obtain it so indirectly again?

JackTemaki commented 2 years ago

@tbscode As I currently have other deadlines I postponed working on this, but the returnn_common integration will be available in 3-4 weeks I guess (after some more extensive testing by @Atticus1806 maybe).

I already have partial code / concept on how to extend the above mentioned idea by @albertz to also support pre-training and using tk.Variables or tk.Paths as flexible parameters. Also note that Delayed is not fully supported as input to python_epilog yet, see https://github.com/rwth-i6/i6_core/pull/264

tbscode commented 2 years ago

Ok, I see. Yes, ReturnnCommonModelSerializer also sounds like a good solution; I wasn't aware that Delayed existed (or anything of that sort).

I don't understand what you need this for. You already constructed the code, so you already have it, so why do you need to obtain it so indirectly again?

It was convenient to use getsource since I could have the whole network definition (with dim tags and co) in one function, which I would pass to a test_net_create_config function that would construct the model and output the ReturnnConfig. (Also, I prefer this method over having the code in a string or an external file, so I get syntax checking in the same script.)

As I currently have other deadlines I postponed working on this, but the returnn_common integration will be available in 3-4 weeks I guess (after some more extensive testing by @Atticus1806 maybe).

I will not use returnn_common in my current experiments either, mainly due to time constraints. Surely looking forward to using it in a future setup though.

albertz commented 2 years ago

I thought a bit further. Some loose thoughts:

I want to decouple things more, like the extern_data, the model def and the loss, so that I can combine them individually. This is to reduce the number of experiment files. Each of these would be specified via sth like ReturnnCommonModelSerializer, as some module_name: str, which is usually sth like i6_experiments.users.zeyer.exp.... And then maybe additionally some function or class or so, e.g. func_name: str, func_kwargs: Dict[str, Any]. The tuple (module_name, func_name, func_kwargs) would define the hash.

More specifically:

I definitely want to decouple the extern_data definition (and its related dim tags) from the rest. Maybe also the input and target separate.

The model definition should also be decoupled from the extern_data and also from the loss definition (partly, optionally, maybe).

I want to have a common model API for hybrid HMM and CTC-like setups, sth like:


class Model(nn.Module):
  def __init__(self, out_dim: nn.Dim, in_dim: nn.Dim, **kwargs):
    ...

  @nn.scoped
  def __call__(self, x: nn.Tensor, *, in_spatial_dim: nn.Dim) -> Tuple[nn.Tensor, nn.Dim]:
    ...
    return y, out_spatial_dim

Or actually maybe just ISeqFramewiseEncoder and ISeqDownsamplingEncoder.

I would expect that logits come out of this, and out_dim would be the number of labels (maybe including blank or not).

Then separately I would have the loss definition. This could look like:

# this is elsewhere:
model = Model(out_dim=output_dim + 1, in_dim=input_dim)  # +1 for blank
inputs = nn.get_extern_data(nn.Data("data", dim_tags=[nn.batch_dim, time_dim, input_dim]))
logits, out_spatial_dim = model(inputs, in_spatial_dim=time_dim)
targets = nn.get_extern_data(nn.Data("classes", dim_tags=[nn.batch_dim, targets_time_dim], sparse_dim=output_dim))

# loss:
loss = nn.ctc_loss(logits=logits, targets=targets)
loss.mark_as_loss()
decoded, decoded_spatial_dim = nn.ctc_greedy_decode(logits, in_spatial_dim=out_spatial_dim)
error = nn.edit_distance(a=decoded, a_spatial_dim=decoded_spatial_dim, b=targets, b_spatial_dim=targets_time_dim)
error.mark_as_loss(as_error=True, custom_inv_norm_factor=nn.length(targets, axis=targets_time_dim))

The model (network) def maybe could be split further, e.g. by having some preprocessing like SpecAugment.

All that would go into the python_epilog. Additionally with some boilerplate code in between, which is not supposed to be hashed. And I would explicitly set python_epilog_hash.

Maybe I would introduce specifically:

class NonhashedCode:
  def __init__(self, code: str):
    self.code = code

And then it could look like:

# Define some training exp:
extern_data = ReturnnCommonModelSerializer(...)
model = ReturnnCommonModelSerializer(...)
boilerplate1 = NonhashedCode(...)
loss = ReturnnCommonModelSerializer(...)
boilerplate2 = NonhashedCode(...)
epilog = [extern_data, model, boilerplate1, loss, boilerplate2]

# common function:
def train(epilog, version=1):
  epilog_hash = (version,) + tuple(obj for obj in epilog if not isinstance(obj, NonhashedCode))
  epilog = [obj.py_code_direct() if isinstance(obj, ReturnnCommonModelSerializer) else obj for obj in epilog]
  epilog = [obj.code if isinstance(obj, NonhashedCode) else obj for obj in epilog]

  config = ReturnnConfig(
    ...,
    python_epilog=epilog, python_epilog_hash=epilog_hash,
    ...)
  job = ReturnnTrainingJob(config, ...)
  ...

There are still some further details to be clarified, or maybe slightly fixed in the suggestions above.

JackTemaki commented 2 years ago

This is not too far away from my current code, just some comments:

The data handling obviously needs to be separated from the model construction, but be aware that calling py_code_direct() is not possible in this case, as this breaks data and e.g. label-size dependencies. ReturnnCommonModelSerializer should correctly resolve variables in the task, not outside of it (resolving them outside invalidates the fundamental concept of Sisyphus that all connections between jobs are passed as-is). The problem is that current Sisyphus still allows you to do that; you will just get broken graphs (unfortunately many old setups rely on that because no one ever checked this). There is DELAYED_CHECK_FOR_WORKER, which can be set to enforce correct graphs, but this is unfortunately not fully supported yet.

The manual handling of epilog/epilog_hash is also not needed, this should be done automatically by making ReturnnCommonModelSerializer and NonhashedCode define their own hashes correctly.

So basically the code can be similar to what you wrote, just the three lines after def train are not required. The "version" can (maybe even should, so that it is visible in the job) be written into the dict of the ReturnnConfig.
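To illustrate what "define their own hashes correctly" could mean: a minimal sketch using the Sisyphus convention that an object can provide _sis_hash(); the sis_hash_helper import path and the class bodies follow the drafts above and are assumptions, not an agreed interface.

# Minimal sketch of objects that carry their own hash, assuming the Sisyphus
# _sis_hash() convention; not an agreed-upon interface.
from typing import Any, Dict, Optional

from sisyphus.hash import sis_hash_helper  # import path assumed


class NonhashedCode:
    def __init__(self, code: str):
        self.code = code

    def _sis_hash(self) -> bytes:
        # constant hash: editing the boilerplate code never changes the job hash
        return b"NonhashedCode"


class ReturnnCommonModelSerializer:
    def __init__(self, module_name: str, *, func_name: str = "main", func_kwargs: Optional[Dict[str, Any]] = None):
        self.module_name = module_name
        self.func_name = func_name
        self.func_kwargs = func_kwargs or {}

    def _sis_hash(self) -> bytes:
        # hash only the explicitly chosen tuple, independent of the generated code
        return sis_hash_helper((self.module_name, self.func_name, self.func_kwargs))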

albertz commented 2 years ago

The data handling obviously needs to be separated from the model construction, but be aware that calling py_code_direct() is not possible in this case, as this breaks data and e.g. label-size dependencies. ReturnnCommonModelSerializer should correctly resolve variables in the task, not outside of it (resolving them outside invalidates the fundamental concept of Sisyphus that all connections between jobs are passed as-is).

Right. But this would only be for the extern data part.

The manual handling of epilog/epilog_hash is also not needed, this should be done automatically by making ReturnnCommonModelSerializer and NonhashedCode define their own hashes correctly. So basically the code can be similar to what you wrote, just the three lines after def train are not required.

It depends how you pass it. When you pass it as Delayed, then yes, you could just define it. When you use py_code_direct, then you need a custom hash. In my case, I would not use Delayed as I described above because I want the simple debugging workflow as I explained. I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns. But this is then two times an indirection, and not sure if this is so straightforward.

The "version" can (maybe better should so that it is visible in the job) be written into the dict of the ReturnnConfig.

This was intended as a convention to be used specifically for the boilerplate code in epilog. Whenever I would change the boilerplate code in this example, which is intentionally not hashed, in some way which potentially could change the behavior, I would increase the version, to get a new hash.

Btw, according to @michelwi in https://github.com/rwth-i6/i6_core/pull/266#discussion_r871056566, I anyway could not set python_epilog_hash this way because it must be a str?

JackTemaki commented 2 years ago

I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns

This is the only valid solution I think, otherwise this will be a little bit messy, to not treat all "objects" passed to epilog the same. At least I would not understand why some things have to be converted outside to a string and others can be passed as is.

I anyway could not set python_epilog_hash this way because it must be a str?

While this is certainly not the way it was intended, this is fine to change.

I will spend Monday with @Atticus1806 to fully build and test a pipeline that covers all your and our needs, which are:

Optionally we can look at:

So far I do not see any conflict in having a solution that works for all. Well, except maybe for the network hashing part but I think this is something we can live with not having for now, everyone just has to be careful with his/her version parameter or renaming the model file to some ..._v2 name in case everything should re-run.

@albertz please intervene here if something is unclear or wrong in your eyes.

The only thing that is not fully clear is how some of the more independent dataset/datastream helpers are going to be integrated, but this is only about personal structure/style preference and not about any technical limitation.

albertz commented 2 years ago

I could go with yet another option: Use Delayed but this would not construct the net dict but print out what py_code_direct returns

This is the only valid solution I think, otherwise this will be a little bit messy, to not treat all "objects" passed to epilog the same. At least I would not understand why some things have to be converted outside to a string and others can be passed as is.

Well, there is at least the difference that some of these are "boilerplate" which should not influence the hash in any way, by intention, and then the others which influence the hash, but only in the explicitly defined way, e.g. via an explicit module_name etc.

  • separate the construction of extern_data for more flexibility and to allow syncing it directly with the code for constructing the datasets

It's still not totally clear to me how you map this to the model inputs and the loss.

Maybe we would have the standard case of just data and classes and anything more custom would then also need more custom treatment.

  • Simplistic helper objects that are based on Delayed and passed to prolog/epilog, which can be extended and control both the hashing and the correct placement of code in the final returnn.config (boilerplate, extra specaug code, random stuff).

I want the boilerplate behavior to be just the same as if it would not be there. I want to be able to add further such objects into it without any change in the hash. So this means this is different to a boilerplate object where _sis_hash returns None or so.

So far I do not see any conflict in having a solution that works for all. Well, except maybe for the network hashing part but I think this is something we can live with not having for now, everyone just has to be careful with his/her version parameter or renaming the model file to some ..._v2 name in case everything should re-run.

Sure. We should put any such relevant helpers then maybe to i6_experiments/common.

JackTemaki commented 2 years ago

We now have something (partially) working, but I am quite sure that in some details this does not meet exactly what you imagined, so we should continue the discussion. My current get_config looks like this:

def get_config(
        returnn_common_root,
        training_datasets,
        **kwargs):
    """
    :param tk.Path returnn_common_root:
    :param TrainingDatasets training_datasets:
    """

    # changing these does not change the hash
    post_config = {
        [....],
        'debug_print_layer_output_template': True,
    }
    config = {
        [...]
        'optimizer': {'class': 'Adam', 'epsilon': 1e-8},
        'accum_grad_multiple_step': 2,
        'batch_size': 10000,
        'max_seqs': 200,
        [...]
    }

    from i6_experiments.users.rossenbach.returnn.nnet_constructor import ReturnnCommonSerializer,\
        ReturnnCommonExternData, ReturnnCommonDynamicNetwork, NonhashedCode

    network_file = Path("get_transformer_network.py")

    extern_data = [
        datastream.as_nnet_constructor_data(key) for key, datastream in training_datasets.datastreams.items()]

    config["train"] = training_datasets.train.as_returnn_opts()
    config["dev"] = training_datasets.cv.as_returnn_opts()
    config["eval_datasets"] =  {'devtrain': training_datasets.devtrain.as_returnn_opts()}

    recursionlimit = NonhashedCode(code=RECURSION_LIMIT_CODE)
    rc_extern_data = ReturnnCommonExternData(
        extern_data=extern_data
    )
    rc_network = ReturnnCommonDynamicNetwork(
        network_file=network_file,
        data_map={"source_data": "audio_features",
                  "target_data": "bpe_labels"},
        parameter_dict={}
    )
    serializer = ReturnnCommonSerializer(
        delayed_objects=[recursionlimit, rc_extern_data, rc_network],
        returnn_common_root=returnn_common_root,

    returnn_config = ReturnnConfig(
        config=config,
        post_config=post_config,
        python_epilog=[serializer],
    )
    return returnn_config

So this is close to what you imagined, except that we now have one Serializer which takes a list of custom objects/code etc. that do the returnn_common serialization for us. This allows the user to use quite strict modules like the ones I presented here (which do exactly what I would like to have), but you can also write just any custom code you want and have fewer things managed by helpers.

This code still uses the data helpers I used before, which we might want to replace. I do like the idea though of having the data helpers more Sisyphus specific, as this is where the data comes from. This of course does not prohibit us from moving the "Dataset" (not Datastream) helpers to returnn_common, as those are definitely Sisyphus independent.

Another thing is that here the extern_data generation does not make use of returnn_common code, because it was just easier now to directly let the helper write the "code" definition instead of first creating "real" nn.Dim + nn.Data objects, and then calling the get_base_extern_data_py_code_str to transform it back into a string.

Maybe we would have the standard case of just data and classes

I strongly oppose this idea. You should always give understandable names to your inputs, and when using custom names you never trigger any custom behavior that was specifically written for those two entries. (Yes, we could remove this custom behavior with a new behavior version, but still it is better to just give good names and not ones which only make sense for basic ASR tasks).

JackTemaki commented 2 years ago

As additional information, the ReturnnCommonDynamicNetwork only fills this template (so you could just replace it with some template code for now):

        template = textwrap.dedent("""\

        network_parameters = ${NETWORK_PARAMETERS}

        ${NETWORK_CODE}

        def get_network(epoch, **kwargs):
            nn.reset_default_root_name_ctx()
            net = construct_network(epoch, **network_parameters)
            return nn.get_returnn_config().get_net_dict_raw_dict(net)

        """

@albertz maybe you have a good idea how to make this more flexible, because otherwise things like a custom add_loss(net) call or something would need to be added as additional parameters to ReturnnCommonDynamicNetwork...

I also do not like the "conventions", that are not obvious to the user, meaning that you need to have a construct_network function in your model code file. But maybe this is less problematic than I think it is.

albertz commented 2 years ago

Where can I see your code? Why is it not pushed?

How is ReturnnCommonDynamicNetwork defined?

What is NETWORK_CODE? Why do you need that?

What is NETWORK_PARAMETERS and parameter_dict? Are these really params or rather kwargs? You should use the standard Python and ML terminology. Parameters are usually model parameters. Use kwargs if you mean kwargs.

I also do not like the "conventions", that are not obvious to the user, meaning that you need to have a construct_network function in your model code file. But maybe this is less problematic than I think it is.

Why would you hardcode that? As I wrote before in my example, next to the module name (module_name = __package__ + ".get_transformer_network" in your example), you would also pass func_name. So func_name = "construct_network" in your example.

Where is the code which does the import {module_name}? Then later I would expect sth like (as you see in my example above):

f"root_module = {self.module_name}.{self.func_name}({', '.join(f'{k}={v!r}' for k, v in self.func_kwargs.items())})"

How does the data_map work? How is the extern data connected with the model? This is unclear.

Maybe we would have the standard case of just data and classes

I strongly oppose this idea. You should always give understandable names to your inputs, and when using custom names you never trigger any custom behavior that was specifically written for those two entries.

In your example, you also did just the same but now it is called source_data and target_data. I don't see how that is different.

Where and how do you define the loss, and connect the model output and extern data with the loss?

How is NonhashedCode defined?

How is ReturnnCommonSerializer defined?

How is ReturnnCommonExternData defined?

The ReturnnCommonSerializer call is missing a ) at the end?

This code still uses the data helpers I used before, which we might want to replace. I do like the idea though of having the data helpers more Sisyphus specific, as this is where the data comes from. This of course does not prohibit us from moving the "Dataset" (not Datastream) helpers to returnn_common, as those are definitely Sisyphus independent.

You probably refer to #55. I still think this should all be in returnn_common, esp also Datastream. I mean specifically all the data structures. Not the pipeline logic. The pipeline logic (which covers where the data comes from) should be here in i6_experiments/common.

Another thing is that here the extern_data generation does not make use of returnn_common code, because it was just easier now to directly let the helper write the "code" definition instead of first creating "real" nn.Dim + nn.Data objects, and then calling the get_base_extern_data_py_code_str to transform it back into a string.

I don't really understand. How do you get the nn.Datas and nn.Dims then, and how do you make the nn.get_extern_data(...) calls?

JackTemaki commented 2 years ago

Where can I see your code? Why is it not pushed?

I did not push the code so that you do not try to comment on unfinished code, but I see that just posting partial things is also not useful.

Where is the code which does the import {module_name}?

This code does not exist, and will not exist in the solution we proposed here. I want to copy the actual nn code into the config. The problem is that importing always means (as I mentioned before) that the job will change when you edit code in your recipes. This means that if you run job arrays (e.g. a search over some hyperparameters), you can easily break stuff that is running, because a job might be scheduled in the middle of editing a file and thus break. Also, you cannot see from the job folder alone what is actually running.

Parameters are usually model parameters

Well, this is about hyperparameters...

In your example, you also did just the same but now it is called source_data and target_data. I don't see how that is different.

No, it is called audio_features and bpe_labels in the extern_data dict. source_data and target_data is just the nn.Data variable name to pass it to the net construction.

I don't really understand. How do you get the nn.Datas and nn.Dims then, and how do you make the nn.get_extern_data(...) calls?

They are created as string code and actually pasted into the config, so that this is verbose and simple to alter for debugging purposes. The nn.get_extern_data calls are the responsibility of the model construction code (I now renamed source_data and target_data):

def construct_network(epoch: int, audio_data: nn.Data, label_data: nn.Data, **kwargs):
    [...]
    net = BLSTMDownsamplingTransformerASR(
        audio_feature_dim=feature_dim, target_vocab=label_dim
    )
    out = net(
        audio_features=nn.get_extern_data(audio_data),
        labels=nn.get_extern_data(label_data),
        audio_time_dim=time_dim,
        label_time_dim=label_time_dim,
        label_dim=label_dim,
    )
    [...]

Ah, I forgot:

Where and how do you define the loss

Right now also in the construct_network.

JackTemaki commented 2 years ago

Independent of how this is created, what I imagine in the returnn.config is something like this (I changed network_parameters to network_kwargs to make it easier to understand):

audio_features_time = nn.Dim(
    kind=nn.Dim.Types.Spatial, description="audio_features_time", dimension=None
)
audio_features_feature = nn.Dim(
    kind=nn.Dim.Types.Feature, description="audio_features_feature", dimension=40
)
audio_features = nn.Data(
    name="audio_features",
    available_for_inference=True,
    dim_tags=[nn.batch_dim, audio_features_time, audio_features_feature],
    sparse_dim=None,
    sparse=False,
)
bpe_labels_time = nn.Dim(
    kind=nn.Dim.Types.Spatial, description="bpe_labels_time", dimension=None
)   
bpe_labels_indices = nn.Dim(
    kind=nn.Dim.Types.Feature, description="bpe_labels_indices", dimension=2051
)
bpe_labels = nn.Data(
    name="bpe_labels",
    available_for_inference=False,
    dim_tags=[nn.batch_dim, bpe_labels_time],
    sparse_dim=bpe_labels_indices,
    sparse=True,
)

extern_data = {
    "audio_features": audio_features.get_kwargs(),
    "bpe_labels": bpe_labels.get_kwargs(),
}

network_kwargs = {"audio_data": audio_features, "label_data": bpe_labels, [...]}

Passing audio_features.get_kwargs() to extern_data does not work right now because name is not allowed to be part of it, so it has to be removed before passing it, but this is just a small detail right now.
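A small sketch of the workaround described above; the exact keys returned by get_kwargs() are assumed here.

# Sketch of the workaround: strip "name" before using get_kwargs() for extern_data.
audio_kwargs = audio_features.get_kwargs()
audio_kwargs.pop("name", None)  # "name" must not be part of an extern_data entry
bpe_kwargs = bpe_labels.get_kwargs()
bpe_kwargs.pop("name", None)

extern_data = {
    "audio_features": audio_kwargs,
    "bpe_labels": bpe_kwargs,
}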

I hope this also clarifies what data_map is supposed to do, which is mapping the `nn.Data` name to the `network_kwargs` variable.

albertz commented 2 years ago

Where can I see your code? Why is it not pushed?

I did not push the code so that you do not try to comment on unfinished code

I don't understand. We are now commenting and discussing about the unfinished code. Why does it make sense to only have it partial? Well, except if it is incomplete obviously, but then we should clarify the missing pieces because many things were not clear to me (see my questions above).

Where is the code which does the import {module_name}?

This code does not exist, and will not exist in the solution we proposed here. I want to copy the actual nn code into the config.

But I don't really know how you would even achieve that. I explained that before. The net construction code would in practice not just depend on returnn_common (RC) and nothing else. Users likely would have their own experimental building blocks and want to reuse them. Reusing building blocks is the main point of the framework of RC, and that means reusing also your own building blocks. And we want to encourage that. It should be simple to put things together based on existing (stable or experimental) building blocks, and also in a hierarchical way, e.g. a building block itself would consist of other building blocks.

source_data and target_data is just the nn.Data variable name to pass it to the net construction.

Yes exactly, and that is what I proposed to call data and classes, nothing else. You just proposed different names now. But you still have done exactly the same as what I suggested.

I don't really understand. How do you get the nn.Datas and nn.Dims then, and how do you make the nn.get_extern_data(...) calls?

They are created as string code and actually pasted into the config, so that this is verbose and simple to alter for debugging purposes.

Where and how? I don't see that in your example. Where do you create it? How does it get into the config?

Btw, in your config code, the code is very non-standard, against all the conventions in RC and also RETURNN. I think it makes sense to have the code more consistent. This makes it much shorter, easier to write and easier to read:

audio_features_time = nn.SpatialDim("audio_features_time")
audio_features_feature = nn.FeatureDim("audio_features_feature", 40)
audio_features = nn.Data(
    name="audio_features",
    available_for_inference=True,
    dim_tags=[nn.batch_dim, audio_features_time, audio_features_feature]
)
bpe_labels_time = nn.SpatialDim("bpe_labels_time")
bpe_labels_indices = nn.FeatureDim("bpe_labels_indices", 2051)
bpe_labels = nn.Data(
    name="bpe_labels",
    available_for_inference=False,
    dim_tags=[nn.batch_dim, bpe_labels_time],
    sparse_dim=bpe_labels_indices
)

I hope this also clarifies what data_map is supposed to do, which is mapping the nn.Data name to the network_kwargs variable.

It's still not clear to me where the nn.Data and nn.Dim come from.

Also, how do you pass the nn.Dims? You only explained data_map for the nn.Datas.

Also, I still don't exactly understand your overall suggestion. So, you propose that the user defines such construct_network function in some recipe (get_transformer_network), and ReturnnCommonDynamicNetwork basically wraps that?

But in your example, you would rarely, if at all, change construct_network. You would change the network_kwargs entries, or you would change the code of BLSTMDownsamplingTransformerASR.

How would this example actually look when you want to experiment with model variants (when just network_kwargs would not be enough)? Would you have a separate network_file for each model? So you would have network_file + model_file for each model, assuming that this is a custom model defined by yourself in model_file?

How does this example look when you experiment with loss variants? Here you would probably copy the network_file and modify the loss code, because the loss code is also in that file.

For playing around with obvious hyperparameters, you would probably have them all as part of network_kwargs. But maybe not all. Then you could just add whatever else is missing to network_kwargs. But over time this would blow up your network_kwargs into a mess. In such a case I would probably prefer to copy the model itself and modify the model code directly. But then I also need to have a copy of the network_file.

My suggestion before basically would not use such network_file at all but only the model_file, and maybe also a separate loss_file or so, and then somehow (unclear how exactly) connect the extern data with the model and the loss.

Maybe if you make BLSTMDownsamplingTransformerASR (the model) itself a parameter of your network_file, it's close to what I suggested. This mapping between extern data, the model and the loss is then all in the network_file. The model is in another separate model_file and just the class is passed to the network_file.

JackTemaki commented 2 years ago

But I don't really know how you would even achieve that. I explained that before. The net construction code would in practice not just depend on returnn_common (RC) and nothing else. Users likely would have their own experimental building blocks and want to reuse them. Reusing building blocks is the main point of the framework of RC, and that means reusing also your own building blocks. And we want to encourage that. It should be simple to put things together based on existing (stable or experimental) building blocks, and also in a hierarchical way, e.g. a building block itself would consist of other building blocks.

Yes sure, how am I restricting this? The current limitation is only that your custom modules (which are not part of returnn_common itself) need to be in a single file, that's it (and this is easy to change, I guess). The only thing that I wanted to say here is:

from i6_experiments import SomeModel

is not allowed to be called from within the returnn process.

I probably would prefer to copy the model itself in such case and modify the model code directly. But then I also need to have a copy of the network_file.

There is no separate network, model or loss file; right now I have everything in exactly one file, which is get_transformer_network.py. So if the kwargs reach their limits, I would create a new file. Also, if I do major edits on any of the building blocks, I would make a copy beforehand.

Also, how do you pass the nn.Dims?

They could either be passed in the same way (calling it data_and_dim_map or whatever), or just be extracted from the nn.Data again, which is what I do now.
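
For illustration, a minimal sketch of that second variant, reusing the audio_features / bpe_labels objects from the code example above. It hardcodes a fixed dim_tags order ([batch, time, feature] for the audio, [batch, time] for the sparse labels), which is exactly the assumption questioned below:

audio_time_dim = audio_features.dim_tags[1]      # assumes order [batch, time, feature]
audio_feature_dim = audio_features.dim_tags[2]
label_time_dim = bpe_labels.dim_tags[1]          # assumes order [batch, time]
label_dim = bpe_labels.sparse_dim                # the sparse dim carries the label vocabulary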

and then somehow (unclear how exactly) connect the extern data with the model and the loss.

Yes, but this is difficult to do in an automatic way. So I think the user will always want to define that code by hand as well.

Maybe if you make BLSTMDownsamplingTransformerASR (the model) itself a parameter of your network_file, it's close to what I suggested. This mapping between extern data, the model and the loss is then all in the network_file. The model is in another separate model_file and just the class is passed to the network_file.

The current approach could easily be extended to have separate files for network, model and loss. I think we should go with a solution where the user can do the splitting arbitrarily; in the end you just pass (as you correctly suggest) the name of the function which should be called from within get_network in the config to get the final network for a specific epoch.
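
To make this concrete, a rough sketch of what such a constructor function could look like. The actual get_transformer_network.py / construct_network code is not shown in this thread, so the signature, the module interface and the loss code below are assumptions, and the returnn_common calls follow the current nn API, which may differ from the version used here:

from returnn_common import nn


def construct_network(
        epoch: int,
        net_module,                  # e.g. BLSTMDownsamplingTransformerASR, passed in as a class
        audio_data: nn.Data,
        label_data: nn.Data,
        audio_feature_dim: nn.Dim,
        label_dim: nn.Dim,
        **net_kwargs,
):
    """Build the model for the given epoch, attach the loss, and return the root module."""
    # epoch could be used here for pre-training / staged network construction
    net = net_module(in_dim=audio_feature_dim, out_dim=label_dim, **net_kwargs)
    audio = nn.get_extern_data(audio_data)
    targets = nn.get_extern_data(label_data)
    out = net(audio)
    loss = nn.cross_entropy(target=targets, estimated=out, estimated_type="log-probs")
    loss.mark_as_loss("ce")
    return net

The get_network(epoch) wrapper in the config (see the example further below) would then call exactly such a function with the mapped kwargs.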

albertz commented 2 years ago

Where is the code which does the import {module_name}?

This code does not exist, and will not exist in the solution we proposed here. I want to copy the actual nn code into the config.

But I don't really know how you would even achieve that. I explained that before. The net construction code would in practice not just depend on returnn_common (RC) and nothing else. Users likely would have their own experimental building blocks and want to reuse them. Reusing building blocks is the main point of the framework of RC, and that means reusing also your own building blocks. And we want to encourage that. It should be simple to put things together based on existing (stable or experimental) building blocks, and also in a hierarchical way, e.g. a building block itself would consist of other building blocks.

Yes sure, how am I restricting this? The current limitation is only that your custom modules (which are not part of returnn_common itself) need to be in a single file, that's it (and this is easy to change, I guess). The only thing that I wanted to say here is:

from i6_experiments import SomeModel

is not allowed to be called from within the returnn process.

I don't understand. This is what I explained before. The net construction code would in practice not just depend on returnn_common (RC) and nothing else. Users likely would have their own experimental building blocks and want to reuse them.

So, what I asked a couple of times before: where do you actually have custom user model code? I thought we already clarified that such code is somewhere in i6_experiments.users...models....

Where does your BLSTMDownsamplingTransformerASR come from?

Also, how do you pass the nn.Dims?

data_and_dim_map ... or just extract it from the nn.Data again

How do you extract it from nn.Data? So you hardcode the assumption that there is a specific number of dim tags, in a specific order?

and then somehow (unclear how exactly) connect the extern data with the model and the loss.

Yes, but this is difficult to do in an automatic way. So I think the user will always want to define that code by hand as well.

Yes, but if you have this separate (e.g. in a construct_file), i.e. excluding the model, the loss and the extern data but just putting it all together, then this is something the user defines once and probably reuses for many experiments (unless there are extra targets, more custom losses, or whatever).

JackTemaki commented 2 years ago

Independent of how the actual helpers are implemented, I want to talk only about the user code and resulting config. I now followed the case where you directly import from i6_experiments, but this does not look really different if you extract the package paths and do a local copy:

The user code:

    from i6_experiments.users.rossenbach.returnn.nnet_constructor import ReturnnCommonSerializer,\
        ReturnnCommonExternData, ReturnnCommonDynamicNetwork, NonhashedCode, ReturnnCommonImport

    extern_data = [
        datastream.as_nnet_constructor_data(key) for key, datastream in training_datasets.datastreams.items()]

    config["train"] = training_datasets.train.as_returnn_opts()
    config["dev"] = training_datasets.cv.as_returnn_opts()
    #config["eval_datasets"] =  {'devtrain': training_datasets.devtrain.as_returnn_opts()}

    rc_recursionlimit = NonhashedCode(code=RECURSION_LIMIT_CODE)
    rc_extern_data = ReturnnCommonExternData(extern_data=extern_data)
    rc_model = ReturnnCommonImport(
        "i6_experiments.users.rossenbach.returnn.common_modules.asr_transformer.BLSTMDownsamplingTransformerASR")
    rc_construction_code = ReturnnCommonImport(
        "i6_experiments.users.rossenbach.returnn.common_modules.simple_asr_constructor.construct_network")

    rc_network = ReturnnCommonDynamicNetwork(
        net_func_name="construct_network",
        net_func_map={"net_module": "BLSTMDownsamplingTransformerASR",
                      "audio_data": "audio_features",
                      "label_data": "bpe_labels",
                      "audio_feature_dim": "audio_features_feature",
                      "audio_time_dim": "audio_features_time",
                      "label_time_dim": "bpe_labels_time",
                      "label_dim": "bpe_labels_indices"
                     },
        net_kwargs={'weight_decay': 0.1}
    )

    serializer = ReturnnCommonSerializer(
        delayed_objects=[rc_recursionlimit,
                         rc_extern_data,
                         rc_model,
                         rc_construction_code,
                         rc_network],
        returnn_common_root=returnn_common_root,
    )
    returnn_config = ReturnnConfig(
        config=config,
        post_config=post_config,
        python_epilog=[serializer],
    )
    return returnn_config

And the relevant part of the config:

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")
from returnn_common import nn

import resource
import sys

try:
    resource.setrlimit(resource.RLIMIT_STACK, (2**29, -1))
except Exception as exc:
    print(f"resource.setrlimit {type(exc).__name__}: {exc}")
sys.setrecursionlimit(10**6)

bpe_labels_indices = nn.FeatureDim("bpe_labels_indices", 2051)
bpe_labels_time = nn.SpatialDim("bpe_labels_time", None)
audio_features_time = nn.SpatialDim("audio_features_time", None)
audio_features_feature = nn.FeatureDim("audio_features_feature", 40)
audio_features = nn.Data(
    name="audio_features",
    available_for_inference=True,
    dim_tags=[nn.batch_dim, audio_features_time, audio_features_feature],
    sparse_dim=None,
)
bpe_labels = nn.Data(
    name="bpe_labels",
    available_for_inference=False,
    dim_tags=[nn.batch_dim, bpe_labels_time],
    sparse_dim=bpe_labels_indices,
)
audio_features_args = audio_features.get_kwargs()
audio_features_args.pop("name")
bpe_labels_args = bpe_labels.get_kwargs()
bpe_labels_args.pop("name")

extern_data = {
    "audio_features": audio_features_args,
    "bpe_labels": bpe_labels_args,
}

from i6_experiments.users.rossenbach.returnn.common_modules.asr_transformer import (
    BLSTMDownsamplingTransformerASR,
)

from i6_experiments.users.rossenbach.returnn.common_modules.simple_asr_constructor import (
    construct_network,
)

network_kwargs = {
    "weight_decay": 0.1,
    "net_module": BLSTMDownsamplingTransformerASR,
    "audio_data": audio_features,
    "label_data": bpe_labels,
    "audio_feature_dim": audio_features_feature,
    "audio_time_dim": audio_features_time,
    "label_time_dim": bpe_labels_time,
    "label_dim": bpe_labels_indices,
}

def get_network(epoch, **kwargs):
    # called by RETURNN with the current epoch to build the network dict
    nn.reset_default_root_name_ctx()
    net = construct_network(epoch, **network_kwargs)
    return nn.get_returnn_config().get_net_dict_raw_dict(net)

I personally do not like that the construction code is not directly visible, but there is no problem in "pasting" it in instead of importing it. The ReturnnCommonSerializer is flexible enough to do this in any way you want. You could also skip the automatic extern_data creation and import your own custom code.

The only thing I do not like is the net_func_map parameter, but I do not see yet how this could be solved differently.

Also, I still think that sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe"), i.e. linking directly into the recipes, is something that should not be done. But we can leave this to user preference.

(@albertz please do not comment on formatting or simplification, only on if this follows the general principles you imagined, we should leave this for the PR)

albertz commented 2 years ago

I now followed the case where you directly import from i6_experiments, but this does not look really different if you extract the package paths and do a local copy

There should not be any difference at all, right? Except maybe of some parts of the boilerplate code like sys.path.insert.

        net_func_map={"net_module": "BLSTMDownsamplingTransformerASR",
                      "audio_data": "audio_features",
                      "label_data": "bpe_labels",
                      "audio_feature_dim": "audio_features_feature",
                      "audio_time_dim": "audio_features_time",
                      "label_time_dim": "bpe_labels_time",
                      "label_dim": "bpe_labels_indices"
                     },

I still see a couple of problems in this net_func_map:

* net_module should be more directly set like rc_model.name or so. I don't like that you need to repeat that name BLSTMDownsamplingTransformerASR.
* The dataset key names audio_features and co: You must somehow know what the dataset gives you. I don't like this too much. This anyway looks custom here, I assume because there is some MetaDataset in there? Again I would prefer if you somehow could share the names and have it only once (now you have it a second time somewhere in your dataset definition), maybe as a global constant or so.
* This also mixes up the args for the model definition (only the feature dims needed) and the args for the model call (the data and spatial dims). Although this would be up to the construct_network function. But then basically you again have a separate mapping inside your construct_network. I don't like that we now need this mapping twice, once here for net_func_map, and a second mapping in construct_network.

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")

I think it would be better if the path was constructed relative to __file__, or not?

JackTemaki commented 2 years ago

I now followed the case where you directly import from i6_experiments, but this does not look really different if you extract the package paths and do a local copy

There should not be any difference at all, right? Except maybe of some parts of the boilerplate code like sys.path.insert.

Yes, this is exactly what "does not look really different" means.

net_module should be more directly set like rc_model.name or so. I don't like that you need to repeat that name BLSTMDownsamplingTransformerASR.

Yes, this one is easy.

The dataset key names audio_features and co: You must somehow know what the dataset gives you. I don't like this too much. This anyway looks custom here, I assume because there is some MetaDataset in there? Again I would prefer if you somehow could share the names and have it only once (now you have it a second time somewhere in your dataset definition), maybe as a global constant or so.

You could do it globally; the only difference is then that you lose some detail, because in my dataset code I specifically want to say bpe_labels, while the construction code and net_definition do not need to know if this is bpe or spm or phonemes or whatever. So while you certainly could do a global default, you will always need some mapping. But there is no need to write it by hand. The build_training_dataset function could also return the names, because this is where they are defined.

See e.g. https://github.com/rwth-i6/i6_experiments/blob/main/users/rossenbach/experiments/librispeech/librispeech_100_attention/conformer_2022/pipeline.py#L62

This also mixes up the args for the model definition (only the feature dims needed) and the args for the model call (the data and spatial dims). Although this would be up to the construct_network function. But then basically you again have a separate mapping inside your construct_network. I don't like that we now need this mapping twice, once here for net_func_map, and a second mapping in construct_network.

Yes, this is true, but as we have different levels of abstraction you end up with different names. But if you take care of keeping the names equivalent you can just do (this can be a separate helper somewhere):

net_func_map = {}
for data in extern_data:
    net_func_map[data.name] = data.name  # identity mapping: reuse the dataset names as constructor argument names
    net_func_map.update({dim.name: dim.name for dim in data.dim_tags})

(data and dim are here DataInitArgs and DimInitArgs objects, which are currently part of my helpers)

Or something somehow similar. I mean in the end this is a user decision if everything is named the same or you want to do some individual mapping.

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")

I think it would be better if the path was constructed relative to __file__, or not?

But how can we determine where the recipes are located based on the location of the returnn config? There is no guarantee that the relative location is always the same. A relative path only works if we copy the respective code to the job folder.

albertz commented 2 years ago

The dataset key names audio_features and co: You must somehow know what the dataset gives you. I don't like this too much. This anyway looks custom here, I assume because there is some MetaDataset in there? ...

... But there is no need to write it by hand. The build_training_dataset function could also return the names, because this is where they are defined.

Ok, this is what I want. But then it's still a bit unclear how you would use it.

See e.g. https://github.com/rwth-i6/i6_experiments/blob/main/users/rossenbach/experiments/librispeech/librispeech_100_attention/conformer_2022/pipeline.py#L62

I don't really understand why you now introduce yet another mapping here. So now you have 3 mappings:

* `build_training_datasets` via `MetaDataset`
* `net_func_map`
* `construct_network`

Yes, as you say, this is up to the user. You could also add a few more mappings if you like. But I don't understand why this is useful. This only looks like it would cause confusion and doesn't really give you any benefit.

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")

I think it would be better if the path was constructed relative to __file__, or not?

But how can we determine where the recipes are located based on the location of the returnn config? There is no guarantee that the relative location is always the same.

Via something like os.path.dirname(os.path.abspath(__file__)) + "/../", you get access to the Sisyphus job dir. I assumed that there should be some way to figure out the Sis base dir from there, or not?

JackTemaki commented 2 years ago

The dataset key names audio_features and co: You must somehow know what the dataset gives you. I don't like this too much. This anyway looks custom here, I assume because there is some MetaDataset in there? ...

... But there is no need to write it by hand. The build_training_dataset function could also return the names, because this is where they are defined.

Ok, this is what I want. But then it's still a bit unclear how you would use it.

See e.g. https://github.com/rwth-i6/i6_experiments/blob/main/users/rossenbach/experiments/librispeech/librispeech_100_attention/conformer_2022/pipeline.py#L62

I don't really understand why you now introduce yet another mapping here. So now you have 3 mappings:

* `build_training_datasets` via `MetaDataset`

* `net_func_map`

* `construct_network`

Yes, as you say, this is up to the user. You could also add a few more mappings if you like. But I don't understand why this is useful. This only looks like it would cause confusion and doesn't really give you any benefit.

The extern_data that is defined there is not used; this code belongs to a "normal" pipeline. And yes, of course you need a mapping in the meta-dataset.

So in the end there is:

* name of the actual "datastream" as it is displayed in RETURNN (extern_data key name)
* name of the variable for the data object in the constructor function
* name of the variable for the data object (or dim prefix) in the Module

They CAN all be the same, so in some way be automatically filled from the first one.

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")

I think it would be better if the path was constructed relative to __file__, or not?

But how can we determine where the recipes are located based on the location of the returnn config? There is no guarantee that the relative location is always the same.

Via something like os.path.dirname(os.path.abspath(__file__)) + "/../", you get access to the Sisyphus job dir. I assumed that there should be some way to figure out the Sis base dir from there, or not?

No, os.path.dirname(os.path.abspath(__file__)) gives you /work/asr4/rossenbach/sisyphus_work_folders/tts_asr_2021_work/i6_core/returnn/training/ReturnnTrainingJob.0JFYICnhz8aM/output; you cannot know where your recipes are from that, as they are even on a different filesystem (here /u/...).

albertz commented 2 years ago

And yes, of course you need a mapping in the meta-dataset.

So in the end there is:

  • name of the actual "datastream" as it is displayed in RETURNN (extern_data key name)
  • name of the variable for the data object in the constructor function
  • name of the variable for the data object (or dim prefix) in the Module

I would also count the MetaDataset here. Esp the way you use it: Just containing a single dataset, and mapping the data key names to some other names.

They CAN all be the same, so in some way be automatically filled from the first one. ...

I see that there has to be at least one mapping somewhere from the dataset to the model and loss.

But I don't really see the benefit of having more than one mapping (4 mappings in your case). It just looks more complicated without any benefit.

sys.path.insert(0, "/u/rossenbach/experiments/tts_asr_2021/recipe")

I think it would be better if the path was constructed relative to __file__, or not?

But how can we determine where the recipes are located based on the location of the returnn config? There is no guarantee that the relative location is always the same.

Via something like os.path.dirname(os.path.abspath(__file__)) + "/../", you get access to the Sisyphus job dir. I assumed that there should be some way to figure out the Sis base dir from there, or not?

No, os.path.dirname(os.path.abspath(__file__)) gives you /work/asr4/rossenbach/sisyphus_work_folders/tts_asr_2021_work/i6_core/returnn/training/ReturnnTrainingJob.0JFYICnhz8aM/output; you cannot know where your recipes are from that, as they are even on a different filesystem (here /u/...).

For example, you access ../info and see the other input paths, like /u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.An0M85CZcEIW/output/repository. Now you find the common dir (matching from the right), yielding /u/rossenbach/experiments/tts_asr_2021/work. Then you go up until you find a dir with the usual Sisyphus files (e.g. check for config/__init__.py or config.py, settings.py, work, recipe). So you end up with /u/rossenbach/experiments/tts_asr_2021.
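
Purely to illustrate that heuristic, a sketch (not a tested implementation; the info-file parsing, the marker file names, and the simplification of just taking the common prefix of the input paths are assumptions):

import os


def guess_sis_base_dir(job_dir: str) -> str:
    """Collect absolute input paths from the job's info file, take their common
    directory (typically somewhere inside the Sisyphus work dir), then walk
    upwards until a directory with the usual Sisyphus files is found."""
    input_paths = []
    with open(os.path.join(job_dir, "info")) as f:
        for line in f:
            input_paths += [tok for tok in line.split() if tok.startswith("/")]
    if not input_paths:
        raise RuntimeError("no absolute paths found in the info file")
    d = os.path.commonpath(input_paths)
    markers = ("settings.py", "config.py", os.path.join("config", "__init__.py"), "recipe")
    while d and d != "/":
        if any(os.path.exists(os.path.join(d, m)) for m in markers):
            return d
        d = os.path.dirname(d)
    raise RuntimeError("could not find the Sisyphus base dir")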

But yes, this is unnecessarily complicated and heuristic.

Besides, I don't quite understand why the Sisyphus base dir is so much hidden away. When someone sends me a path to some job dir, I'm often interested in also looking at the related Sis recipe/config code but it's hard to find. But this difficulty is unnecessary. There could be some log file with that information. There could be a symlink back to the base dir in the root of the work dir, and in each job there could be a relative symlink to the root of the work dir.
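
A minimal sketch of those two suggested symlinks (hypothetical; this would have to live somewhere in the Sisyphus job setup code, and the paths here are just placeholders):

import os

# placeholder paths, only for illustration
sis_base_dir = "/u/rossenbach/experiments/tts_asr_2021"
work_dir = os.path.join(sis_base_dir, "work")
job_dir = os.path.join(work_dir, "i6_core/returnn/training/ReturnnTrainingJob.xxxxxxxx")

# work/base_dir -> Sisyphus base dir, so the setup is findable from the work dir
os.symlink(sis_base_dir, os.path.join(work_dir, "base_dir"))
# <job>/work_root -> relative link from each job back to the root of the work dir
os.symlink(os.path.relpath(work_dir, job_dir), os.path.join(job_dir, "work_root"))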

The motivation to have the path relative is to be able to move around both the work dir and the Sisyphus base dir. Although this example here is by far not the only reason why this currently is not possible. I see many other symlinks (e.g. in input) which also all use absolute paths.

Anyway, this is going off-topic here now. Probably your initial suggestion with the absolute path is fine for now then.

albertz commented 2 years ago

So, how do we proceed now? Will you create a PR with some implementations of these ReturnnCommonSerializer, ReturnnCommonExternData, ReturnnCommonDynamicNetwork, NonhashedCode, ReturnnCommonImport helpers? Because I also want to start some experiments soon. Or instead of a PR, we could also work directly in the master branch and put it into common, add some README with a warning saying "WIP" or so, and discuss any needed changes here in the issue.

JackTemaki commented 2 years ago

I will make a PR into common, and we can mark this as WIP when merged. But I would not push into master directly.

JackTemaki commented 2 years ago

Besides, I don't quite understand why the Sisyphus base dir is so much hidden away. When someone sends me a path to some job dir, I'm often interested in also looking at the related Sis recipe/config code but it's hard to find. But this difficulty is unnecessary. There could be some log file with that information. There could be a symlink back to the base dir in the root of the work dir, and in each job there could be a relative symlink to the root of the work dir.

This sounds a little bit dangerous, as with job sharing the jobs themselves might be symlinked from other setup locations. With the symlink back to the root I am not sure if this has any implications, as this then creates a cycle, but feel free to check if this works.

Also the info file often has some paths in it with the /u/ prefix.

albertz commented 2 years ago

What is the state here now?

JackTemaki commented 2 years ago

We have to wait on the bugfixes from @Atticus1806 until we can fully test the pipeline, so this is on hold for now.

albertz commented 2 years ago

But where is the code branch? I also want to work on this. Or should I reimplement everything we discussed here myself?

JackTemaki commented 2 years ago

You can use https://github.com/rwth-i6/i6_experiments/blob/main/users/rossenbach/experiments/librispeech/librispeech_100_attention/lstm_encdec_2022_extern_build/prototype_transformer.py as a starting point to look at a current setup with this.

albertz commented 2 years ago

But there you have it all under i6_experiments.users.rossenbach..... We want to have it in common. So should I move your code there?

JackTemaki commented 2 years ago

I cannot just push it into common without review or making an extra branch, so I first put it under my user. Otherwise my Hiwis cannot use it easily.

You can of course copy it there, make your own adjustments and then start a PR for discussion. I just did not have the capacity for that yet.

albertz commented 2 years ago

I can not just push it in common without review

Why not? I thought we discussed this already? We would put a readme saying "this is work in progress" or so. I think this would be the easiest workflow.

Sure, we could also have it in a separate extra branch. But not sure if there is really any benefit.

albertz commented 2 years ago

Some initial code was merged as part of PR #66. The code is now in i6_experiments.common.setups.returnn_common, here: https://github.com/rwth-i6/i6_experiments/tree/main/common/setups/returnn_common

I added some summary from the discussion here to the README.

albertz commented 2 years ago

The dataset stuff is still somewhat an open question. We had some initial discussion in the (unmerged, closed) PR #55. My opinion was that such dataset helpers belong to returnn_common as well, at least the definition of the dataset config dicts for RETURNN. Only recipe pipeline related dataset logic should be here in i6_experiments/common.

albertz commented 2 years ago

How should we continue with discussions on the helper code? Here in this issue? In separate issues? Just directly (Slack, in person) (but then others cannot read it afterwards or join the discussion)?