rwth-i6 / i6_experiments

Mozilla Public License 2.0
8 stars 13 forks source link

How to integrate `returnn_common.nn` into Sisyphus pipelines #63

Closed JackTemaki closed 1 year ago

JackTemaki commented 2 years ago

When trying to create Returnn networks with returnn_common.nn, the following issues or questions appeared:

If returnn_common.nn would be used for construction within the manager:

As we will not solve the hashing issue without spending way to much time not doing experiments, I would propose to go with a completely different approach than we did before, which is:

-> Create a job which takes a executable and some pre-defined parameters does the network construction as task instead of in the manager.

This of course would mean that the construction result is not hashed, but only the constructor location and the parameters. To be able to find potentially unwanted changes and to be able to reconstruct the networks exactly, the job should create a local copy of the used construction code at runtime (returnn and returnn_common can of course correctly be identified by their git hash).

While this would require some manual management of the network construction to not make logic changing mistakes, I currently do not see any suitable alternative to this.

albertz commented 2 years ago

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

albertz commented 2 years ago

DataInitArgs is problematic in a similar way. Why does it have has_batch_dim? Why do we need it at all?

albertz commented 2 years ago

All classes with ReturnnCommon... prefix: I think I would rename them and remove this prefix. There is nothing really returnn-common specific about these classes. Also, the package name already indicates that this is to be used for returnn-common. This is redundant and also confusing because there is nothing really returnn-common specific in it.

albertz commented 2 years ago

ReturnnCommonSerializer, I would rename that to SerializerObjectCollection or so because it very much is just like a list of SerializerObjects, with only some optional additional logic.

albertz commented 2 years ago

The logic in ReturnnCommonSerializer (SerializerObjectCollection) to extract packages from the SerializerObjects is very problematic, together with make_local_package_copy. This will miss any other imports inside that files, and there is also no real error checking for such cases. I don't think we should have such broken and dangerous code. So this either should be completely rewritten (which will be quite complex and complicated though) or removed.

albertz commented 2 years ago

ReturnnCommonSerializer (SerializerObjectCollection), the type of packages is inconsistent. In the docstring, it says list, while in the code, it might become a set.

albertz commented 2 years ago

ReturnnCommonImport: _package should not be private. You also access it from outside. This is a Python code violation. Or you don't care about proper Python code?

albertz commented 2 years ago

You should never use os.getcwd().

If the dir is somehow relative to __file__, you can use (and that should be in the global scope of the module, not inside a function):

_module_dir = os.path.dirname(os.path.absname(__file__))

Or if this is the Sisyphus base dir, we can use gs.BASE_DIR.

JackTemaki commented 2 years ago

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

For a dict it is not well defined what parameters should be defined.

JackTemaki commented 2 years ago

You should never use os.getcwd() but __file__ instead, and that should be in the global scope of the module (not inside a function), sth like:

_module_dir = os.path.dirname(os.path.absname(__file__))

But this returns something completely different. I do not see how those two are related.

JackTemaki commented 2 years ago

For the rest of the comments (besides the naming): I do not understand why you are even mentioning it here? You specifically urged me to push code that was not "complete" in a sense that this is just some work in progress code. This is why of course there are typing or private issues.

This is a Python code violation. Or you don't care about proper Python code?

If you are giving comments like this then I will refrain from pushing any work-in-progress code in the future. Also it does not make sense to comment on things like this in the issue here. You specifically told me you do NOT want to a PR style review. So why are you doing this now?

albertz commented 2 years ago

DimInitArgs is problematic. It does not exactly match the Dim interface, specifically is_feature. Why not just use a dict for the kwargs?

For a dict it is not well defined what parameters should be defined.

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here. You could also just directly use nn.Dim instead.

albertz commented 2 years ago

You should never use os.getcwd() but __file__ instead, and that should be in the global scope of the module (not inside a function), sth like:

_module_dir = os.path.dirname(os.path.absname(__file__))

But this returns something completely different. I do not see how those two are related.

First of all, again, you should never use os.getcwd(). I assumed that you wanted instead to actually get the module dir. I'm not sure which dir you want otherwise. In any case, do not use os.getcwd(). The current dir can be totally arbitrary and is not well defined. The module dir is, so you can always express it relative to that. Or use some other global config information or so.

albertz commented 2 years ago

For the rest of the comments (besides the naming): I do not understand why you are even mentioning it here?

Where else? I asked you about that, see above. Or you see you don't want to discuss it at all and I should just change it? I don't understand your statement.

This is a Python code violation. Or you don't care about proper Python code?

If you are giving comments like this then I will refrain from pushing any work-in-progress code in the future.

I don't understand?

I just want to know what Python code standards you want to follow here. I'm not used to the code style in i6_core and i6_experiments/common. I see a lot of Python code violations everywhere in there. So I wondered whether you just don't care about that and you want it to be this way or not.

Why do you want that I don't give comments on the code?

Why is that related to pushing work-in-progress code? Why do you want to refrain from pushing WIP code? This does not make sense.

Also it does not make sense to comment on things like this in the issue here.

Again, I asked you about that, where to post these things, and you did not answer, and we already said earlier that we might just discuss it here in the issue, and I did not have a better idea where to post them, so I just posted it here.

Where exactly is the problem now?

Where else should I post them?

Or you just want that I don't give any comments on this and we do not discuss it further?

Also, I just wanted to write them down, to not forget about them. So you don't want to see my comments and I should better keep it private for me?

You specifically told me you do NOT want to a PR style review. So why are you doing this now?

No, I did not say that. I said I want that we work directly in the master branch because I also directly want to work with the code and not just look at it in a PR, or work in another branch.

I did not say that we should stop discussions on the code. I wonder, how do you want to work on the code together if you don't want that we discuss the code?

albertz commented 2 years ago

NonhashedCode, I think it's bad that code as type str refers to the code itself while code as tk.Path refers to the file content. In almost all other code, when we have the type str|tk.Path, it means the filename in both cases. I think this is much more expected this way. I would introduce two separate attribs code_from_file and just code to differentiate it clearly.

(I'm a bit confused about your comments. So instead writing this here, I should just do it? Just as with my other remarks, I just just clean it up as I think it makes sense? In any case, better not discuss with you about it?)

JackTemaki commented 2 years ago

Also, I just wanted to write them down, to not forget about them. So you don't want to see my comments and I should better keep it private for me?

But then please at least refrain from adding unnecessary statements to code comments that could be counted as personal attack. Nobody writes perfect code right away, and of course you can comment on that, but then do this is in a proper professional way without personal accusations, especially when you know that this was just some arbitrary WIP code that was pushed here.

So lets keep this discussion first about relevant high-level things:

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here.

But this exactly what I do not want when working with RETURNN, that I have to look up in another code base or documentation what parameters are valid. The code itself should clearly define what is valid. This was the fundamental problem when working with dict-structures for layer definitions, and this is also something that should not happen here.

You could also just directly use nn.Dim instead.

This would bring back the Tensorflow and RETURNN dependency to the Sisyphus manager thread, which I still strongly disagree with.

Then regarding comments I consider rather off-topic at this moment:

W.r.t. the os.getcwd() issue (which I still consider not really relevant at this point), I searched through Sisyphus, and what we could use is gs.BASE_DIR, which is the path that I want.

In almost all other code, when we have the type str|tk.Path.

In most cases this is just because we can not break with old behavior (importing locations from global_settings mostly), otherwise this would have been removed already. Specifying file locations via str in code run by the Sisyphus manager thread should be considered wrong or at least dangerous in any case.

albertz commented 2 years ago

Also, I just wanted to write them down, to not forget about them. So you don't want to see my comments and I should better keep it private for me?

But then please at least refrain from adding unnecessary statements to code comments that could be counted as personal attack.

Where are you reading anything like that? Why do you interpret things like that into what I wrote? I'm totally confused. Are you referring to my question on what code style you want to follow here? So I should not ask about what code style we want to follow?

albertz commented 2 years ago

Why does it need to be? It's whatever nn.Dim accepts. This is defined by RETURNN. I don't see why we should redefine it here.

But this exactly what I do not want when working with RETURNN, that I have to look up in another code base or documentation what parameters are valid. The code itself should clearly define what is valid. This was the fundamental problem when working with dict-structures for layer definitions, and this is also something that should not happen here.

My suggestion was actually to just use the Data and Dim from RETURNN directly. I don't really see the point in replicating another wrapper around it. returnn-common also does the same.

Note that for layers, returnn-common obviously creates a wrapper around them but this is for other technical reasons, and I'm also not super happy with it. However, to really avoid the problem that the wrapper and the layer interface diverge (this is one of the big problems with wrappers, when the wrapper even in the beginning does not fully match the real interface), we create them automatically. Although this might be a bit overkill of course now for Data and Dim.

This would bring back the Tensorflow and RETURNN dependency to the Sisyphus manager thread, which I still strongly disagree with.

I still see this as very arbitrary. We also have e.g. Numpy and other libs. And why not? Your only argument here is that the file system is slow and TF import is slow?

Such technical administrative issues should really not be a reason to make the code more complicated and introduce unnecessary wrappers and layers of abstraction.

In any case, this is why I bring this up here. Such things need to be discussed. Some opinion from others would also be helpful.

JackTemaki commented 2 years ago

Where are you reading anything like that? Why do you interpret things like that into what I wrote? I'm totally confused. Are you referring to my question on what code style you want to follow here? So I should not ask about what code style we want to follow?

Lets talk about this offline, it is definitely not about a question regarding code style (or in case you wanted to ask only that, then you did not formulate it that way).

In any case, this is why I bring this up here. Such things need to be discussed. Some opinion from others would also be helpful.

But about this I do not argue, this is exactly what we can and should discuss.

I still see this as very arbitrary. We also have e.g. Numpy and other libs. And why not? Your only argument here is that the file system is slow and TF import is slow?

And I do not see this as arbitrary. Yes, some modules unfortunately import numpy on the global level, but in most cases it is only used in tasks. But here you could argue that most people kind of consider numpy as elemental part of python.

Again, this is not (only) about the TF import. This is about having a fixed TF and RETURNN dependency in my Sisyphus manager in the first place. The manager should not have that. We can rather argue to remove numpy from the global imports than to add Tensorflow and RETURNN. (E.g. you can run all jobs with a specified singularity/apptainer image, which would specifically define which numpy to use for job tasks, independent of what happens in the manager).

Also, the wrapper has the advantage that you are not showing many parameters that are completely irrelevant. If you expose nn.Dim and nn.Data, you have many options that are not relevant at all and are more confusing than helpful to the user.

The additional code here is really minimal, it is not about re-implementing anything. It is really just a dataclass object holding some parameters, thats it.

With your argument, you could also say we should import modules directly from RASR with C/C++ to Python bindings instead of re-implementing some interface helpers (like bliss format implementations or RASR-flow definitions).

albertz commented 1 year ago

Note on the dataset questions, there has been some progress:

JackTemaki commented 1 year ago

Regarding the main points this Issue became obsolete now, as we are no longer passing constructed networks dicts for training. For both RETURNN-frontend and RETURNN-pure-PyTorch setups, we are linking code directly from separate files into the RETURNN configs using e.g. https://github.com/rwth-i6/i6_experiments/blob/main/common/setups/serialization.py

As importing those separate files containing model code can be avoided if necessary, also importing PyTorch/Tensorflow in the manager is no longer an issue.