Suggestions for discussing a more organized code structure

ArneNx commented 5 years ago

I understand that removing/ deactivating the Theano backend is not an option right now. I don't have enough insight into these parts to know whether moving the corresponding files into a separate folder (instead of adding a prefix to every file) would break something.

However, there appears to be some work on support for a pytorch backend (at least there is a new branch with work on this). Would it be possible to move this stuff to a separate folder directly instead of adding more files to the root folder?

In general, I think it might be a good idea to collect some ideas on how we could improve the structure of the code. I would suggest moving the layers into a folder structure similar to the organization in the docu by @JackTemaki. (see: https://returnn.readthedocs.io/en/latest/layer_reference/index.html#layer-types )

albertz commented 5 years ago

Example projects (for examples of structuring):

TensorFlow. src in /tensorflow/, tests alongside with code in *_test.py files
HuggingFace NLP. src in /src/nlp/, tests in /tests/
AllenNLP. src in /allennlp/, tests in /tests/

I was planning this since a while. But this is a bit non-trivial. There are multiple cases you have to think about:

When RASR imports RETURNN, it does so by importing the SprintInterface.py module (or some of the others). I am not sure if it will correctly work if this is now part of a Python package (submodule). Esp think about relative imports or so.
The main entry point (rnn.py) cannot be part of the package itself. So it has to be separate. Not sure if so important.
Python 2 and Python 3 behave slightly different when it comes to packages (as far as I remember).
All scripts/configs which use the old-style imports like TFUtil.safe_log or TFNetworkLayer.DotLayer should still work in any case (otherwise the wrapper would be wrong).
All scripts/configs which gradually updated the code, e.g. to returnn.tf.layers.DotLayer or so, should also still work, even if we move it then to returnn.tf.layers.base.DotLayer or whatever. The base package returnn.tf.layers would import all others.

Draft about the structure:

Use PEP8 snake case for all file package/module names.
Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)
Move all code files to /returnn/ (or further sub directories, see below).
- Exceptions: rnn.py, setup.py, tools, demos, tests, docs (leave in /)
- Include: extern, extern_private, c_support_code, cuda_implementation
- File name conventions: lower case with _ between words
Move most code from rnn.py to /returnn/__main__.py.
Keep /rnn.py, but mostly just call to /returnn/__main__.py.
Move TF*.py to /returnn/tf/*.py with new name conventions . (Example: import TFEngine becomes import returnn.tf.engine)
Move TFNetwork.py to /returnn/tf/network.py.
Move TFNetworkLayer.py to /returnn/tf/layers/basic.py and also create /returnn/tf/layers/__init__.py.
Move TFNetworkRecLayer.py to /returnn/tf/layers/rec.py.
Move TFNetworkSegModLayer.py to /returnn/tf/layers/segmental_model.py.
Move TFNetworkSigProcLayer.py to /returnn/tf/layers/signal_processing.py.
Move TFNetworkNeuralTransducer.py to /returnn/tf/layers/neural_transducer.py.
Move TFUtil.py to /returnn/tf/util/basic.py.
Move Theano files to /returnn/theano/.
Moving Theano layer files to /returnn/theano/layers/.
Move Op*.py, TwoState*, CTC* and BestPathDecoder* to /returnn/theano/ops/.
Move Util.py to /returnn/util/basic.py.
Move TaskSystem*.py to /returnn/util/task_system_*.py.
Move Debug*.py to /returnn/util/debug_*.py.
Move Sprint*.py (except SprintDataset.py) to /returnn/sprint/*.py.
Move Engine(Base|Batch|Util).py to /returnn/engine/
Move *Dataset.py to /returnn/datasets/
Move Fsa.py to /returnn/util/fsa.py

Code structure:

Use mostly absolute imports (from returnn.tf.layers.base import LayerBase) except maybe a few exceptions for relative imports (from .base import LayerBase).

Other tasks: (Just an overview. Let's not go into too much details for now, unless there are real problems.)

Fix setup.py for the new structure.
Fix tests.
Create /__init__.py to handle a direct import of the repo directory itself. This should just wrap all to /returnn/__init__.py. Probably needs some path trickery.
Setup wrappers that old code/configs don't break. TFUtil, TFNetworkLayers, etc... This could be handled via the lazy loader in /__init__.py or /returnn/__init__.py. The lazy loader would otherwise not be needed anymore.
Test it.

Practical organization:

Finish planning.
Notify relevant people about the upcoming changes, and ask for help on testing.
Create new branch and freeze master (via GitHub branch protection rule).
Rename & move files without changing them (just git mv, such that the Git history stays clean, such that Git blame and co can correctly identify the rename/move of files; see here for a related question).
Apply other needed changes (fixes etc). Iterate by testing.
Merge into master. Ideally shortly after creating the branch, so that master is not frozen too long. Then we can unfreeze it.

ArneNx commented 5 years ago

I understand the first and second point. Concerning the third point, is there any plan to end the support for Python 2 in RETURNN? If this problem would be eliminated by relying on Python 3 conventions for handling packages, is there anything that prevents us from moving at least the layers in separate subfolders?

albertz commented 5 years ago

For me personally, I don't rely on Python 2. But there are many people who still use it (for various reasons). Maybe you just have an old setup with Python 2, and you want to test it again with a recent RETURNN version and some new feature X. Maybe other things break the setup when you switch over to Python 3. So, stopping Python 2 support should have a good reason. I would wait a bit with that. But anyway, maybe with respect to packages, this problem is solvable in some way (I don't remember exactly anymore).

Just moving some of it (like layers) to a sub package already introduce all the mentioned problems.

JackTemaki commented 4 years ago

To revive this issue, I would like to add a new draft to tackle this:

Reasons for Restructuring

With the current file structure it is difficult to navigate trough the code without using the help of an IDE as e.g. PyCharm
Very long files cause lags or interruptions in PyCharm during coding
With the current file structure it is difficult to understand the code, especially it is not visible which files are used for both backends, and which are Theano only.
Many files contain arbitrary code, such as GeneratingDataset.py, which contains Datasets made for tests, but also for specific ASR corpora and also vocabulary functions/classes and feature extraction pipelines.
Large files make it difficult to see which parts exactly are changed in a commit, and when working on different problems --patch needs to be used often to select the relevant parts.

Known Issues

Sprint/RASR Search:

There are multiple approaches used to access the Sprint. Common examples are:

pymod-name                = crnn.SprintInterface
pymod-path                = ./path/to/crnn/..

or

pymod-name                = SprintInterface
pymod-path                = ./path/to/crnn

where crnn is the old repository name of RETURNN. Those variants should continue to work. If the SprintInterface would now be part of a subpackage (e.g. returnn.sprint.SprintInterface), absolute imports may break as the python path will not be set to the correct root directory. [1]

Standalone Execution:

When running RETURNN just from the rnn.py, the old imports need to be still valid, common is e.g.: from Pretrain import WrapEpochValue [2]. Also, gobal variables such as config inside the config need to be present. [3]

RETURNN Scritps:

All scipts that import from RETURNN will have broken dependencies when files are moved to subdirectories, so a root-level __init__.py file needs to make sure that all old imports are still valid when the path is set to /path/to/returnn-git [4], but also when calling import returnn when the path is only /path/to/ and the repo name itself is returnn. [5]

Python 2 Compatibility:

All modifications need to be compatible to Python2, but no issues appeared yet. [6]

Proposal: I have a current test environment which was able to solve the Issues [1-3][5] and [6], but fails on [4].

This changes included:

moving all .py code files except rnn.py to a returnn folder.
creating subpackages for theano and tensorflow related files
creating subpackages for all datasets and sprint related files
update all imports as absolute imports from the root level
keeping the __init__.py containg the lazy loader at the root-level, and adjusting the lazy loader to recursively add all modules, but store them under their name only to ensure that [1] and [5] are working
moving all content except the main function to returnn/main.py, and adding from returnn.main import * in the rnn.py to ensure all content is still accessible.

The test cases used were:

RASR search with python2.7 with three different RETURNN import variants.
Training with SprintExternDataset
Standalone-search with Attention-ASR

Todo:

Finding a solution for issue [4]
Further restructuring of the code
Finding more test setups and checking for those
Ensuring all current test-cases run, as this proposal was made with an older RETURNN version
fixing remaining cyclic dependencies
Fixing the setup.py and pip installation

albertz commented 4 years ago

About the motivation:

With the current file structure it is difficult to navigate trough the code without using the help of an IDE as e.g. PyCharm

I don't think there is much difference though, right? Actually, having a directory structure can make it even harder without an IDE to browse the code, and to get a good overview of the code.

But this is not really the point of restructuring anyway, right? No matter if you use an IDE or not, it would be cleaner. I think this is the only reason here. Which by itself is maybe good enough.

Very long files cause lags or interruptions in PyCharm during coding Large files make it difficult to see which parts exactly are changed in a commit

But this restructuring proposal here is about introducing packages and a directory structure. It is not about splitting up files. Splitting up files is a separate orthogonal thing. We can also do it without this restructuring. (Although, maybe it is nicer with the restructuring.) Or do you want to do both together here? Then I misunderstood the proposal. But this becomes a much bigger task then.

With the current file structure it is difficult to understand the code, especially it is not visible which files are used for both backends, and which are Theano only.

I would argue this is a very minor problem, and once you invested only a minimal amount of time, it is already clear (e.g. check the technological overview part in the documentation).

Many files contain arbitrary code, such as GeneratingDataset.py, which contains Datasets made for tests ...

This again seems like an orthogonal issue. Also, I'm not sure I would agree. GeneratingDatasets is for datasets which are generated. These are not just for unit tests. These are valid synthetic datasets which you often find used in research papers.

albertz commented 4 years ago

About the proposal (and also check my own proposal above):

moving all .py code files except rnn.py to a returnn folder.

Not better src/returnn? But maybe returnn is fine.

Let's see what other Python projects do (e.g. here). Seems that returnn would be more standard.

update all imports as absolute imports from the root level

Or relative?

returnn/main.py

Or returnn/__main__.py?

Also, about the further TODOs:

Finding a solution for issue [4]

Not sure what [4] is exactly (there are no numbers; you mean GeneratingDatasets?). Also not sure why this is so relevant for this issue here. Let's not make it too complicated. Let's do it step by step. Focus on one thing here (which is the restructuring).

Further restructuring of the code

This is a bit ambiguous, and also, as I said, not so relevant. Let's do it step by step. Open separate issues for other things.

Ensuring all current test-cases run, as this proposal was made with an older RETURNN version

What do you mean by "this proposal was made with an older RETURNN version"? I thought this is a proposal for a future RETURNN version?

fixing remaining cyclic dependencies

What exactly? What is the problem?

JackTemaki commented 4 years ago

It is not about splitting up files. Splitting up files is a separate orthogonal thing. We can also do it without this restructuring. (Although, maybe it is nicer with the restructuring.)

I thought that both is the aim: first, introducing a structure, and then splitting the files which are definetly too large, or containing very mixed content that should rather be not in a single file.

I would argue this is a very minor problem, and once you invested only a minimal amount of time

even if you can do it quite fast, it would be cleaner to have it separated.

This again seems like an orthogonal issue

Yes, but once there is a structure it is easier to do a clean separation of things that do not belong together.

GeneratingDatasets is for datasets which are generated.

But this is exactly what I mean, more than 50% of the content is not related to that.

about the proposal (and also check my own proposal above):

I am sorry that this lead to a confusion, this is not a theoretical proposal, but an actual proposal. I applied all those changes already in a local repository, and confirmed that they work under the situations mentioned.

Or relative?

yes, I used absolute and would be in favour of that, but maybe relative also works (or a combination)

Or returnn/main.py?

In this case the file should not be the executable, maybe main is just the wrong name. But this is open for discussion. __main__.py is a reserved name for direct execution, so I am not sure if this should be used here.

Not sure what [4] is exactly

In issues I put numbers behind the most dominant problems, and use the numbers to refer to them.

What do you mean by "this proposal was made with an older RETURNN version"? I thought this is a proposal for a future RETURNN version?

I applied the changes to a RETURNN version of April, and just now looked at this again. I don't think that something important changed until now.

JackTemaki commented 4 years ago

Also, the original Issue was about doing both, structuring and splitting files, as this should be discussed together. I am not sure if it makes sense to look at the problems separately. I think what you mean is that enabling a structure without breaking setups is a first step, and then actually designing the structure and refactoring the files is a second step.

albertz commented 4 years ago

I thought that both is the aim: first, introducing a structure, and then splitting the files which are definetly too large, or containing very mixed content that should rather be not in a single file.

But as you say, these are two steps. Let's not mix them up here. And we should definitely do it step by step, otherwise the change will be too big and too annoying to handle and discuss.

albertz commented 4 years ago

__main__.py is a reserved name for direct execution

See the docs. There is not much special to it. It only allows to run the module via python -m returnn. I would argue that the name main.py does not make much sense if this is supposed to be the main entry point. Then we can directly use it as __main__.py. Right?

JackTemaki commented 4 years ago

Right?

I thought that we keep an rnn.py file at the root level as the entry point (which is definitely needed for backwards compatibility), but we could actually support python -m returnn as well if there is an __main__.py file with a "main" function inside, you are right about that.

JackTemaki commented 4 years ago

fixing remaining cyclic dependencies

What exactly? What is the problem?

I forgot to mention this, while moving the code, at some cases cyclic dependencies were suddenly an issue. I fixed some of those, but there were some remaining cases that are not appearing while running the setups, but appear in the RETURNN tests.

This is an example, but I think this is detail work and does not need to be discussed here:

  File "./returnn_test_version/tests/test_SprintInterface.py", line 12, in <module>
    from returnn.theano_backend.Engine import Engine
  File "./returnn/theano_backend/Engine.py", line 12, in <module>
    from returnn.theano_backend.Network import LayerNetwork
  File "./returnn/theano_backend/Network.py", line 10, in <module>
    from returnn.theano_backend.NetworkLayer import get_layer_class
  File "./returnn/theano_backend/NetworkLayer.py", line 26, in <module>
    _initLayerClasses()
  File "./returnn/theano_backend/NetworkLayer.py", line 13, in _initLayerClasses
    from returnn.theano_backend.NetworkOutputLayer import FramewiseOutputLayer
  File "./returnn/theano_backend/NetworkOutputLayer.py", line 17, in <module>
    from returnn.sprint.SprintErrorSignals import sprint_loss_and_error_signal, SprintAlignmentAutomataOp
  File "./returnn/sprint/SprintErrorSignals.py", line 20, in <module>
    from returnn import TaskSystem, Device
  File "./returnn/Device.py", line 8, in <module>
    from returnn.theano_backend.Network import LayerNetwork
ImportError: cannot import name 'LayerNetwork'

albertz commented 4 years ago

Also, the original Issue was about doing both, structuring and splitting files, as this should be discussed together.

No. Why do you think so?

Also, this would make this issue much more complex, and would require much more planning. This is always very annoying and makes it more complicated than needed. Even the discussion here becomes already way too lengthy because of this.

I am not sure if it makes sense to look at the problems separately.

Why not? These are two independent issues.

I would also say that splitting up the code files can be an iterative process. You don't need to do it all at once. You can do it step by step. It does not matter. So that's why you can also not do it at all for now, and keep this as a separate step. Which makes it much easier, i.e. the discussion here, the planning, etc.

I think what you mean is that enabling a structure without breaking setups is a first step, and then actually designing the structure and refactoring the files is a second step.

Without breaking setups is yet another thing. But this is what we anyway always must have, so that's why I did maybe not mention this explicitly. So this is a must. Both for the restructuring and also for splitting up files. In both cases, some extra care need to be taken for this aspect.

I forgot to mention this, while moving the code, at some cases cyclic dependencies were suddenly an issue.

I don't understand. Why can there suddenly be a cyclic dependency?

Or is this again not about the restructuring but about splitting up files? As I said, let's focus here on just the restructuring, nothing else. Let's do it step by step.

JackTemaki commented 4 years ago

No. Why do you think so?

It was explicitly said: "I would suggest moving the layers into a folder structure similar to the organization in the docu"

Also, this would make this issue much more complex, and would require much more planning.

Yes, I thought this is what we want to do.

I would also say that splitting up the code files can be an iterative process.

This can be dangerous, as for each iteration classes will move. If you do not care about scripts that access classes directly breaking in the process again, this is fine. But I thought you want to avoid this, and move all at once, so that after the "restructuring" everything has its fixed place. Of course this makes it complicated. I would be fine doing this step by step, but this causes problems (possible incompatiblity to setups written during the process) we usually want to avoid.

Without breaking setups is yet another thing. [...] So this is a must.

Of course, I never said something contrary

I don't understand. Why can there suddenly be a cyclic dependency?

This appeared when just moving the files into submodules (no splitting), but this is a minor Issue.

albertz commented 4 years ago

I would also say that splitting up the code files can be an iterative process.

This can be dangerous, as for each iteration classes will move. If you do not care about scripts that access classes directly breaking in the process again, this is fine. But I thought you want to avoid this, and move all at once, so that after the "restructuring" everything has its fixed place. Of course this makes it complicated. I would be fine doing this step by step, but this causes problems (possible incompatiblity to setups written during the process) we usually want to avoid.

This should not happen.

All scripts/configs which use the old-style imports like TFUtil.safe_log or TFNetworkLayer.DotLayer should still work in any case (otherwise the wrapper would be wrong).
All scripts/configs which gradually updated the code, e.g. to returnn.tf.layers.DotLayer or so, should also still work, even if we move it then to returnn.tf.layers.base.DotLayer or whatever. The base package returnn.tf.layers would import all others.

Also, this would make this issue much more complex, and would require much more planning.

Yes, I thought this is what we want to do.

But this is basically an infinite big and complex task which is not possible to finish (at least at once). We have not even started yet. The only proposal discussed here is that code files are moved to returnn/.

Where exactly are the TF files then? I guess returnn/tf/.
Theano files to returnn/theano/.
TFNetworkLayer.py to returnn/tf/layers.py.
... (This so far is what I consider anyway part of this discussion here. But these important parts are missing from the proposal.)

But then, what files to split, what package/mod names, etc etc. There are really lots of things. I really don't see any other way except of doing this gradually.

albertz commented 4 years ago

Btw, another thing which we should plan a bit: How to organize this change? I.e. should it be a PR (based on a branch here), or directly be done in master? The problem is also, all other PRs will likely become somewhat broken. As well as all forks with custom changes. As well as all branches. Also, when this is worked on in a PR/branch, we would need to make sure that no changes are pushed in the meanwhile to master, because otherwise this would diverge.

albertz commented 4 years ago

Well, maybe instead of discussing now too much what exactly should all be part of this issue and the initial change, let's just work on the proposal more (i.e. concretize in detail what we want to change), and then maybe it becomes anyway more clear what we can do right now, and what maybe later.

But I'm still sure that we will not manage to make it perfect and complete in a single change/step. So in any case, we should think about allowing further changes/improvements later on. It would be bad if we design/organize it in a way which would not allow for easy further improvements later on (without breaking things again and again). But as I said, I don't really see a big problem there.

albertz commented 4 years ago

Also, for working on the proposal, relevant people should be involved, or at least take a look at this. I will assign them. (Edit: I assigned maybe more people than necessary. But maybe you are all interested in taking a look at this.)

JackTemaki commented 4 years ago

I really don't see any other way except of doing this gradually.

My initial idea was to reserve a week for all people that are involved in this, and as such keeping the time in which might be new PRs / changes to the master as small as possible. But I guess that is not doable.

Btw, another thing which we should plan a bit: How to organize this change?

My Idea was to have a personal meeting (maybe not physical, but via e.g. Zoom) to work on this things, as discussing all details in text form will take a lot of time. Right now, the number of open pull requests is quite low, and #189 and #285 would not be affected by the changes, so only 2 pull requests will be a problem.

I.e. should it be a PR (based on a branch here)

I think that is the better approach as everyone can do testing without directly changing the master

As well as all forks with custom changes. As well as all branches.

People that did not try to merge for many months/year will have to do new changes anyway, and I think some never intended to merge/rebase. I am not sure about the branches in the main repository, like e.g. PyTorch. But I guess this adds mostly new files.

The only proposal discussed here is that code files are moved to returnn/.

The actual position is only a very minor part, as collecting the test cases and designing the wrapper parts is what I considered the main issue. I did not add this to the proposal, because this is not something that I can decide myself. To test the general capabilities of the wrapper I would do the following changes:

move all files with "Dataset" in the name to "returnn/datasets"
move all files with "Sprint" in the name to "returnn/sprint" (or maybe we rename it to rasr now)
move all files with "TF" in the name to "returnn/tf/"
move all files that are used for theano only to "returnn/theano/"

And not moving the rest of the files. If the files itself are not renamed, the lazy loader just needs to scan the files recursively. If we want to change the names, we might need to create something like a "compat" package containing wrappers with the original file names, or introduce a map in the lazy loader that maps the old module names into new module names. But without testing it I can not tell what is a good idea.

albertz commented 4 years ago

The actual position is only a very minor part, as collecting the test cases and designing the wrapper parts is what I considered the main issue. I did not add this to the proposal, because this is not something that I can decide myself.

I don't understand. This issue here is about the restructuring (i.e. moving the code). This is what we should discuss. This is not a minor part. This is the main part here. I.e. how to restructure the code.

If you think this the actual position is not really so important, we could also simply close the whole issue here with this argument.

All other things are minor. How to write the wrapper, how to fix up some scripts or fix other problems. These are small minor technical details, which are all simple (even trivial) to fix. If you think this is difficult, I can do it. But don't think about this now too much. We can always come up with solutions. This is not relevant now. And also these are hard to discuss all in advance anyway, because we will most likely not think about all possible edge cases. So this is sth we will just see during testing.

move all files with "Sprint" in the name to "returnn/sprint" (or maybe we rename it to rasr now)

This is somewhat contradictory to your other points. Most Sprint related things are either a dataset, or related to Theano, or related to TF. Some modules are also supposed to be imported by Sprint itself (e.g. for the dataset). There is I think only a single module which is imported by Sprint which would both be used for TF and Theano.

And not moving the rest of the files.

I don't understand. You still would move all Python files to /returnn/, right?

The lazy loader in the normal case should not need to be used at all. There should be no magic happening for the normal case. Any magic (like a lazy loader) is always bad (to understand easily what is happening). The lazy loader should only help to recover the unusual cases, which would be broken otherwise. We could even maybe add deprecation warnings whenever the lazy loader is used.

or introduce a map in the lazy loader that maps the old module names into new module names.

Yes, I was thinking about sth like that. It also would become more complex once code files are splitted. A simple mapping (on file level) is not enough then. But the lazy loader can be extended in whatever way (and this can also be done gradually).

What about other files? I guess tools and demos would stay there. tests also moved to /returnn/tests. setup.py and rnn.py would stay in /, but all other code files moved to /returnn/. extern and extern_private also moved to /returnn/. As well as c_support_code and cuda_implementation. docs would stay in /.

JackTemaki commented 4 years ago

I don't understand. This issue here is about the restructuring (i.e. moving the code). This is what we should discuss.

Ok it seems there was a misunderstanding of what is part of this issue and what should be excluded.

The lazy loader in the normal case should not need to be used at all.

Yes, this is only for the Sprint related functions. Maybe we find a solution to remove it completely.

But then lets not discuss any of the implementation points any further here, and stick to the actual structuring:

Updated Proposal

Datasets:

All *Dataset*.py -> returnn/datasets/
BundleFile.py and NormalizationData.py are used only by the StereoDataset, and should be there as well

Engine:

The following files moved to returnn/engine/:

EngineBase.py
EngineBatch.py, although this is also very related to the datasets
EngineUtil.py

Tensorflow:

The following files are moved to returnn/tf/layers:

TFNetworkLayer.py
TFNeuralTransducer.py
TFNetworkRecLayer.py
TFNetworkSegModLayer.py
TFNetworkSigProcLayer.py
All otherTF*.py -> returnn/tf/

Theano:

The following files moved to returnn/theano/:

ActivationFunctions.py
BestPathDecoder.*
CTC.*
Device.py
Engine.py
EngineTask.py
Inv.*
TheanoUtil.py
Updater.py

The following files moved to returnn/theano/ops/:

All Op*.py
FunctionLoader.py
CustomLstmFunctions.py
TwoStateBestPathDecoder.*
TwoStateHMMOP.py

The following files moved to returnn/theano/network/:

All Network*.py

Sprint:

The following files moved to returnn/sprint/:

SprintControl.py, which is used by tensorflow only
SprintErrorSignals.py, which is used by both backends
SprintExternInterface.py
SprintInterface.py

Remaining

Config.py
Debug.py
DebugHelpers.py
HyperParamTuning.py
Log.py
Pretrain.py
TaskSystem.py
Util.py

Also there are some files which look more like tools than core components, such as:

SprintCache.py

The file Server.py looks like it works with Theano only.

The rnn.py functionality can be moved to returnn\__main__.py, but the file itself needs to remain at the root level.

I think it is more common for python projects to have the tests folder at the root level, but I am not sure.

JackTemaki commented 4 years ago

It would be really nice if we could get access to at least one Theano setup. @doetsch @doetsch-apptek maybe?

albertz commented 4 years ago

What naming convention for module names do you want to follow? All lower case and with _ between words? Or CamelCase like we have?

Tensorflow:

The following files are moved to returnn/tf/layers:

TFNetworkLayer.py

TFNeuralTransducer.py

TFNetworkRecLayer.py

TFNetworkSegModLayer.py

TFNetworkSigProcLayer.py

How would you call the files then? I would definitely remove the prefix TF (redundant) as well as the postfix Layer (redundant). Not sure about Network. Maybe the files should anyway be better moved to /returnn/tf/network/? E.g. rec.py? Or RecLayer.py (when inside .../network/)? Or rec_layer.py? And base.py?

The following files moved to returnn/theano/network/:

All Network*.py

But this is inconsistent to the TF part. I think it should be consistent.

Sprint:

The following files moved to returnn/sprint/:

SprintControl.py, which is used by tensorflow only

No, wrong. This is used by the dataset as well.

SprintErrorSignals.py, which is used by both backends

SprintExternInterface.py

SprintInterface.py

How to call these files? I assume we should remove the Sprint prefix. And follow the common naming conventions, i.e. e.g. control.py, error_signals.py, etc.

Remaining

Config.py

Debug.py

DebugHelpers.py

HyperParamTuning.py

Log.py

Pretrain.py

TaskSystem.py

Util.py

Also LearningRateControl.py, and maybe others you forgot.

And where would you put them? Just directly to /returnn/?

Also there are some files which look more like tools than core components, such as:

SprintCache.py

This was copied from some other code or repo (don't remember exactly anymore). I think it is used for SprintCacheDataset.

The file Server.py looks like it works with Theano only.

Yes, possible. But what's the conclusion? Or what is specific here? Should be moved to /returnn/theano/, right? Or what is unclear here?

I think it is more common for python projects to have the tests folder at the root level, but I am not sure.

Yes, ok.

patrick-wilken commented 4 years ago

I would go for "snake_case" for all file names as it is recommended in PEP8. And, if we don't have "TF" and "layer" in the file names anymore, I would vote for avoiding abbreviations as long as the length of the filename stays reasonable (e.g. TFNetworkRecLayer.py -> /returnn/tf/layers/recurrent.py, TFNetworkSigProcLayer.py -> /returnn/tf/layers/signal_processing.py).

Looks like we could need /returnn/util containing what is now Util.py, TaskSystem*.py, and maybe Debug*.py to keep the /returnn directory clean. But things like Config.py and Log.py can simply be put there, I would say.

JackTemaki commented 4 years ago

But this is inconsistent to the TF part. I think it should be consistent.

Good point, then I suggest using also returnn/theano/layers for "Network*Layer.py" files.

But then we would have those files remaining:

Network.py
NetworkCopyUtils.py

No, wrong. This is used by the dataset as well.

I trusted the "find usage" of PyCharm again, yes this was completely wrong

Should be moved to /returnn/theano/, right? Or what is unclear here?

Yes, I just wanted to raise the question

I would vote for avoiding abbreviations as long as the length of the filename stays reasonable

That sounds good.

albertz commented 4 years ago

I would go for "snake_case" for all file names as it is recommended in PEP8. And, if we don't have "TF" and "layer" in the file names anymore, I would vote for avoiding abbreviations as long as the length of the filename stays reasonable (e.g. TFNetworkRecLayer.py -> /returnn/tf/layers/recurrent.py, TFNetworkSigProcLayer.py -> /returnn/tf/layers/signal_processing.py).

Are you sure? Sometimes we still have multiple words in there, and then this could become very long. Or we would sometimes use abbreviations, and sometimes the full names, but this is inconsistent, which is then also somewhat ugly. It also becomes somewhat Java-like then, which is very uncommon in the Python world.

Looks like we could need /returnn/util containing what is now Util.py, TaskSystem*.py, and maybe Debug*.py to keep the /returnn directory clean.

You mean /returnn/util/util.py etc?

But things like Config.py and Log.py can simply be put there, I would say.

You mean to /returnn/config.py and /returnn/log.py?

patrick-wilken commented 4 years ago

Well, the objective should be how easy it is to read. Things like "Op" and "CTC" are clear of course (or at least writing those out wouldn't help), but without having looked into them I honestly don't know what "Inv.py" and "TFNetworkSegModLayer.py" are about. Currently there are not many of those cases anyway. But seg_mod.py? Nah ;P

You mean /returnn/util/util.py etc? You mean to /returnn/config.py and /returnn/log.py?

Yes and yes.

JackTemaki commented 4 years ago

/returnn/util/util.py

That does not look too nice.... would we generally accept to have modules in an __init__.py file, or do we want to avoid that? Otherwise Util.py -> /returnn/util/__init__.py would be ok

Sometimes we still have multiple words in there, and then this could become very long.

We should then think of a different name for those extreme cases (like two_state_best_path_decoder.py, but for everything that can be composed of two or maybe 3 words I would rather use the full words instead of unclear abbreviations.

albertz commented 4 years ago

Well, the objective should be how easy it is to read. Things like "Op" and "CTC" are clear of course (or at least writing those out wouldn't help), but without having looked into them I honestly don't know what "Inv.py" and "TFNetworkSegModLayer.py" are about. Currently there are not many of those cases anyway. But seg_mod.py? Nah ;P

But the problem here is not so much the abbreviation. Having "inverted" instead of "inv", or "segmental" instead of "seg", or "module" instead of "mod", that doesn't really change much, or does it? I think the names itself might be better chosen, but I still would keep using very standard abbreviations ("inv", "seg", "mod", etc). Actually your examples are I think mostly for outdated/unused code anyway.

albertz commented 4 years ago

/returnn/util/util.py

That does not look too nice.... would we generally accept to have modules in an __init__.py file, or do we want to avoid that? Otherwise Util.py -> /returnn/util/__init__.py would be ok

No, I think we should not put too much code into __init__.py. Maybe /returnn/util/basic.py instead?

... I would rather use the full words instead of unclear abbreviations.

Yea, sure. I was anyway speaking mostly about clear and very common abbreviations.

JackTemaki commented 4 years ago

or "module" instead of "mod"

This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".

And even inverse.py would be a bad name... inverse of what? Inverse model? But then inverse model with respect to? To be honest, even when looking shortly over the code (e.g. class names) I have no idea what it does, as there is also AlignOp and InvAlignOp. Of course this is because of bad naming and non-existing docstrings, but the file name can be better as well.

albertz commented 4 years ago

This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".

Well, "mod" is definitely not common for "model", but it is very common for "module", esp in the context of Python. I would say, for Python code, having "mod" for "module" is always totally ok. I.e., it is not ok for "model", so "model" should be written out. I think this is a pretty straight-forward case.

inverse.py would be a bad name... inverse of what? Inverse model?

You can extend it, but this is not about the abbreviation here. Call it inv_model.py or so. (I don't really know what this is about. It's anyway too specific, and does not belong to RETURNN at all, just as well as the segmental model, so these are anyway very bad examples.)

I would still say, for common abbreviations, we should definitely keep using the abbreviations. util is also such an example. Even returnn itself. Or tf. Or config. It does not make sense to write out these things. In the same manner, I would say that rec is just as clear. Or sig. Or proc. Or param.

This is really pretty standard in Python code, and I think we should stick to Python conventions. E.g. Python uses def, init, str, int, repr, etc.

So, this effectively becomes more a discussion now about which are the cases which are ok. Let's also not put too much priority on these outdated code files which are anyway very bad examples.

Also, this is a question about consistency. Should we use both "rec" and "recurrent". Or only one of these? RecurrentLayer but _SubnetworkRecWrappedLoss? (Or maybe to a certain degree, it's not too much a problem... I guess you will find both "param" and "parameter" somewhere in the code...)

doetsch-apptek commented 4 years ago

It would be really nice if we could get access to at least one Theano setup. @doetsch @doetsch-apptek maybe?

What do you mean with theano setup? A training setup? I can provide you with that

there appears to be some work on support for a pytorch backend

Yes, but does it bother you in this branch? If it does then we can delete it, I still have a local copy of this backend and nobody is using it anyway (I just wouldn't want to delete it completely).

In general I think is a good thing to do this code restructuring. We could also make the theano RETURNN a standalone version which people have to checkout explicitly, while the actual RETURNN version only contains the TF backend.

pavelgolik commented 4 years ago

This is also a good opportunity to unify the use of underscores and dashes under tools/.

albertz commented 4 years ago

This is also a good opportunity to unify the use of underscores and dashes under tools/.

Not sure about that. We are very careful here to not break old configs and scripts. This would likely break some scripts. We could add symlinks for old name -> new name for the tools. But not sure if this is really nicer than just leaving them as they are.

JackTemaki commented 4 years ago

I just saw that these two points are still open in the draft:

Move TFNetworkLayer.py to /returnn/tf/layers.py? Or /returnn/tf/layers/base.py and also create /returnn/tf/layers/__init__.py?. Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py.

I would definitely favour to use /returnn/tf/layers/base.py and /returnn/tf/layers/rec.py, as we want to support future splitting of the layer files.

JackTemaki commented 4 years ago

There are still some open questions:

Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py?
Moving Theano layer files to /returnn/theano/layers?
Unify tool names, and keep compatibility with symlinks?
Move Util.py to /returnn/util/basic.py?
Move TaskSystem*.py to /returnn/util/task_system_*.py?
Move Debug*.py to /returnn/util/debug_*.py?
Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)

About the last point, there are currently inconsistencies (NetworkRecurrentLayer vs TFNetworkRecLayer), so this needs to be resolved. The layer modules will have shorter names anyways, as they are moved to returnn/backend/layers, so writing the files in question without abbreviations (segment_model.py, signal_processing.py and recurrent.py) would not be an issue.

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

albertz commented 4 years ago

So, I guess we agreed on TFNetworkLayer.py to /returnn/tf/layers/base.py. Or maybe /returnn/tf/layers/basic.py to be consistent with Util.py below? What about TFNetwork.py?

Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py?

I vote for the latter.

Moving Theano layer files to /returnn/theano/layers?

Yes. In any case consistent to the TF structure.

Unify tool names, and keep compatibility with symlinks?

I would vote for no. This would just make it even more chaotic. But in any case, this can also be done (and discussed) independently from this issue, at some later point.

Move Util.py to /returnn/util/basic.py?

Ok.

Move TaskSystem*.py to /returnn/util/task_system_*.py?

Ok.

Move Debug*.py to /returnn/util/debug_*.py?

Ok.

Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)

Let's just be specific and discuss for the cases we currently have here. I guess "rec" is fine. What else needs to be discussed?

About the last point, there are currently inconsistencies (NetworkRecurrentLayer vs TFNetworkRecLayer), so this needs to be resolved.

There is never a must. Maybe it would be nice to be consistent.

But this will anyway not be possible 100%. E.g. you will find sometimes "net", sometimes "network", used for variable names, class names, comments, whatever... But I think that's ok. Let's not be too pedantic.

In general, I prefer short names.

The layer modules will have shorter names anyways, as they are moved to returnn/backend/layers, so writing the files in question without abbreviations (segment_model.py, signal_processing.py and recurrent.py) would not be an issue.

They are not shorter, the complete names are in fact longer now (esp when we mostly use absolute imports).

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

Which files are moved there?

JackTemaki commented 4 years ago

So, I guess we agreed on TFNetworkLayer.py to /returnn/tf/layers/base.py. Or maybe /returnn/tf/layers/basic.py to be consistent with Util.py below?

Yes, I would say basic.py

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

Which files are moved there?

For Theano we have all Op* files, for Tensorflow there is only TFNativeOp so far, but maybe we want to have a package for future additions and keep the Theano/TF structure consistent. But of course this could be postponed to a later point.

Do we put SprintDataset.py to /returnn/sprint/ or to /returnn/datasets/?

albertz commented 4 years ago

Ok to basic.py.

ops is a bit more complicated. The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF. (This is actually a very nice feature. I abstracted this away when I ported the code from Theano to TF. Even for Theano, it was already generic in the way that it auto-created the ops both for CPU and GPU.) (The file actually included Theano code up until recently, but that was just because I did not clean it up properly and moved it to a separate file. I did that now. All Theano related code is in TheanoNativeOp.py.)

Do we put SprintDataset.py to /returnn/sprint/ or to /returnn/datasets/?

This is a question of dependencies also. Code becomes much easier to maintain if dependencies between each other (e.g. at the level of packages or modules) are kept minimal, and also only in one direction if possible.

E.g. Util.py (and thus everything in /returnn/util/) is supposed to be independent from anything else in RETURNN. Config.py and Log.py as well (although they might depend on util).

In some cases, I also like to keep the code even independent from Config and Log, if possible, such that it is also useful outside and independent of RETURNN. E.g. TFUtil.py is written in such a way.

datasets is also very independent, except that it might depend on Config, Log and Util. And depending on the dataset, on further optional external dependencies, like h5py or librosa.

I think it is most consistent if SprintDataset.py would be moved to /returnn/datasets/sprint.py, and this particular dataset has then additional dependencies on /returnn/sprint.

The stuff in /returnn/sprint is also very independent, except maybe Config and Log.

JackTemaki commented 4 years ago

The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF.

But the class NativeOp itself is Theano dependent, right? So it would be nice to separate this in the future. But here it does not matter yet, because the question was about the Theano specific Op*.py files (including TwoState*.py) and TFNativeOp.py.

I think it is most consistent if SprintDataset.py would be moved to /returnn/datasets/sprint.py

Ok good, I would have favoured that as well.

albertz commented 4 years ago

The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF.

But the class NativeOp itself is Theano dependent, right? So it would be nice to separate this in the future.

This is already separated.

But here it does not matter yet, because the question was about the Theano specific Op*.py files (including TwoState*.py) and TFNativeOp.py.

Yea, ok. But I would anyway probably not invest too much energy into big restructuring efforts for the Theano part.

For the TF part, there is only TFNativeOp.py (as a separate file) currently. Most similar ops are supposed to be implemented in this very generic way (independent of TF).

There are also some very small TF specific ops (only couple of lines of code) directly in TFUtil.py. Not sure if we should mix them with the other ops. TFNativeOp.py is conceptually different. Putting it together would only mix up things, and introduce further dependencies (e.g. ops in TFUtil.py are totally independent of anything from RETURNN; TFNativeOp.py depends on NativeOp.py).

So I'm not exactly sure what you suggest now.

NativeOp.py to /returnn/native_op.py?
TFNativeOp.py to /returnn/tf/ops/native.py? But that would be the only file in .../tf/ops at the moment. Also not sure if it make sense at some later point to move stuff from TFUtil.py here. Because we should try to keep independent code independent. Like .../tf/util. Actually it would be good to move more code there, not the other way around. TFNativeOp.py cannot be moved there, though, because that has the dependency on NativeOp.py. Or we ignore that. Or: TFNativeOp.py to /returnn/tf/util/ops/native.py? Given these considerations, I think I would actually prefer: TFNativeOp.py to /returnn/tf/native_op.py.

JackTemaki commented 4 years ago

This is already separated.

It did not look like that to me, otherwise the if BackendEngine.is_theano_selected(): would not be needed.

Yea, ok. But I would anyway probably not invest too much energy into big restructuring efforts for the Theano part.

This was only about getting the files away from the root, and it does not hurt moving them to /returnn/theano/ops/.

So I'm not exactly sure what you suggest now.

I did not suggest anything specific, it was only a question about having "ops" folders.

NativeOp.py to /returnn/native_op.py

Ok

Given these considerations, I think I would actually prefer: TFNativeOp.py to /returnn/tf/native_op.py

Ok

albertz commented 4 years ago

This is already separated.

It did not look like that to me, otherwise the if BackendEngine.is_theano_selected(): would not be needed.

This is exactly the code which separates it. Anyway, are you looking at the latest code? As said, I already moved this to a separate file.

This was only about getting the files away from the root, and it does not hurt moving them to /returnn/theano/ops/.

Ok.

JackTemaki commented 4 years ago

Anyway, are you looking at the latest code? As said, I already moved this to a separate file.

Ah, you just pushed it yesterday ok.

I updated the draft accordingly.

The remaining files should now be:

Fsa.py, it looks like a tool to me, so move to /tools/?
External.py as the name implies it is not used inside RETURNN, so maybe it can just be kept in /returnn/
NetworkStream.py I assume this is Theano only, so this can go into /returnn/theano/? or better /returnn/util/?

albertz commented 4 years ago

The remaining files should now be:

Fsa.py, it looks like a tool to me, so move to /tools/?

No, it's not a tool. It's a lib which provides some Fsa stuff. Which is also used in some other parts (e.g. NativeOp).

Some Python files additional have a small if __name__ == "__main__" part. In most cases, this is just for testing. Although sometimes it provides also some useful functionality. E.g. LearningRateControl.py or LmDataset.py are such examples. I'm not sure if it will still be possible to call these files directly after our restructuring. Maybe with some hacks or tricks. This is one of the problems I mentioned very initially about the restructuring.

External.py as the name implies it is not used inside RETURNN, so maybe it can just be kept in /returnn/

I'm pretty sure it is used somewhere. I think by the dataset. If it is not used at all, it can be deleted. It looks like this should be in Util.py. At least very similar things (Numpy utils) are all in Util.py. (Later it might make sense to split that up into Numpy stuff and other stuff.)

So maybe move to /returnn/util/external.py for now, and later restructure in a better way.

Or, before we restructure, we move the code right now already to Util.py, then this is cleaner.

Edit I just checked, actually it looks like this is not used at all. So it can be deleted. (Not sure why you say it should be kept.) Edit Done, deleted.

NetworkStream.py I assume this is Theano only, so this can go into /returnn/theano/? or better /returnn/util/?

I have no idea. This also looks broken. At least this is Python 2 only. Maybe completely delete the file? Edit I just checked, this is used by Device, i.e. by Theano code. So just move to /returnn/theano/.

JackTemaki commented 4 years ago

Maybe with some hacks or tricks.

We could e.g. add the returnn package to the path with the sys module when it is run as standalone.

No, it's not a tool. It's a lib which provides some Fsa stuff. Which is also used in some other parts (e.g. NativeOp).

Yes, but it seems there is code that is only used for the "demo", and the rest (maybe smaller part) is used only in TFNativeOp.py. But we can leave this for now and split this later if possible.

Is there anything missing now with respect to the structure?

albertz commented 4 years ago

Maybe with some hacks or tricks.

We could e.g. add the returnn package to the path with the sys module when it is run as standalone.

This is not enough. You would also need sth like this:

__path__ = [os.getcwd()]
__name__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0]  # not sure...
__package__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0]
sys.modules[__name__] = sys.modules["__main__"]

Not sure if this is enough then, and works like this... (See e.g. here.) But not so important now. We can make that work somehow.

Is there anything missing now with respect to the structure?

There are still some question marks in the proposal. And besides them, not sure if we forgot about sth.

JackTemaki commented 4 years ago

But not so important now. We can make that work somehow.

Ok, then lets move it to /returnn/util/.

The remaining question marks are:

Move TFNetworkSegModLayer.py to /returnn/tf/layers/seg_model.py?
Move TFNetworkSigProcLayer.py to /returnn/tf/layers/sig_proc.py?
Move Engine(Base|Batch|Util).py to /returnn/engine/?

I do not really like sig_proc.py, I would call it signal_processing.py or just signal.py if it has to be short. seg_model.py could also just be segmentation.py.

albertz commented 4 years ago

But not so important now. We can make that work somehow.

Ok, then lets move it to /returnn/util/.

Ok.

The remaining question marks are:

Move TFNetworkSegModLayer.py to /returnn/tf/layers/seg_model.py?

Move TFNetworkSigProcLayer.py to /returnn/tf/layers/sig_proc.py?

Move Engine(Base|Batch|Util).py to /returnn/engine/?

I do not really like sig_proc.py, I would call it signal_processing.py or just signal.py if it has to be short.

In what possible way can you misunderstand sig_proc.py? "sig" is always "signal", "proc" is always "process" or "processing". I think this is pretty clear, right? Just signal.py is bad, because this would be unclear what this means. There are many things about signals, e.g. the OS signals. There is even an official Python module with that name (which also makes heavy use of the "sig" abbreviation).

seg_model.py could also just be segmentation.py.

Just segmentation.py would be bad, because this sounds as if this is something very general about segmentation, while in fact this is very specific (way too specific actually) about the segmental model (and also only one specific kind of it). You say that "seg" is really so unclear? (For those people who don't know about segmental models, I don't think that "segmental_model" is really more helpful than "seg_model".)

rwth-i6 / returnn

Suggestions for discussing a more organized code structure #162