Closed ArneNx closed 4 years ago
Example projects (for examples of structuring):
/tensorflow/
, tests alongside with code in *_test.py
files/src/nlp/
, tests in /tests/
/allennlp/
, tests in /tests/
I was planning this since a while. But this is a bit non-trivial. There are multiple cases you have to think about:
When RASR imports RETURNN, it does so by importing the SprintInterface.py
module (or some of the others). I am not sure if it will correctly work if this is now part of a Python package (submodule). Esp think about relative imports or so.
The main entry point (rnn.py
) cannot be part of the package itself. So it has to be separate. Not sure if so important.
Python 2 and Python 3 behave slightly different when it comes to packages (as far as I remember).
All scripts/configs which use the old-style imports like TFUtil.safe_log
or TFNetworkLayer.DotLayer
should still work in any case (otherwise the wrapper would be wrong).
All scripts/configs which gradually updated the code, e.g. to returnn.tf.layers.DotLayer
or so, should also still work, even if we move it then to returnn.tf.layers.base.DotLayer
or whatever. The base package returnn.tf.layers
would import all others.
Draft about the structure:
/returnn/
(or further sub directories, see below).
rnn.py
, setup.py
, tools
, demos
, tests
, docs
(leave in /
)extern
, extern_private
, c_support_code
, cuda_implementation
_
between wordsrnn.py
to /returnn/__main__.py
./rnn.py
, but mostly just call to /returnn/__main__.py
.TF*.py
to /returnn/tf/*.py
with new name conventions . (Example: import TFEngine
becomes import returnn.tf.engine
)TFNetwork.py
to /returnn/tf/network.py
.TFNetworkLayer.py
to /returnn/tf/layers/basic.py
and also create /returnn/tf/layers/__init__.py
.TFNetworkRecLayer.py
to /returnn/tf/layers/rec.py
.TFNetworkSegModLayer.py
to /returnn/tf/layers/segmental_model.py
.TFNetworkSigProcLayer.py
to /returnn/tf/layers/signal_processing.py
.TFNetworkNeuralTransducer.py
to /returnn/tf/layers/neural_transducer.py
.TFUtil.py
to /returnn/tf/util/basic.py
./returnn/theano/
./returnn/theano/layers/
.Op*.py
, TwoState*
, CTC*
and BestPathDecoder*
to /returnn/theano/ops/
.Util.py
to /returnn/util/basic.py
.TaskSystem*.py
to /returnn/util/task_system_*.py
.Debug*.py
to /returnn/util/debug_*.py
.Sprint*.py
(except SprintDataset.py
) to /returnn/sprint/*.py
.Engine(Base|Batch|Util).py
to /returnn/engine/
*Dataset.py
to /returnn/datasets/
Fsa.py
to /returnn/util/fsa.py
Code structure:
from returnn.tf.layers.base import LayerBase
) except maybe a few exceptions for relative imports (from .base import LayerBase
).Other tasks: (Just an overview. Let's not go into too much details for now, unless there are real problems.)
setup.py
for the new structure./__init__.py
to handle a direct import of the repo directory itself.
This should just wrap all to /returnn/__init__.py
. Probably needs some path trickery.TFUtil
, TFNetworkLayers
, etc...
This could be handled via the lazy loader in /__init__.py
or /returnn/__init__.py
.
The lazy loader would otherwise not be needed anymore.Practical organization:
git mv
, such that the Git history stays clean, such that Git blame and co can correctly identify the rename/move of files; see here for a related question).I understand the first and second point. Concerning the third point, is there any plan to end the support for Python 2 in RETURNN? If this problem would be eliminated by relying on Python 3 conventions for handling packages, is there anything that prevents us from moving at least the layers in separate subfolders?
For me personally, I don't rely on Python 2. But there are many people who still use it (for various reasons). Maybe you just have an old setup with Python 2, and you want to test it again with a recent RETURNN version and some new feature X. Maybe other things break the setup when you switch over to Python 3. So, stopping Python 2 support should have a good reason. I would wait a bit with that. But anyway, maybe with respect to packages, this problem is solvable in some way (I don't remember exactly anymore).
Just moving some of it (like layers) to a sub package already introduce all the mentioned problems.
To revive this issue, I would like to add a new draft to tackle this:
Reasons for Restructuring
GeneratingDataset.py
, which contains Datasets made for tests, but also for specific ASR corpora and also vocabulary functions/classes and feature extraction pipelines.--patch
needs to be used often to select the relevant parts.Known Issues
There are multiple approaches used to access the Sprint. Common examples are:
pymod-name = crnn.SprintInterface
pymod-path = ./path/to/crnn/..
or
pymod-name = SprintInterface
pymod-path = ./path/to/crnn
where crnn is the old repository name of RETURNN. Those variants should continue to work. If the SprintInterface would now be part of a subpackage (e.g. returnn.sprint.SprintInterface
), absolute imports may break as the python path will not be set to the correct root directory. [1]
When running RETURNN just from the rnn.py
, the old imports need to be still valid, common is e.g.:
from Pretrain import WrapEpochValue
[2]. Also, gobal variables such as config
inside the config need to be present. [3]
All scipts that import from RETURNN will have broken dependencies when files are moved to subdirectories, so a root-level __init__.py
file needs to make sure that all old imports are still valid when the path is set to /path/to/returnn-git
[4], but also when calling import returnn
when the path is only /path/to/
and the repo name itself is returnn. [5]
All modifications need to be compatible to Python2, but no issues appeared yet. [6]
Proposal: I have a current test environment which was able to solve the Issues [1-3][5] and [6], but fails on [4].
This changes included:
returnn
folder.__init__.py
containg the lazy loader at the root-level, and adjusting the lazy loader to recursively add all modules, but store them under their name only to ensure that [1] and [5] are workingmain
function to returnn/main.py
, and adding from returnn.main import *
in the rnn.py
to ensure all content is still accessible.The test cases used were:
Todo:
About the motivation:
With the current file structure it is difficult to navigate trough the code without using the help of an IDE as e.g. PyCharm
I don't think there is much difference though, right? Actually, having a directory structure can make it even harder without an IDE to browse the code, and to get a good overview of the code.
But this is not really the point of restructuring anyway, right? No matter if you use an IDE or not, it would be cleaner. I think this is the only reason here. Which by itself is maybe good enough.
Very long files cause lags or interruptions in PyCharm during coding Large files make it difficult to see which parts exactly are changed in a commit
But this restructuring proposal here is about introducing packages and a directory structure. It is not about splitting up files. Splitting up files is a separate orthogonal thing. We can also do it without this restructuring. (Although, maybe it is nicer with the restructuring.) Or do you want to do both together here? Then I misunderstood the proposal. But this becomes a much bigger task then.
With the current file structure it is difficult to understand the code, especially it is not visible which files are used for both backends, and which are Theano only.
I would argue this is a very minor problem, and once you invested only a minimal amount of time, it is already clear (e.g. check the technological overview part in the documentation).
Many files contain arbitrary code, such as GeneratingDataset.py, which contains Datasets made for tests ...
This again seems like an orthogonal issue. Also, I'm not sure I would agree. GeneratingDatasets is for datasets which are generated. These are not just for unit tests. These are valid synthetic datasets which you often find used in research papers.
About the proposal (and also check my own proposal above):
moving all .py code files except rnn.py to a
returnn
folder.
Not better src/returnn
? But maybe returnn
is fine.
Let's see what other Python projects do (e.g. here).
Seems that returnn
would be more standard.
update all imports as absolute imports from the root level
Or relative?
returnn/main.py
Or returnn/__main__.py
?
Also, about the further TODOs:
Finding a solution for issue [4]
Not sure what [4] is exactly (there are no numbers; you mean GeneratingDatasets?). Also not sure why this is so relevant for this issue here. Let's not make it too complicated. Let's do it step by step. Focus on one thing here (which is the restructuring).
Further restructuring of the code
This is a bit ambiguous, and also, as I said, not so relevant. Let's do it step by step. Open separate issues for other things.
Ensuring all current test-cases run, as this proposal was made with an older RETURNN version
What do you mean by "this proposal was made with an older RETURNN version"? I thought this is a proposal for a future RETURNN version?
fixing remaining cyclic dependencies
What exactly? What is the problem?
It is not about splitting up files. Splitting up files is a separate orthogonal thing. We can also do it without this restructuring. (Although, maybe it is nicer with the restructuring.)
I thought that both is the aim: first, introducing a structure, and then splitting the files which are definetly too large, or containing very mixed content that should rather be not in a single file.
I would argue this is a very minor problem, and once you invested only a minimal amount of time
even if you can do it quite fast, it would be cleaner to have it separated.
This again seems like an orthogonal issue
Yes, but once there is a structure it is easier to do a clean separation of things that do not belong together.
GeneratingDatasets is for datasets which are generated.
But this is exactly what I mean, more than 50% of the content is not related to that.
about the proposal (and also check my own proposal above):
I am sorry that this lead to a confusion, this is not a theoretical proposal, but an actual proposal. I applied all those changes already in a local repository, and confirmed that they work under the situations mentioned.
Or relative?
yes, I used absolute and would be in favour of that, but maybe relative also works (or a combination)
Or returnn/main.py?
In this case the file should not be the executable, maybe main is just the wrong name. But this is open for discussion. __main__.py
is a reserved name for direct execution, so I am not sure if this should be used here.
Not sure what [4] is exactly
In issues I put numbers behind the most dominant problems, and use the numbers to refer to them.
What do you mean by "this proposal was made with an older RETURNN version"? I thought this is a proposal for a future RETURNN version?
I applied the changes to a RETURNN version of April, and just now looked at this again. I don't think that something important changed until now.
Also, the original Issue was about doing both, structuring and splitting files, as this should be discussed together. I am not sure if it makes sense to look at the problems separately. I think what you mean is that enabling a structure without breaking setups is a first step, and then actually designing the structure and refactoring the files is a second step.
I thought that both is the aim: first, introducing a structure, and then splitting the files which are definetly too large, or containing very mixed content that should rather be not in a single file.
But as you say, these are two steps. Let's not mix them up here. And we should definitely do it step by step, otherwise the change will be too big and too annoying to handle and discuss.
__main__.py
is a reserved name for direct execution
See the docs. There is not much special to it. It only allows to run the module via python -m returnn
.
I would argue that the name main.py
does not make much sense if this is supposed to be the main entry point. Then we can directly use it as __main__.py
.
Right?
Right?
I thought that we keep an rnn.py
file at the root level as the entry point (which is definitely needed for backwards compatibility), but we could actually support python -m returnn
as well if there is an __main__.py
file with a "main" function inside, you are right about that.
fixing remaining cyclic dependencies
What exactly? What is the problem?
I forgot to mention this, while moving the code, at some cases cyclic dependencies were suddenly an issue. I fixed some of those, but there were some remaining cases that are not appearing while running the setups, but appear in the RETURNN tests.
This is an example, but I think this is detail work and does not need to be discussed here:
File "./returnn_test_version/tests/test_SprintInterface.py", line 12, in <module>
from returnn.theano_backend.Engine import Engine
File "./returnn/theano_backend/Engine.py", line 12, in <module>
from returnn.theano_backend.Network import LayerNetwork
File "./returnn/theano_backend/Network.py", line 10, in <module>
from returnn.theano_backend.NetworkLayer import get_layer_class
File "./returnn/theano_backend/NetworkLayer.py", line 26, in <module>
_initLayerClasses()
File "./returnn/theano_backend/NetworkLayer.py", line 13, in _initLayerClasses
from returnn.theano_backend.NetworkOutputLayer import FramewiseOutputLayer
File "./returnn/theano_backend/NetworkOutputLayer.py", line 17, in <module>
from returnn.sprint.SprintErrorSignals import sprint_loss_and_error_signal, SprintAlignmentAutomataOp
File "./returnn/sprint/SprintErrorSignals.py", line 20, in <module>
from returnn import TaskSystem, Device
File "./returnn/Device.py", line 8, in <module>
from returnn.theano_backend.Network import LayerNetwork
ImportError: cannot import name 'LayerNetwork'
Also, the original Issue was about doing both, structuring and splitting files, as this should be discussed together.
No. Why do you think so?
Also, this would make this issue much more complex, and would require much more planning. This is always very annoying and makes it more complicated than needed. Even the discussion here becomes already way too lengthy because of this.
I am not sure if it makes sense to look at the problems separately.
Why not? These are two independent issues.
I would also say that splitting up the code files can be an iterative process. You don't need to do it all at once. You can do it step by step. It does not matter. So that's why you can also not do it at all for now, and keep this as a separate step. Which makes it much easier, i.e. the discussion here, the planning, etc.
I think what you mean is that enabling a structure without breaking setups is a first step, and then actually designing the structure and refactoring the files is a second step.
Without breaking setups is yet another thing. But this is what we anyway always must have, so that's why I did maybe not mention this explicitly. So this is a must. Both for the restructuring and also for splitting up files. In both cases, some extra care need to be taken for this aspect.
I forgot to mention this, while moving the code, at some cases cyclic dependencies were suddenly an issue.
I don't understand. Why can there suddenly be a cyclic dependency?
Or is this again not about the restructuring but about splitting up files? As I said, let's focus here on just the restructuring, nothing else. Let's do it step by step.
No. Why do you think so?
It was explicitly said: "I would suggest moving the layers into a folder structure similar to the organization in the docu"
Also, this would make this issue much more complex, and would require much more planning.
Yes, I thought this is what we want to do.
I would also say that splitting up the code files can be an iterative process.
This can be dangerous, as for each iteration classes will move. If you do not care about scripts that access classes directly breaking in the process again, this is fine. But I thought you want to avoid this, and move all at once, so that after the "restructuring" everything has its fixed place. Of course this makes it complicated. I would be fine doing this step by step, but this causes problems (possible incompatiblity to setups written during the process) we usually want to avoid.
Without breaking setups is yet another thing. [...] So this is a must.
Of course, I never said something contrary
I don't understand. Why can there suddenly be a cyclic dependency?
This appeared when just moving the files into submodules (no splitting), but this is a minor Issue.
I would also say that splitting up the code files can be an iterative process.
This can be dangerous, as for each iteration classes will move. If you do not care about scripts that access classes directly breaking in the process again, this is fine. But I thought you want to avoid this, and move all at once, so that after the "restructuring" everything has its fixed place. Of course this makes it complicated. I would be fine doing this step by step, but this causes problems (possible incompatiblity to setups written during the process) we usually want to avoid.
This should not happen.
TFUtil.safe_log
or TFNetworkLayer.DotLayer
should still work in any case (otherwise the wrapper would be wrong).returnn.tf.layers.DotLayer
or so, should also still work, even if we move it then to returnn.tf.layers.base.DotLayer
or whatever. The base package returnn.tf.layers
would import all others.Also, this would make this issue much more complex, and would require much more planning.
Yes, I thought this is what we want to do.
But this is basically an infinite big and complex task which is not possible to finish (at least at once).
We have not even started yet. The only proposal discussed here is that code files are moved to returnn/
.
returnn/tf/
.returnn/theano/
.TFNetworkLayer.py
to returnn/tf/layers.py
.But then, what files to split, what package/mod names, etc etc. There are really lots of things. I really don't see any other way except of doing this gradually.
Btw, another thing which we should plan a bit: How to organize this change? I.e. should it be a PR (based on a branch here), or directly be done in master? The problem is also, all other PRs will likely become somewhat broken. As well as all forks with custom changes. As well as all branches. Also, when this is worked on in a PR/branch, we would need to make sure that no changes are pushed in the meanwhile to master, because otherwise this would diverge.
Well, maybe instead of discussing now too much what exactly should all be part of this issue and the initial change, let's just work on the proposal more (i.e. concretize in detail what we want to change), and then maybe it becomes anyway more clear what we can do right now, and what maybe later.
But I'm still sure that we will not manage to make it perfect and complete in a single change/step. So in any case, we should think about allowing further changes/improvements later on. It would be bad if we design/organize it in a way which would not allow for easy further improvements later on (without breaking things again and again). But as I said, I don't really see a big problem there.
Also, for working on the proposal, relevant people should be involved, or at least take a look at this. I will assign them. (Edit: I assigned maybe more people than necessary. But maybe you are all interested in taking a look at this.)
I really don't see any other way except of doing this gradually.
My initial idea was to reserve a week for all people that are involved in this, and as such keeping the time in which might be new PRs / changes to the master as small as possible. But I guess that is not doable.
Btw, another thing which we should plan a bit: How to organize this change?
My Idea was to have a personal meeting (maybe not physical, but via e.g. Zoom) to work on this things, as discussing all details in text form will take a lot of time. Right now, the number of open pull requests is quite low, and #189 and #285 would not be affected by the changes, so only 2 pull requests will be a problem.
I.e. should it be a PR (based on a branch here)
I think that is the better approach as everyone can do testing without directly changing the master
As well as all forks with custom changes. As well as all branches.
People that did not try to merge for many months/year will have to do new changes anyway, and I think some never intended to merge/rebase. I am not sure about the branches in the main repository, like e.g. PyTorch. But I guess this adds mostly new files.
The only proposal discussed here is that code files are moved to returnn/.
The actual position is only a very minor part, as collecting the test cases and designing the wrapper parts is what I considered the main issue. I did not add this to the proposal, because this is not something that I can decide myself. To test the general capabilities of the wrapper I would do the following changes:
And not moving the rest of the files. If the files itself are not renamed, the lazy loader just needs to scan the files recursively. If we want to change the names, we might need to create something like a "compat" package containing wrappers with the original file names, or introduce a map in the lazy loader that maps the old module names into new module names. But without testing it I can not tell what is a good idea.
The actual position is only a very minor part, as collecting the test cases and designing the wrapper parts is what I considered the main issue. I did not add this to the proposal, because this is not something that I can decide myself.
I don't understand. This issue here is about the restructuring (i.e. moving the code). This is what we should discuss. This is not a minor part. This is the main part here. I.e. how to restructure the code.
If you think this the actual position is not really so important, we could also simply close the whole issue here with this argument.
All other things are minor. How to write the wrapper, how to fix up some scripts or fix other problems. These are small minor technical details, which are all simple (even trivial) to fix. If you think this is difficult, I can do it. But don't think about this now too much. We can always come up with solutions. This is not relevant now. And also these are hard to discuss all in advance anyway, because we will most likely not think about all possible edge cases. So this is sth we will just see during testing.
move all files with "Sprint" in the name to "returnn/sprint" (or maybe we rename it to rasr now)
This is somewhat contradictory to your other points. Most Sprint related things are either a dataset, or related to Theano, or related to TF. Some modules are also supposed to be imported by Sprint itself (e.g. for the dataset). There is I think only a single module which is imported by Sprint which would both be used for TF and Theano.
And not moving the rest of the files.
I don't understand. You still would move all Python files to /returnn/
, right?
The lazy loader in the normal case should not need to be used at all. There should be no magic happening for the normal case. Any magic (like a lazy loader) is always bad (to understand easily what is happening). The lazy loader should only help to recover the unusual cases, which would be broken otherwise. We could even maybe add deprecation warnings whenever the lazy loader is used.
or introduce a map in the lazy loader that maps the old module names into new module names.
Yes, I was thinking about sth like that. It also would become more complex once code files are splitted. A simple mapping (on file level) is not enough then. But the lazy loader can be extended in whatever way (and this can also be done gradually).
What about other files? I guess tools
and demos
would stay there. tests
also moved to /returnn/tests
. setup.py
and rnn.py
would stay in /
, but all other code files moved to /returnn/
. extern
and extern_private
also moved to /returnn/
. As well as c_support_code
and cuda_implementation
. docs
would stay in /
.
I don't understand. This issue here is about the restructuring (i.e. moving the code). This is what we should discuss.
Ok it seems there was a misunderstanding of what is part of this issue and what should be excluded.
The lazy loader in the normal case should not need to be used at all.
Yes, this is only for the Sprint related functions. Maybe we find a solution to remove it completely.
But then lets not discuss any of the implementation points any further here, and stick to the actual structuring:
Updated Proposal
Datasets:
*Dataset*.py
-> returnn/datasets/
BundleFile.py
and NormalizationData.py
are used only by the StereoDataset
, and should be there as wellEngine:
The following files moved to returnn/engine/
:
EngineBase.py
EngineBatch.py
, although this is also very related to the datasetsEngineUtil.py
Tensorflow:
The following files are moved to returnn/tf/layers
:
TFNetworkLayer.py
TFNeuralTransducer.py
TFNetworkRecLayer.py
TFNetworkSegModLayer.py
TFNetworkSigProcLayer.py
All otherTF*.py
-> returnn/tf/
Theano:
The following files moved to returnn/theano/
:
ActivationFunctions.py
BestPathDecoder.*
CTC.*
Device.py
Engine.py
EngineTask.py
Inv.*
TheanoUtil.py
Updater.py
The following files moved to returnn/theano/ops/
:
Op*.py
FunctionLoader.py
CustomLstmFunctions.py
TwoStateBestPathDecoder.*
TwoStateHMMOP.py
The following files moved to returnn/theano/network/
:
Network*.py
Sprint:
The following files moved to returnn/sprint/
:
SprintControl.py
, which is used by tensorflow onlySprintErrorSignals.py
, which is used by both backendsSprintExternInterface.py
SprintInterface.py
Remaining
Config.py
Debug.py
DebugHelpers.py
HyperParamTuning.py
Log.py
Pretrain.py
TaskSystem.py
Util.py
Also there are some files which look more like tools than core components, such as:
SprintCache.py
The file Server.py
looks like it works with Theano only.
The rnn.py
functionality can be moved to returnn\__main__.py
, but the file itself needs to remain at the root level.
I think it is more common for python projects to have the tests
folder at the root level, but I am not sure.
It would be really nice if we could get access to at least one Theano setup. @doetsch @doetsch-apptek maybe?
What naming convention for module names do you want to follow? All lower case and with _
between words? Or CamelCase like we have?
Tensorflow:
The following files are moved to
returnn/tf/layers
:
TFNetworkLayer.py
TFNeuralTransducer.py
TFNetworkRecLayer.py
TFNetworkSegModLayer.py
TFNetworkSigProcLayer.py
How would you call the files then? I would definitely remove the prefix TF
(redundant) as well as the postfix Layer
(redundant). Not sure about Network
. Maybe the files should anyway be better moved to /returnn/tf/network/
?
E.g. rec.py
? Or RecLayer.py
(when inside .../network/
)? Or rec_layer.py
?
And base.py
?
The following files moved to
returnn/theano/network/
:
- All
Network*.py
But this is inconsistent to the TF part. I think it should be consistent.
Sprint:
The following files moved to
returnn/sprint/
:
SprintControl.py
, which is used by tensorflow only
No, wrong. This is used by the dataset as well.
SprintErrorSignals.py
, which is used by both backendsSprintExternInterface.py
SprintInterface.py
How to call these files? I assume we should remove the Sprint
prefix. And follow the common naming conventions, i.e. e.g. control.py
, error_signals.py
, etc.
Remaining
Config.py
Debug.py
DebugHelpers.py
HyperParamTuning.py
Log.py
Pretrain.py
TaskSystem.py
Util.py
Also LearningRateControl.py
, and maybe others you forgot.
And where would you put them? Just directly to /returnn/
?
Also there are some files which look more like tools than core components, such as:
SprintCache.py
This was copied from some other code or repo (don't remember exactly anymore). I think it is used for SprintCacheDataset
.
The file
Server.py
looks like it works with Theano only.
Yes, possible. But what's the conclusion? Or what is specific here? Should be moved to /returnn/theano/
, right? Or what is unclear here?
I think it is more common for python projects to have the
tests
folder at the root level, but I am not sure.
Yes, ok.
I would go for "snake_case" for all file names as it is recommended in PEP8.
And, if we don't have "TF" and "layer" in the file names anymore, I would vote for avoiding abbreviations as long as the length of the filename stays reasonable (e.g. TFNetworkRecLayer.py
-> /returnn/tf/layers/recurrent.py
, TFNetworkSigProcLayer.py
-> /returnn/tf/layers/signal_processing.py
).
Looks like we could need /returnn/util
containing what is now Util.py
, TaskSystem*.py
, and maybe Debug*.py
to keep the /returnn
directory clean. But things like Config.py
and Log.py
can simply be put there, I would say.
But this is inconsistent to the TF part. I think it should be consistent.
Good point, then I suggest using also returnn/theano/layers
for "Network*Layer.py" files.
But then we would have those files remaining:
Network.py
NetworkCopyUtils.py
No, wrong. This is used by the dataset as well.
I trusted the "find usage" of PyCharm again, yes this was completely wrong
Should be moved to /returnn/theano/, right? Or what is unclear here?
Yes, I just wanted to raise the question
I would vote for avoiding abbreviations as long as the length of the filename stays reasonable
That sounds good.
I would go for "snake_case" for all file names as it is recommended in PEP8. And, if we don't have "TF" and "layer" in the file names anymore, I would vote for avoiding abbreviations as long as the length of the filename stays reasonable (e.g.
TFNetworkRecLayer.py
->/returnn/tf/layers/recurrent.py
,TFNetworkSigProcLayer.py
->/returnn/tf/layers/signal_processing.py
).
Are you sure? Sometimes we still have multiple words in there, and then this could become very long. Or we would sometimes use abbreviations, and sometimes the full names, but this is inconsistent, which is then also somewhat ugly. It also becomes somewhat Java-like then, which is very uncommon in the Python world.
Looks like we could need
/returnn/util
containing what is nowUtil.py
,TaskSystem*.py
, and maybeDebug*.py
to keep the/returnn
directory clean.
You mean /returnn/util/util.py
etc?
But things like
Config.py
andLog.py
can simply be put there, I would say.
You mean to /returnn/config.py
and /returnn/log.py
?
Well, the objective should be how easy it is to read. Things like "Op" and "CTC" are clear of course (or at least writing those out wouldn't help), but without having looked into them I honestly don't know what "Inv.py" and "TFNetworkSegModLayer.py" are about. Currently there are not many of those cases anyway. But seg_mod.py
? Nah ;P
You mean /returnn/util/util.py etc? You mean to /returnn/config.py and /returnn/log.py?
Yes and yes.
/returnn/util/util.py
That does not look too nice.... would we generally accept to have modules in an __init__.py
file, or do we want to avoid that? Otherwise Util.py
-> /returnn/util/__init__.py
would be ok
Sometimes we still have multiple words in there, and then this could become very long.
We should then think of a different name for those extreme cases (like two_state_best_path_decoder.py
, but for everything that can be composed of two or maybe 3 words I would rather use the full words instead of unclear abbreviations.
Well, the objective should be how easy it is to read. Things like "Op" and "CTC" are clear of course (or at least writing those out wouldn't help), but without having looked into them I honestly don't know what "Inv.py" and "TFNetworkSegModLayer.py" are about. Currently there are not many of those cases anyway. But
seg_mod.py
? Nah ;P
But the problem here is not so much the abbreviation. Having "inverted" instead of "inv", or "segmental" instead of "seg", or "module" instead of "mod", that doesn't really change much, or does it? I think the names itself might be better chosen, but I still would keep using very standard abbreviations ("inv", "seg", "mod", etc). Actually your examples are I think mostly for outdated/unused code anyway.
/returnn/util/util.py
That does not look too nice.... would we generally accept to have modules in an
__init__.py
file, or do we want to avoid that? OtherwiseUtil.py
->/returnn/util/__init__.py
would be ok
No, I think we should not put too much code into __init__.py
.
Maybe /returnn/util/basic.py
instead?
... I would rather use the full words instead of unclear abbreviations.
Yea, sure. I was anyway speaking mostly about clear and very common abbreviations.
or "module" instead of "mod"
This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".
And even inverse.py
would be a bad name... inverse of what? Inverse model? But then inverse model with respect to? To be honest, even when looking shortly over the code (e.g. class names) I have no idea what it does, as there is also AlignOp and InvAlignOp. Of course this is because of bad naming and non-existing docstrings, but the file name can be better as well.
This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".
Well, "mod" is definitely not common for "model", but it is very common for "module", esp in the context of Python. I would say, for Python code, having "mod" for "module" is always totally ok. I.e., it is not ok for "model", so "model" should be written out. I think this is a pretty straight-forward case.
inverse.py would be a bad name... inverse of what? Inverse model?
You can extend it, but this is not about the abbreviation here. Call it inv_model.py
or so.
(I don't really know what this is about. It's anyway too specific, and does not belong to RETURNN at all, just as well as the segmental model, so these are anyway very bad examples.)
I would still say, for common abbreviations, we should definitely keep using the abbreviations. util
is also such an example. Even returnn
itself. Or tf
. Or config
. It does not make sense to write out these things.
In the same manner, I would say that rec
is just as clear. Or sig
. Or proc
. Or param
.
This is really pretty standard in Python code, and I think we should stick to Python conventions. E.g. Python uses def
, init
, str
, int
, repr
, etc.
So, this effectively becomes more a discussion now about which are the cases which are ok. Let's also not put too much priority on these outdated code files which are anyway very bad examples.
Also, this is a question about consistency. Should we use both "rec" and "recurrent". Or only one of these? RecurrentLayer
but _SubnetworkRecWrappedLoss
?
(Or maybe to a certain degree, it's not too much a problem... I guess you will find both "param" and "parameter" somewhere in the code...)
It would be really nice if we could get access to at least one Theano setup. @doetsch @doetsch-apptek maybe?
What do you mean with theano setup? A training setup? I can provide you with that
there appears to be some work on support for a pytorch backend
Yes, but does it bother you in this branch? If it does then we can delete it, I still have a local copy of this backend and nobody is using it anyway (I just wouldn't want to delete it completely).
In general I think is a good thing to do this code restructuring. We could also make the theano RETURNN a standalone version which people have to checkout explicitly, while the actual RETURNN version only contains the TF backend.
This is also a good opportunity to unify the use of underscores and dashes under tools/
.
This is also a good opportunity to unify the use of underscores and dashes under
tools/
.
Not sure about that. We are very careful here to not break old configs and scripts. This would likely break some scripts. We could add symlinks for old name -> new name for the tools. But not sure if this is really nicer than just leaving them as they are.
I just saw that these two points are still open in the draft:
Move
TFNetworkLayer.py
to/returnn/tf/layers.py
? Or/returnn/tf/layers/base.py
and also create/returnn/tf/layers/__init__.py
?. MoveTFNetworkRecLayer.py
to/returnn/tf/rec_layers.py
? Or/returnn/tf/layers/rec.py
.
I would definitely favour to use /returnn/tf/layers/base.py
and /returnn/tf/layers/rec.py
, as we want to support future splitting of the layer files.
There are still some open questions:
TFNetworkRecLayer.py
to /returnn/tf/rec_layers.py
? Or /returnn/tf/layers/rec.py
?/returnn/theano/layers
?Util.py
to /returnn/util/basic.py
?TaskSystem*.py
to /returnn/util/task_system_*.py
?Debug*.py
to /returnn/util/debug_*.py
?About the last point, there are currently inconsistencies (NetworkRecurrentLayer
vs TFNetworkRecLayer
), so this needs to be resolved. The layer modules will have shorter names anyways, as they are moved to returnn/backend/layers
, so writing the files in question without abbreviations (segment_model.py
, signal_processing.py
and recurrent.py
) would not be an issue.
I would also vote for introducing op
packages /returnn/tf/ops/
and /returnn/theano/ops/
.
So, I guess we agreed on TFNetworkLayer.py
to /returnn/tf/layers/base.py
.
Or maybe /returnn/tf/layers/basic.py
to be consistent with Util.py
below?
What about TFNetwork.py
?
- Move
TFNetworkRecLayer.py
to/returnn/tf/rec_layers.py
? Or/returnn/tf/layers/rec.py
?
I vote for the latter.
- Moving Theano layer files to
/returnn/theano/layers
?
Yes. In any case consistent to the TF structure.
- Unify tool names, and keep compatibility with symlinks?
I would vote for no. This would just make it even more chaotic. But in any case, this can also be done (and discussed) independently from this issue, at some later point.
- Move
Util.py
to/returnn/util/basic.py
?
Ok.
- Move
TaskSystem*.py
to/returnn/util/task_system_*.py
?
Ok.
- Move
Debug*.py
to/returnn/util/debug_*.py
?
Ok.
- Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)
Let's just be specific and discuss for the cases we currently have here. I guess "rec" is fine. What else needs to be discussed?
About the last point, there are currently inconsistencies (
NetworkRecurrentLayer
vsTFNetworkRecLayer
), so this needs to be resolved.
There is never a must. Maybe it would be nice to be consistent.
But this will anyway not be possible 100%. E.g. you will find sometimes "net", sometimes "network", used for variable names, class names, comments, whatever... But I think that's ok. Let's not be too pedantic.
In general, I prefer short names.
The layer modules will have shorter names anyways, as they are moved to
returnn/backend/layers
, so writing the files in question without abbreviations (segment_model.py
,signal_processing.py
andrecurrent.py
) would not be an issue.
They are not shorter, the complete names are in fact longer now (esp when we mostly use absolute imports).
I would also vote for introducing
op
packages/returnn/tf/ops/
and/returnn/theano/ops/
.
Which files are moved there?
So, I guess we agreed on TFNetworkLayer.py to /returnn/tf/layers/base.py. Or maybe /returnn/tf/layers/basic.py to be consistent with Util.py below?
Yes, I would say basic.py
I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.
Which files are moved there?
For Theano we have all Op*
files, for Tensorflow there is only TFNativeOp
so far, but maybe we want to have a package for future additions and keep the Theano/TF structure consistent. But of course this could be postponed to a later point.
Do we put SprintDataset.py
to /returnn/sprint/
or to /returnn/datasets/
?
Ok to basic.py
.
ops
is a bit more complicated.
The file NativeOp.py
, as well as all ops implemented for it, are independent from both Theano and TF. (This is actually a very nice feature. I abstracted this away when I ported the code from Theano to TF. Even for Theano, it was already generic in the way that it auto-created the ops both for CPU and GPU.) (The file actually included Theano code up until recently, but that was just because I did not clean it up properly and moved it to a separate file. I did that now. All Theano related code is in TheanoNativeOp.py
.)
Do we put
SprintDataset.py
to/returnn/sprint/
or to/returnn/datasets/
?
This is a question of dependencies also. Code becomes much easier to maintain if dependencies between each other (e.g. at the level of packages or modules) are kept minimal, and also only in one direction if possible.
E.g. Util.py
(and thus everything in /returnn/util/
) is supposed to be independent from anything else in RETURNN.
Config.py
and Log.py
as well (although they might depend on util
).
In some cases, I also like to keep the code even independent from Config
and Log
, if possible, such that it is also useful outside and independent of RETURNN. E.g. TFUtil.py
is written in such a way.
datasets
is also very independent, except that it might depend on Config
, Log
and Util
.
And depending on the dataset, on further optional external dependencies, like h5py
or librosa
.
I think it is most consistent if SprintDataset.py
would be moved to /returnn/datasets/sprint.py
, and this particular dataset has then additional dependencies on /returnn/sprint
.
The stuff in /returnn/sprint
is also very independent, except maybe Config
and Log
.
The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF.
But the class NativeOp
itself is Theano dependent, right? So it would be nice to separate this in the future. But here it does not matter yet, because the question was about the Theano specific Op*.py
files (including TwoState*.py
) and TFNativeOp.py
.
I think it is most consistent if
SprintDataset.py
would be moved to/returnn/datasets/sprint.py
Ok good, I would have favoured that as well.
The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF.
But the class
NativeOp
itself is Theano dependent, right? So it would be nice to separate this in the future.
This is already separated.
But here it does not matter yet, because the question was about the Theano specific
Op*.py
files (includingTwoState*.py
) andTFNativeOp.py
.
Yea, ok. But I would anyway probably not invest too much energy into big restructuring efforts for the Theano part.
For the TF part, there is only TFNativeOp.py
(as a separate file) currently.
Most similar ops are supposed to be implemented in this very generic way (independent of TF).
There are also some very small TF specific ops (only couple of lines of code) directly in TFUtil.py
. Not sure if we should mix them with the other ops. TFNativeOp.py
is conceptually different. Putting it together would only mix up things, and introduce further dependencies (e.g. ops in TFUtil.py
are totally independent of anything from RETURNN; TFNativeOp.py
depends on NativeOp.py
).
So I'm not exactly sure what you suggest now.
NativeOp.py
to /returnn/native_op.py
?TFNativeOp.py
to /returnn/tf/ops/native.py
? But that would be the only file in .../tf/ops
at the moment. Also not sure if it make sense at some later point to move stuff from TFUtil.py
here. Because we should try to keep independent code independent. Like .../tf/util
. Actually it would be good to move more code there, not the other way around. TFNativeOp.py
cannot be moved there, though, because that has the dependency on NativeOp.py
. Or we ignore that.
Or: TFNativeOp.py
to /returnn/tf/util/ops/native.py
?
Given these considerations, I think I would actually prefer: TFNativeOp.py
to /returnn/tf/native_op.py
.This is already separated.
It did not look like that to me, otherwise the if BackendEngine.is_theano_selected():
would not be needed.
Yea, ok. But I would anyway probably not invest too much energy into big restructuring efforts for the Theano part.
This was only about getting the files away from the root, and it does not hurt moving them to /returnn/theano/ops/
.
So I'm not exactly sure what you suggest now.
I did not suggest anything specific, it was only a question about having "ops" folders.
NativeOp.py to /returnn/native_op.py
Ok
Given these considerations, I think I would actually prefer: TFNativeOp.py to /returnn/tf/native_op.py
Ok
This is already separated.
It did not look like that to me, otherwise the
if BackendEngine.is_theano_selected():
would not be needed.
This is exactly the code which separates it. Anyway, are you looking at the latest code? As said, I already moved this to a separate file.
This was only about getting the files away from the root, and it does not hurt moving them to
/returnn/theano/ops/
.
Ok.
Anyway, are you looking at the latest code? As said, I already moved this to a separate file.
Ah, you just pushed it yesterday ok.
I updated the draft accordingly.
The remaining files should now be:
Fsa.py
, it looks like a tool to me, so move to /tools/
?External.py
as the name implies it is not used inside RETURNN, so maybe it can just be kept in /returnn/
NetworkStream.py
I assume this is Theano only, so this can go into /returnn/theano/
? or better /returnn/util/
?The remaining files should now be:
Fsa.py
, it looks like a tool to me, so move to/tools/
?
No, it's not a tool. It's a lib which provides some Fsa stuff. Which is also used in some other parts (e.g. NativeOp).
Some Python files additional have a small if __name__ == "__main__"
part. In most cases, this is just for testing. Although sometimes it provides also some useful functionality. E.g. LearningRateControl.py
or LmDataset.py
are such examples. I'm not sure if it will still be possible to call these files directly after our restructuring. Maybe with some hacks or tricks. This is one of the problems I mentioned very initially about the restructuring.
External.py
as the name implies it is not used inside RETURNN, so maybe it can just be kept in/returnn/
I'm pretty sure it is used somewhere. I think by the dataset. If it is not used at all, it can be deleted.
It looks like this should be in Util.py
. At least very similar things (Numpy utils) are all in Util.py
. (Later it might make sense to split that up into Numpy stuff and other stuff.)
So maybe move to /returnn/util/external.py
for now, and later restructure in a better way.
Or, before we restructure, we move the code right now already to Util.py
, then this is cleaner.
Edit I just checked, actually it looks like this is not used at all. So it can be deleted. (Not sure why you say it should be kept.) Edit Done, deleted.
NetworkStream.py
I assume this is Theano only, so this can go into/returnn/theano/
? or better/returnn/util/
?
I have no idea.
This also looks broken. At least this is Python 2 only.
Maybe completely delete the file?
Edit I just checked, this is used by Device
, i.e. by Theano code. So just move to /returnn/theano/
.
Maybe with some hacks or tricks.
We could e.g. add the returnn
package to the path with the sys module when it is run as standalone.
No, it's not a tool. It's a lib which provides some Fsa stuff. Which is also used in some other parts (e.g. NativeOp).
Yes, but it seems there is code that is only used for the "demo", and the rest (maybe smaller part) is used only in TFNativeOp.py
. But we can leave this for now and split this later if possible.
Is there anything missing now with respect to the structure?
Maybe with some hacks or tricks.
We could e.g. add the
returnn
package to the path with the sys module when it is run as standalone.
This is not enough. You would also need sth like this:
__path__ = [os.getcwd()]
__name__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0] # not sure...
__package__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0]
sys.modules[__name__] = sys.modules["__main__"]
Not sure if this is enough then, and works like this... (See e.g. here.) But not so important now. We can make that work somehow.
Is there anything missing now with respect to the structure?
There are still some question marks in the proposal. And besides them, not sure if we forgot about sth.
But not so important now. We can make that work somehow.
Ok, then lets move it to /returnn/util/
.
The remaining question marks are:
TFNetworkSegModLayer.py
to /returnn/tf/layers/seg_model.py
?TFNetworkSigProcLayer.py
to /returnn/tf/layers/sig_proc.py
?Engine(Base|Batch|Util).py
to /returnn/engine/
?I do not really like sig_proc.py
, I would call it signal_processing.py
or just signal.py
if it has to be short.
seg_model.py
could also just be segmentation.py
.
But not so important now. We can make that work somehow.
Ok, then lets move it to
/returnn/util/
.
Ok.
The remaining question marks are:
- Move
TFNetworkSegModLayer.py
to/returnn/tf/layers/seg_model.py
?- Move
TFNetworkSigProcLayer.py
to/returnn/tf/layers/sig_proc.py
?- Move
Engine(Base|Batch|Util).py
to/returnn/engine/
?I do not really like
sig_proc.py
, I would call itsignal_processing.py
or justsignal.py
if it has to be short.
In what possible way can you misunderstand sig_proc.py
? "sig" is always "signal", "proc" is always "process" or "processing". I think this is pretty clear, right?
Just signal.py
is bad, because this would be unclear what this means. There are many things about signals, e.g. the OS signals. There is even an official Python module with that name (which also makes heavy use of the "sig" abbreviation).
seg_model.py
could also just besegmentation.py
.
Just segmentation.py
would be bad, because this sounds as if this is something very general about segmentation, while in fact this is very specific (way too specific actually) about the segmental model (and also only one specific kind of it).
You say that "seg" is really so unclear? (For those people who don't know about segmental models, I don't think that "segmental_model" is really more helpful than "seg_model".)
I understand that removing/ deactivating the Theano backend is not an option right now. I don't have enough insight into these parts to know whether moving the corresponding files into a separate folder (instead of adding a prefix to every file) would break something.
However, there appears to be some work on support for a pytorch backend (at least there is a new branch with work on this). Would it be possible to move this stuff to a separate folder directly instead of adding more files to the root folder?
In general, I think it might be a good idea to collect some ideas on how we could improve the structure of the code. I would suggest moving the layers into a folder structure similar to the organization in the docu by @JackTemaki. (see: https://returnn.readthedocs.io/en/latest/layer_reference/index.html#layer-types )