ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Dataset refactor #754

Closed jindrahelcl closed 6 years ago

jindrahelcl commented 6 years ago

This PR introduces a new way to construct a dataset:

[train_data]
series=["source", "target", "source_prep", "computed"]
data=["/path/src", ("/path/tgt", <reader>), (<preprocessor>, "source"), <dataset_level_prep>]
outputs=[("target", "/path/out/tgt"), ("computed", "/path/out/cmp", <writer>)]
shuffled=True
buffer_size=1024

Details:
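For orientation, here is my own reading of how the entries in the example line up (an illustrative sketch, not code from this PR; the Writer alias and the DataSpec name are my assumptions): the n-th name in series is paired with the n-th entry in data, and each outputs entry names a series, a target path, and optionally a writer.

from typing import Any, Callable, Iterator, List, Tuple, Union

# Aliases as stated later in this thread:
Reader = Callable[[List[str]], Iterator[Any]]
Preprocessor = Callable[[List[Any]], List[Any]]
# Assumption: a writer takes an output path and the series data.
Writer = Callable[[str, Any], None]

# What an entry of the data list appears to accept (illustrative name):
DataSpec = Union[
    str,                       # "/path/src": plain path, default reader
    Tuple[str, Reader],        # ("/path/tgt", reader)
    Tuple[Preprocessor, str],  # (preprocessor, "source"): derived series
    Callable[..., Any],        # dataset-level preprocessor
]

# series[i] names the series produced by data[i]; each outputs entry is
# ("series", "/output/path") or ("series", "/output/path", writer).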

varisd commented 6 years ago

Why is the (<preprocessor>, "source") in this order and not ("source", <preprocessor>)? I think it would be clearer to have ('input_file/input_series', 'processor/reader').

jindrahelcl commented 6 years ago

Because both readers and preprocessors are callables and their types can overlap. (Readers are Callable[[List[str]], Iterator[Any]] and preprocessors are Callable[[List[Any]], List[Any]], so there is overlap, for example for Callable[[List[str]], List[str]].)

When you want to recognize which is which from its type (without introducing a parent class for Reader and Preprocessor), you need the switched ordering; otherwise both would be Tuple[str, Callable].
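A minimal sketch of that argument (my illustration, not the PR's code): with the chosen ordering, the type of a tuple's first element already tells a reader spec apart from a preprocessor spec.

def classify(spec):
    # Illustrative only: tell the data specifications apart by shape/type.
    if isinstance(spec, str):
        return "plain path, default reader"
    if isinstance(spec, tuple) and isinstance(spec[0], str):
        return "(path, reader)"
    if isinstance(spec, tuple) and callable(spec[0]):
        return "(preprocessor, source series)"
    if callable(spec):
        return "dataset-level preprocessor"
    raise TypeError("unrecognized data spec: {!r}".format(spec))

# With the reversed ("source", preprocessor) ordering, both tuple forms
# would look like Tuple[str, Callable], and the check above could not
# tell a reader from a preprocessor without a common base class.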

jindrahelcl commented 6 years ago

... and you don't want to start introducing inheritance for these things, because it's not that simple once all the invariance/covariance and whatnot comes into play.

jindrahelcl commented 6 years ago

I fixed the errors and removed the redundant lazy parameter. Now, whenever buffer_size is specified, the dataset behaves lazily. (Previously it crashed when lazy was not true, and conversely when lazy was true and no buffer size was specified.)

jindrahelcl commented 6 years ago

Note that if buffer_size is None, the data are actually stored in memory and are not re-read from the files. This is the expected behavior.
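To illustrate the described behaviour (an assumed sketch, not the actual implementation; read_examples is a hypothetical callable that re-reads the source files): buffer_size alone switches between fully in-memory data and lazy, buffered iteration.

import itertools
import random

def iterate_dataset(read_examples, buffer_size=None, shuffled=False):
    # Sketch of the described behaviour, not the PR's actual code.
    if buffer_size is None:
        # No buffer size: everything is stored in memory and the files
        # are not re-read on subsequent iterations.
        examples = list(read_examples())
        if shuffled:
            random.shuffle(examples)
        return iter(examples)

    def lazy_iter():
        # Buffer size given: behave lazily, holding at most buffer_size
        # examples at a time and shuffling within the buffer if requested.
        iterator = read_examples()
        while True:
            buf = list(itertools.islice(iterator, buffer_size))
            if not buf:
                return
            if shuffled:
                random.shuffle(buf)
            yield from buf

    return lazy_iter()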