ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Dataset refactor #754

Closed jindrahelcl closed 6 years ago

jindrahelcl commented 6 years ago

This PR introduces a new way to construct a dataset:

[train_data]
series=["source", "target", "source_prep", "computed"]
data=["/path/src", ("/path/tgt", <reader>), (<preprocessor>, "source"), <dataset_level_prep>]
outputs=[("target", "/path/out/tgt"), ("computed", "/path/out/cmp", <writer>)]
shuffled=True
buffer_size=1024

Details:
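For orientation, here is my own reading of how the entries in the example line up (an illustrative sketch, not code from this PR; the Writer alias and the DataSpec name are my assumptions): the n-th name in series is paired with the n-th entry in data, and each outputs entry names a series, a target path, and optionally a writer.

from typing import Any, Callable, Iterator, List, Tuple, Union

# Aliases as stated later in this thread:
Reader = Callable[[List[str]], Iterator[Any]]
Preprocessor = Callable[[List[Any]], List[Any]]
# Assumption: a writer takes an output path and the series data.
Writer = Callable[[str, Any], None]

# What an entry of the data list appears to accept (illustrative name):
DataSpec = Union[
    str,                       # "/path/src": plain path, default reader
    Tuple[str, Reader],        # ("/path/tgt", reader)
    Tuple[Preprocessor, str],  # (preprocessor, "source"): derived series
    Callable[..., Any],        # dataset-level preprocessor
]

# series[i] names the series produced by data[i]; each outputs entry is
# ("series", "/output/path") or ("series", "/output/path", writer).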

varisd commented 6 years ago

Why is the (<preprocessor>, "source") in this order and not ("source", <preprocessor>)? I think it would be clearer to have ('input_file/input_series', 'processor/reader').

jindrahelcl commented 6 years ago

Because both readers and preprocessors are callables and their types can overlap. (Readers are Callable[[List[str]], Iterator[Any]] and preprocessors are Callable[[List[Any]], List[Any]], so there is overlap, for example for Callable[[List[str]], List[str]].)

When you want to recognize which is which from its type (without introducing a parent class for Reader and Preprocessor), you need the switched ordering; otherwise both would be Tuple[str, Callable].
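A minimal sketch of that argument (my illustration, not the PR's code): with the chosen ordering, the type of a tuple's first element already tells a reader spec apart from a preprocessor spec.

def classify(spec):
    # Illustrative only: tell the data specifications apart by shape/type.
    if isinstance(spec, str):
        return "plain path, default reader"
    if isinstance(spec, tuple) and isinstance(spec[0], str):
        return "(path, reader)"
    if isinstance(spec, tuple) and callable(spec[0]):
        return "(preprocessor, source series)"
    if callable(spec):
        return "dataset-level preprocessor"
    raise TypeError("unrecognized data spec: {!r}".format(spec))

# With the reversed ("source", preprocessor) ordering, both tuple forms
# would look like Tuple[str, Callable], and the check above could not
# tell a reader from a preprocessor without a common base class.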

jindrahelcl commented 6 years ago

... and you don't want to start introducing inheritance for these things, because it's not that simple once all the invariance/covariance and whatnot comes into play.

jindrahelcl commented 6 years ago

I fixed the errors and removed the redundant lazy parameter. Now, whenever buffer_size is specified, the dataset behaves lazily. (Previously it crashed when lazy was not true, and conversely when lazy was true and no buffer size was specified.)

jindrahelcl commented 6 years ago

Note that if buffer_size is None, the data are actually stored in memory and are not re-read from the files. This is the expected behavior.
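To illustrate the described behaviour (an assumed sketch, not the actual implementation; read_examples is a hypothetical callable that re-reads the source files): buffer_size alone switches between fully in-memory data and lazy, buffered iteration.

import itertools
import random

def iterate_dataset(read_examples, buffer_size=None, shuffled=False):
    # Sketch of the described behaviour, not the PR's actual code.
    if buffer_size is None:
        # No buffer size: everything is stored in memory and the files
        # are not re-read on subsequent iterations.
        examples = list(read_examples())
        if shuffled:
            random.shuffle(examples)
        return iter(examples)

    def lazy_iter():
        # Buffer size given: behave lazily, holding at most buffer_size
        # examples at a time and shuffling within the buffer if requested.
        iterator = read_examples()
        while True:
            buf = list(itertools.islice(iterator, buffer_size))
            if not buf:
                return
            if shuffled:
                random.shuffle(buf)
            yield from buf

    return lazy_iter()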