ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Chaining of dataset series preprocessors can fail. #767

Open varisd opened 5 years ago

varisd commented 5 years ago

When chaining multiple dataset series preprocessor steps, e.g. preprocessors=[("source", "source_wp", ), ("source_wp", "source_wp_other", )], dataset.load can fail because the preprocessors list is not processed in any guaranteed order. A step can therefore run before the step that produces its source series.

See https://github.com/ufal/neuralmonkey/blob/master/neuralmonkey/dataset.py#L325 for the part of the code that should be fixed.
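To make the failure mode concrete, here is a minimal sketch, not the actual Neural Monkey code. It assumes that series are plain lists of token lists and that preprocessors are visited in a single pass; `naive_load`, `lower`, and `reverse` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a series is a list of tokenized sentences.
Series = List[List[str]]
Preprocessor = Callable[[List[str]], List[str]]


def naive_load(raw: Dict[str, Series],
               prep_sl: Dict[str, Tuple[Preprocessor, str]]
               ) -> Dict[str, Series]:
    """Apply preprocessors in one pass, without dependency ordering."""
    series = dict(raw)
    for s_name, (preprocessor, source) in prep_sl.items():
        if source not in series:
            raise ValueError(
                "Source series '{}' for preprocessed series '{}' "
                "does not exist".format(source, s_name))
        series[s_name] = [preprocessor(sent) for sent in series[source]]
    return series


def lower(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def reverse(tokens: List[str]) -> List[str]:
    return list(reversed(tokens))


raw = {"source": [["Hello", "World"]]}

# The producing step happens to be visited first, so the chain works.
ok = naive_load(raw, {"source_wp": (lower, "source"),
                      "source_wp_other": (reverse, "source_wp")})

# The dependent step is visited first, so its source is still missing.
try:
    naive_load(raw, {"source_wp_other": (reverse, "source_wp"),
                     "source_wp": (lower, "source")})
    failed = False
except ValueError:
    failed = True
```

Whether the chain works depends only on the order in which the entries are visited, which is the problem described above.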

jindrahelcl commented 5 years ago

Look at the comment two lines above the line you are linking to. The correct approach is to use the pipeline processor. Some kind of tree-based processing could be added here, but that did not work in the old dataset either.
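The idea of a pipeline processor can be sketched as composing several steps into a single callable, so that only one series-level preprocessor entry is needed and no ordering between entries matters. This is an illustrative assumption, not Neural Monkey's own implementation; `compose_pipeline`, `lowercase`, and `add_eos` are hypothetical names.

```python
from functools import reduce
from typing import Callable, List

Tokens = List[str]
Step = Callable[[Tokens], Tokens]


def compose_pipeline(steps: List[Step]) -> Step:
    """Return one callable that applies `steps` from left to right."""
    def run(tokens: Tokens) -> Tokens:
        return reduce(lambda acc, step: step(acc), steps, tokens)
    return run


def lowercase(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]


def add_eos(tokens: Tokens) -> Tokens:
    return tokens + ["</s>"]


combined = compose_pipeline([lowercase, add_eos])
result = combined(["Hello", "World"])
```

The trade-off is that the intermediate series (here, the lowercased one) is not exposed under its own name, which is why chained entries are sometimes still wanted.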

varisd commented 5 years ago

Here is a suggestion:


    def _add_preprocessed_series(iterators, s_name, prep_sl):
        # Series that already exist (raw or already preprocessed) are done.
        if s_name in iterators:
            return
        preprocessor, source = prep_sl[s_name]
        # Build the source first if it is itself a preprocessed series.
        # Note: a cyclic chain would recurse without end here.
        if source in prep_sl:
            _add_preprocessed_series(iterators, source, prep_sl)
        if source not in iterators:
            raise ValueError(
                "Source series for series-level preprocessor nonexistent: "
                "Preprocessed series '{}', source series '{}'"
                .format(s_name, source))
        iterators[s_name] = _make_sl_iterator(source, preprocessor)
[...]
    for s_name in prep_sl:
        _add_preprocessed_series(iterators, s_name, prep_sl)
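The suggestion above depends on the surrounding code in dataset.py, so here is a self-contained sketch of the same recursive idea that can be run in isolation. It assumes series are plain lists of token lists instead of iterators, and it adds a cycle check that the suggestion does not have; `resolve_series`, `lower`, and `reverse` are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

Series = List[List[str]]
Preprocessor = Callable[[List[str]], List[str]]


def resolve_series(raw: Dict[str, Series],
                   prep_sl: Dict[str, Tuple[Preprocessor, str]]
                   ) -> Dict[str, Series]:
    """Build preprocessed series so that each source is built first."""
    series = dict(raw)
    in_progress: Set[str] = set()  # names currently on the recursion stack

    def add(s_name: str) -> None:
        if s_name in series:
            return
        if s_name in in_progress:
            raise ValueError(
                "Cyclic preprocessor dependency at '{}'".format(s_name))
        preprocessor, source = prep_sl[s_name]
        in_progress.add(s_name)
        # Resolve the source first if it is itself a preprocessed series.
        if source in prep_sl:
            add(source)
        if source not in series:
            raise ValueError(
                "Preprocessed series '{}': source series '{}' "
                "does not exist".format(s_name, source))
        series[s_name] = [preprocessor(sent) for sent in series[source]]
        in_progress.discard(s_name)

    for s_name in prep_sl:
        add(s_name)
    return series


def lower(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def reverse(tokens: List[str]) -> List[str]:
    return list(reversed(tokens))


# The dependent step is listed first, yet resolution still succeeds.
resolved = resolve_series(
    {"source": [["Hello", "World"]]},
    {"source_wp_other": (reverse, "source_wp"),
     "source_wp": (lower, "source")})
```

With this approach the order of entries in the preprocessors list no longer matters, and a chain that refers back to itself raises an error instead of recursing indefinitely.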