qe-team / marmot

MARMOT - the open source framework for feature extraction and machine learning, designed to estimate the quality of Machine Translation output
ISC License

parsers should return data, not filenames. parsers should not create representations, they should just parse them. #13

Open chrishokamp opened 9 years ago

chrishokamp commented 9 years ago

Right now, some of our parsers return context objects, and some return the filenames of implicitly whitespace-tokenized files. All parsers should take filenames as input and return lists of lists as output.

It should be the responsibility of the representation generator to (1) generate the representation and (2) decide whether or not to persist it. The only job of a parser is to read some file format and return an object of the form { 'key': [[seq1_item1, seq1_item2, ...]] }.
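A minimal sketch of the parser contract described above. The function name `parse_whitespace_tokenized` and the `key` parameter are hypothetical and not taken from the marmot codebase; the sketch only assumes one whitespace-tokenized sequence per line.

```python
import io


def parse_whitespace_tokenized(filename, key="target"):
    """Read one sequence per line and split it on whitespace.

    Hypothetical helper: the name and the `key` default are illustrative,
    not part of marmot's actual API.
    """
    with io.open(filename, encoding="utf-8") as f:
        # Blank lines become empty sequences, so line alignment with
        # other files is preserved.
        sequences = [line.split() for line in f]
    # The parser only reads and returns data; it does not write files
    # or build new representations.
    return {key: sequences}
```

Persisting the result, or deriving anything further from it, would then be left to a representation generator.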

chrishokamp commented 9 years ago

We seem to be moving towards using representation generators for everything. The disadvantage is that the user must then call create_contexts after generating their representations. With the parser approach they could go directly from data to context objects.

varvara-l commented 9 years ago

Going directly from data to context objects is possible only if we don't need any additional representations. But we can create parsers to handle such a scenario as well.

chrishokamp commented 9 years ago

We can imagine a use case where a user just wants to run a feature extractor on a dataset and get the features dumped back out. What is the simplest way for them to specify this in the config?

varvara-l commented 9 years ago

It can be specified in "datasets" in the same way as representation generators are specified now.

The main thing is then to handle that in the code as well: if parsers go directly to context objects, no representation generators should be applied to the parser output, and create_contexts should not be called.
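The branching described above could look roughly like this. It is only a sketch: `build_contexts` is a hypothetical name, the signature of `create_contexts` is assumed (one data dict in, a list of context objects out), and a parser that goes directly to context objects is assumed to return a list rather than a dict.

```python
def build_contexts(parser_output, representation_generators, create_contexts):
    """Return context objects from parser output.

    Hypothetical helper. Assumes a parser that goes directly to context
    objects returns a list, while a data parser returns a dict of
    {'key': [[item, ...], ...]}.
    """
    if isinstance(parser_output, list):
        # The parser already produced context objects: apply no
        # generators and do not call create_contexts.
        return parser_output
    data = parser_output
    for generate in representation_generators:
        # Each generator is assumed to take the data dict and return
        # an updated dict with its representation added.
        data = generate(data)
    return create_contexts(data)
```

Distinguishing the two cases by return type is just one option; an explicit flag in the "datasets" config would make the intent clearer.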

chrishokamp commented 9 years ago

Yeah, I think run_experiment really only handles one use case right now. It may be easier to create more scripts like 'extract_features' instead of trying to handle every possible use case inside one script.
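As a rough idea of how small the core of such an 'extract_features' script could be, here is a sketch. It assumes, purely for illustration, that each feature extractor is a plain callable taking one context object and returning a list of feature values; marmot's real extractor interface may differ.

```python
def extract_features(contexts, feature_extractors):
    """Apply every extractor to every context.

    Hypothetical helper: returns one row per context, with the outputs
    of all extractors concatenated in the order they were given.
    """
    return [
        [value for extractor in feature_extractors for value in extractor(context)]
        for context in contexts
    ]
```

A dedicated script would then only need to parse the data, build the contexts, call this, and dump the rows.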
