FE suggestions for preprocessing components

GoogleCodeExporter commented 9 years ago

There is currently a steep learning curve when a TC user wants to try a new 
Feature Extractor (FE) that requires preprocessing.  The FE's Type Capability:
@TypeCapability(inputs = { 
"de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency" })

explains what type of preprocessing is needed, but currently the user must look 
through the online javadoc of DKPro Core preprocessing components, searching 
for a component that outputs the needed input of the FE.

It would be helpful if TC FEs included preprocessing component suggestions in 
their Javadoc, such as:

A sample preprocessing EngineDescription for this FE includes:
de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter
de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger
de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser

Alternative documentation ideas are also welcome.

Original issue reported on code.google.com by EmilyKJa...@gmail.com on 11 Jun 2014 at 5:21

Blocked on: #121

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 11 Jun 2014 at 5:25

Now blocked on: #121

GoogleCodeExporter commented 9 years ago

I see the problem, but I do not think that FEs should declare specific 
components as requirements.

For the short term, some documentation would be useful.

For the medium term, it would be good if TC could scan the classpath for 
components that produce the desired annotations and suggest them to the user 
(e.g. in an error message).
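
To make the medium-term idea concrete, here is a minimal sketch of the matching step. It assumes components declare their outputs the same way the FE declares its inputs. The nested `TypeCapability` annotation is only a stand-in modelled on the uimaFIT one so the example compiles on its own, and `DemoDependencyFE`, `DemoSegmenter`, `DemoParser` and the `demo.Token` type name are hypothetical. Real classpath scanning is left out; the candidate classes are simply passed in as a list.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProducerSuggester {

    static final String DEPENDENCY =
            "de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency";

    // Stand-in annotation (assumption): mirrors the inputs/outputs idea of
    // @TypeCapability so that this sketch needs no external library.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface TypeCapability {
        String[] inputs() default {};
        String[] outputs() default {};
    }

    // Hypothetical FE and components, for illustration only.
    @TypeCapability(inputs = { DEPENDENCY })
    static class DemoDependencyFE {}

    @TypeCapability(outputs = { "demo.Token" })
    static class DemoSegmenter {}

    @TypeCapability(inputs = { "demo.Token" }, outputs = { DEPENDENCY })
    static class DemoParser {}

    /** Returns the simple names of all candidates that declare the type as an output. */
    static List<String> producersOf(String type, List<Class<?>> candidates) {
        List<String> result = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            TypeCapability cap = candidate.getAnnotation(TypeCapability.class);
            if (cap != null && Arrays.asList(cap.outputs()).contains(type)) {
                result.add(candidate.getSimpleName());
            }
        }
        return result;
    }

    /** Builds one line per required input type, listing candidate producers. */
    static String suggestionMessage(Class<?> fe, List<Class<?>> candidates) {
        TypeCapability cap = fe.getAnnotation(TypeCapability.class);
        StringBuilder sb = new StringBuilder();
        if (cap == null) {
            return sb.toString();
        }
        for (String input : cap.inputs()) {
            List<String> producers = producersOf(input, candidates);
            sb.append(fe.getSimpleName()).append(" requires ").append(input)
              .append("; candidate producers: ")
              .append(producers.isEmpty() ? "none found" : producers.toString())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Calling `suggestionMessage` with the demo FE and both demo components would name only the parser as a candidate. Whether a candidate's own inputs are satisfied is not checked here.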

Let's discuss the long term offline.

Original comment by richard.eckart on 11 Jun 2014 at 6:02

GoogleCodeExporter commented 9 years ago

The kind of documentation that Emily suggests is essential, and it becomes even 
more important if we start to use lexical resources during feature extraction.

Original comment by eckle.kohler on 11 Jun 2014 at 6:15

GoogleCodeExporter commented 9 years ago

The best we currently have is this:

https://code.google.com/p/dkpro-core-asl/wiki/ComponentList

Original comment by richard.eckart on 11 Jun 2014 at 6:17

GoogleCodeExporter commented 9 years ago

I agree with Richard that TC is probably not the place to document examples of 
components that might get outdated anyway.
I see this as an instance of a more general problem, one that students also run 
into when starting to use DKPro Core: it is not very clear which components 
create which annotations.

Original comment by torsten....@gmail.com on 11 Jun 2014 at 7:54

GoogleCodeExporter commented 9 years ago

Regarding documentation, and whether it should be distributed across TC FEs or 
centralized in, say, a chart in the Google wiki of DKPro Core:

Suppose DKPro Core components change and a preprocessing component that an FE 
relies on is removed from DKPro Core, rendering the FE unusable.  Tracking down 
why the FE no longer works might take longer with centralized documentation 
than with distributed documentation.  With distributed documentation, I could 
open the FE, see the suggested preprocessing, look at an error message or at 
Core, notice that one component is no longer supported, realize that no other 
component will work, and flag the FE as buggy.  If documentation is only 
centralized, I might think, "there's got to be some combination of components 
that works, but I can't see what it is..."

Other benefits of distributed documentation:
-When a developer adds an FE to TC, should they also have to be a developer on 
Core so they can update the centralized documentation there for the TC FE?  
-Should Core be required to host documentation specific to the needs of TC FEs?
-Some FEs have trivial preprocessing needs such as tokenization, but as time 
passes, we are seeing a wonderfully diverse library of FEs contributed to TC; 
for an FE with preprocessing needs like semantic dependency parsing of NEs, 
won't the explanation of a necessary preprocessing pipeline be too unwieldy to 
centralize?

Of course, distributed documentation has the drawback that it is more effort to 
keep updated.

Original comment by EmilyKJa...@gmail.com on 11 Jun 2014 at 10:41

GoogleCodeExporter commented 9 years ago

I cannot follow your argument. I don't understand what FE-specific 
documentation would be kept in DKPro Core.

We have two questions:

a) DKPro Core documentation should be able to answer the question "which 
component produces annotation type X".

b) DKPro TC documentation should answer the question "which annotation type is 
required by FE X".

It appears to me to be a pretty clean separation of concerns. 

I thought that one goal of TC was to remove the need for the user to answer 
these questions by automatically adding the required preprocessing when an FE 
is added. 

So to start with, having answers for both questions cleanly separated in Core 
and TC makes sense to me. Eventually, though, it would be nice if work on the 
main goal could proceed: removing the need to answer these questions.

Your initial suggestion to let the FE JavaDoc suggest components directly goes 
in that direction. But that's again just documentation. How about defining an 
automatic solution? A simplistic start could be a configuration file in DKPro 
TC that maps each type to an analysis engine, e.g.

...Sentence=...OpenNlpSegmenter
...Token=...OpenNlpSegmenter
...Dependency=...MaltParser

TC could use this information to add the respective analysis engines to the 
preprocessing step. I'm sure you can see immediately that there are pitfalls 
in this process. We'd need to see how far we can get with such a simple 
solution before banging our heads against the wall. In the worst case, TC could 
use this information simply to construct an error message to display to the 
user.
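
A rough sketch of how such a mapping file could be consumed, assuming a plain Java properties format. The class `TypeToEngineConfig` and its method are hypothetical, and the keys and values use bare simple names as placeholders because the package prefixes are elided above. The sketch covers both outcomes mentioned: returning the engines to add, or failing with a message that names the unmapped types.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: resolves required annotation types to analysis engine
// names via a type=engine properties mapping.
class TypeToEngineConfig {

    private final Properties mapping = new Properties();

    TypeToEngineConfig(String propertiesText) {
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the engines needed for the given types, without duplicates, in
     * the order in which the types were requested. Ordering engines by their
     * own dependencies is one of the pitfalls and is not handled here.
     */
    List<String> enginesFor(List<String> requiredTypes) {
        Set<String> engines = new LinkedHashSet<>();
        List<String> unmapped = new ArrayList<>();
        for (String type : requiredTypes) {
            String engine = mapping.getProperty(type);
            if (engine == null) {
                unmapped.add(type);
            } else {
                engines.add(engine);
            }
        }
        if (!unmapped.isEmpty()) {
            // Worst case: only tell the user what is missing.
            throw new IllegalStateException(
                    "No preprocessing component configured for: " + unmapped);
        }
        return new ArrayList<>(engines);
    }
}
```

With the three mapping lines above (prefixes dropped), requesting Sentence, Token and Dependency would yield the segmenter once, followed by the parser.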

Original comment by richard.eckart on 11 Jun 2014 at 10:56

GoogleCodeExporter commented 9 years ago

Thanks Richard, for your helpful points in multiple directions.

In my previous post, I overlooked the fact that, while a particular FE's 
*combination* of preprocessing types may be unique, each type must be 
pre-existing in DKPro Core unless a Core developer adds a new one.  So you're 
right, there's no need to update Core documentation for each new TC FE.

I think this Issue has outgrown itself, so I am closing it for now and adding 
an item to the TC Meeting agenda for further discussion.

Original comment by EmilyKJa...@gmail.com on 11 Jun 2014 at 11:33

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Keeping the discussion alive for now, as I like the points Richard has raised 
about ease of use for TC users.
It would be great if there was no need for users to specify which preprocessing 
components to use. There should be a sensible default and the possibility to 
override the default.

Unfortunately, the Type2Component mapping is rather language-specific (e.g. 
specialised parsers for some languages), but of course there are components 
that just support a wider range of languages and would make a good default 
(e.g. Stanford).

As Richard said, an error message could be generated if the defaults yield no 
usable result for some input.
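
As a sketch of what "sensible default plus override" could look like, assuming the mapping is keyed by annotation type and language: `DefaultComponentResolver`, the `*` wildcard for language-independent defaults, and any component names passed to it are hypothetical, not existing TC API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver: user override first, then the language-specific
// default, then a language-independent default, otherwise an error.
class DefaultComponentResolver {

    static final String ANY_LANGUAGE = "*";

    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, String> overrides = new HashMap<>();

    private static String key(String type, String language) {
        return type + "@" + language;
    }

    void registerDefault(String type, String language, String component) {
        defaults.put(key(type, language), component);
    }

    void override(String type, String language, String component) {
        overrides.put(key(type, language), component);
    }

    String resolve(String type, String language) {
        String k = key(type, language);
        if (overrides.containsKey(k)) {
            return overrides.get(k);
        }
        if (defaults.containsKey(k)) {
            return defaults.get(k);
        }
        String fallback = defaults.get(key(type, ANY_LANGUAGE));
        if (fallback != null) {
            return fallback;
        }
        // No default applies: report it instead of failing silently.
        throw new IllegalStateException(
                "No default component produces " + type
                        + " for language '" + language + "'");
    }
}
```

A wide-coverage component would be registered under `*`, a specialised parser under its language code, and a user-supplied choice would win over both.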

Original comment by torsten....@gmail.com on 12 Jun 2014 at 3:31

FE suggestions for preprocessing components #144