google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

ModalVerbsFeatureExtractor for German #58

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
It would be very nice to have a ModalVerbsFeatureExtractor for German as well.

The actual modal verbs that this feature extractor looks up could be passed 
as a wordlist parameter. That would make the ModalVerbsFeatureExtractor easier 
to use across languages.

Original issue reported on code.google.com by eckle.kohler on 10 Nov 2013 at 1:17

GoogleCodeExporter commented 9 years ago
Wouldn't that be something like a DictionaryFeatureExtractor?

Original comment by richard.eckart on 10 Nov 2013 at 1:26

GoogleCodeExporter commented 9 years ago
There's the 
de.tudarmstadt.ukp.dkpro.tc.features.content.TopicWordsFeatureExtractor
which is basically a simple dictionary feature extractor (and should be 
renamed accordingly, by the way).

It adds a hard-coded prefix to each feature - I suggest we make this prefix 
configurable, so you could set it to "Modal_" and then easily identify modal 
features later on.
The mechanics of the extractor are pretty much the same for these cases.

Original comment by oliver.ferschke on 10 Nov 2013 at 1:30

GoogleCodeExporter commented 9 years ago
It also depends on what aspects of modal verbs you want to capture:

count in document?
text to modal verb ratio?
simple presence of any modal verbs in the text?
other?

These scenarios could also be implemented in a generic way for all use cases 
based on dictionaries...
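
A minimal sketch of how the three scenarios above (count, ratio, presence) could be computed generically from a wordlist. The class `DictionaryFeatures`, the method `extract` and the feature names are hypothetical and not part of DKPro TC; the ratio is computed here as dictionary hits divided by the number of tokens, which is only one possible reading of "text to modal verb ratio".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only, not an existing DKPro TC class.
class DictionaryFeatures {

    /**
     * Computes count, ratio and presence features for the given tokens.
     * The wordList is assumed to contain lower-cased surface forms.
     */
    static Map<String, Double> extract(List<String> tokens, Set<String> wordList, String prefix) {
        // Number of tokens whose lower-cased form occurs in the wordlist.
        long hits = tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(wordList::contains)
                .count();

        Map<String, Double> features = new LinkedHashMap<>();
        features.put(prefix + "count", (double) hits);
        // Assumption: ratio = dictionary hits / number of tokens (0 for empty input).
        features.put(prefix + "ratio", tokens.isEmpty() ? 0.0 : (double) hits / tokens.size());
        features.put(prefix + "present", hits > 0 ? 1.0 : 0.0);
        return features;
    }
}
```

Because the wordlist and the prefix are plain parameters, the same code would serve German modals, English modals or any other dictionary.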

Original comment by oliver.ferschke on 10 Nov 2013 at 1:34

GoogleCodeExporter commented 9 years ago
Yes, I think it would be something *similar* to that:
the generalized version - a DictionaryFeatureExtractor - would count 
occurrences of items from a dictionary (e.g. a wordlist) AND it would aggregate 
these counts.

So in the modal verbs example, not only the individual modal verbs (e.g. must, 
should) are counted, but also the class "modals" as a whole.

So it's more like a "DictionaryWordClassFeatureExtractor".
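
The aggregation idea could be sketched roughly as follows: every wordlist hit increments both a per-item counter and a counter for the whole class. `WordClassCounter`, `count` and the key scheme `className_form` are made up for this illustration and do not exist in DKPro TC.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration only, not an existing DKPro TC class.
class WordClassCounter {

    /**
     * Counts each wordlist item individually and, in addition, the word class as a whole.
     * The wordList is assumed to contain lower-cased surface forms.
     */
    static Map<String, Integer> count(List<String> tokens, Set<String> wordList, String className) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put(className, 0); // the aggregate entry is always present
        for (String token : tokens) {
            String form = token.toLowerCase(Locale.ROOT);
            if (wordList.contains(form)) {
                counts.merge(className + "_" + form, 1, Integer::sum); // individual item
                counts.merge(className, 1, Integer::sum);              // whole class
            }
        }
        return counts;
    }
}
```

For the tokens "You must go and you should stay, you must" with the wordlist {must, should, can}, this would yield separate entries for "must" and "should" plus one aggregated "modals" entry.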

Original comment by eckle.kohler on 10 Nov 2013 at 1:35

GoogleCodeExporter commented 9 years ago
Another option would be to rely on POS tagging instead of a dictionary.
STTS has specific categories for modals.
I see two advantages: (i) you don't need to add all forms, and (ii) you don't 
wrongly count surface forms that are not used as a modal in a certain context.
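
As a rough sketch of the POS-based variant: STTS marks modal verbs with the tags VMFIN (finite), VMINF (infinitive) and VMPP (past participle), so counting modals reduces to counting those tags. The class `SttsModalCounter` is hypothetical, and the sketch assumes the tags have already been produced by some tagger; how they would be read from DKPro annotations is not shown.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only, not an existing DKPro TC class.
class SttsModalCounter {

    // STTS modal verb tags: finite form, infinitive, past participle.
    static final Set<String> MODAL_TAGS = Set.of("VMFIN", "VMINF", "VMPP");

    /** Counts how many of the given POS tags are STTS modal verb tags. */
    static int countModals(List<String> posTags) {
        int n = 0;
        for (String tag : posTags) {
            if (MODAL_TAGS.contains(tag)) {
                n++;
            }
        }
        return n;
    }
}
```

No wordlist of inflected forms is needed here, but the result is only as good as the tagger's accuracy on these tags.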

Original comment by torsten....@gmail.com on 10 Nov 2013 at 5:19

GoogleCodeExporter commented 9 years ago
>>Another option would be to rely on POS tagging instead of a dictionary.
>>STTS has specific categories for modals.

I agree that this would in theory be preferable to word forms - however, only 
if the POS tagger is able to tag modal verbs accurately. This would have to be 
looked into. From my past experience with the STTS tagset / TreeTagger, I 
recall that some of these smaller word classes were tagged incorrectly, so 
counting the lexical items was less noisy.

Original comment by eckle.kohler on 11 Nov 2013 at 6:35

GoogleCodeExporter commented 9 years ago
@Judith: do you still have plans to solve this issue?

Original comment by daxenber...@gmail.com on 4 Jun 2014 at 11:55

GoogleCodeExporter commented 9 years ago
Yes, but not now - I first need to get an overview of the current state of TC, 
which I will do after the upcoming release.

Can you move it to the milestone after the upcoming release, please?

Original comment by eckle.kohler on 4 Jun 2014 at 12:08

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 4 Jun 2014 at 12:32

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 29 Aug 2014 at 10:50

GoogleCodeExporter commented 9 years ago
In order to determine word difficulty, I added some functions to determine 
adjective endings, help verbs, modal verbs and auxiliary words for English, 
German and French to de.tudarmstadt.ukp.dkpro.tc.features.readability.util. I 
then noticed the AdjectiveEndingFeatureExtractor and the 
ModalVerbsFeatureExtractor for English and this discussion.  
I also added WordListExtractors that check if a word occurs in a list. 

In both cases, I am not yet very happy with the solution, but maybe they can 
revive this discussion to generalize FEs to other languages.  

Original comment by lisa.bei...@gmail.com on 12 Mar 2015 at 11:29