The simplest case of annotating text with Uby information is to annotate tokens
(based on their lemmas).
There is a lot that can be annotated at the token level. If you just consider
semantic tags, a wide variety of different "semantic tagsets" can be derived
from Uby and used for tagging.
Therefore, my impression was that it might be useful to keep information of the
specific "semantic Uby tagset" used for tagging.
>> However, Uby will probably be only one possible data source for such
information.
Sure, the information that is annotated is not Uby-specific at all. I just
mention Uby here, because it is the only lexical resource I am working with
(quite ok, since it contains 10 lexical resources ...)
So there is no need to mention Uby anywhere in the type names.
>> You want Uby specific stuff.
Actually, I cannot think of any Uby-specific stuff to annotate. All that Uby
provides is ordinary lexical information, but at a scale that is typically not
reachable by single lexical resources.
Original comment by eckle.kohler
on 26 Jun 2013 at 11:33
Would this require disambiguation first?
I guess that semantic tags are quite specific to senses.
Original comment by torsten....@gmail.com
on 26 Jun 2013 at 11:39
That depends on the specific semantic tagset used for annotating.
There are cases where disambiguation is not necessary or very simple.
For other semantic tags, the annotator might have to perform some kind of WSD.
Original comment by eckle.kohler
on 26 Jun 2013 at 11:51
Great. I am looking forward to the prototype.
Original comment by torsten....@gmail.com
on 26 Jun 2013 at 11:52
I wonder how we'll do the interfacing between DKPro Core and Uby:
a) have a "uby" module in DKPro Core with a couple of annotators
b) have a "uima" module in Uby with a couple of annotators
c) define resource APIs (e.g. "Dictionary") and generic annotators (e.g.
"DictionaryAnnotator") in DKPro Core and provide implementations of that in Uby.
I think "c" would definitely be the coolest one.
Original comment by richard.eckart
on 30 Jun 2013 at 5:12
I also like c) as it aligns best with the "Uby is an excellent source for
information xyz, but certainly not the only one" paradigm discussed above.
Original comment by torsten....@gmail.com
on 30 Jun 2013 at 5:57
c) +1
BTW: does this still fit with a UbyResourceLocator in uby? (which is living
there already in a uima module created today)
Original comment by eckle.kohler
on 30 Jun 2013 at 6:00
Sure, why not. I imagine for somebody wanting to code a custom component (not
resource) using Uby, the locator should be convenient.
At this point, I couldn't say it would be more convenient if a hypothetical
"UbyDictionary" would use it or if it would have its own internal Uby instance.
Original comment by richard.eckart
on 30 Jun 2013 at 6:07
I have a couple of questions and remarks regarding the DKPro-Core part of the
UBY-Core Interface:
- as a name for the generic interface I would prefer SemanticLabelProvider
instead of Dictionary. I see many similarities to the FrequencyCountProvider in
DKPro-Core, whereas Dictionary seems to be too focussed on the use of
dictionaries in my opinion.
This interface would define a method
String getSemanticLabel(String lemma, String POS, String semanticLabelType)
These parameters are actually necessary to implement a generic interface which
can also be implemented by a UbySemanticLabelProvider.
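The proposed interface could be sketched roughly as follows. Only the method signature getSemanticLabel(lemma, POS, semanticLabelType) comes from the proposal above; the class MapSemanticLabelProvider, its put method, the convention of returning null for unknown entries, and the sample data are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Generic lookup interface with the method signature proposed above. */
interface SemanticLabelProvider {
    /** Returns the label of the given type for a lemma/POS pair, or null if none is known (assumed convention). */
    String getSemanticLabel(String lemma, String pos, String semanticLabelType);
}

/** Hypothetical in-memory implementation; a Uby-backed one would query the database instead. */
class MapSemanticLabelProvider implements SemanticLabelProvider {
    private final Map<String, String> labels = new HashMap<>();

    /** Registers a label for one lemma/POS/label-type combination. */
    void put(String lemma, String pos, String semanticLabelType, String label) {
        labels.put(key(lemma, pos, semanticLabelType), label);
    }

    @Override
    public String getSemanticLabel(String lemma, String pos, String semanticLabelType) {
        return labels.get(key(lemma, pos, semanticLabelType));
    }

    /** Builds the composite map key from the three lookup parameters. */
    private static String key(String lemma, String pos, String semanticLabelType) {
        return semanticLabelType + "|" + pos + "|" + lemma;
    }
}
```

Keeping the label type as a parameter lets one provider serve several tagsets (e.g. semantic fields and domains) without one interface per tagset.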
Regarding the Dictionary interface in decompounding, I have a number of
questions and comments that might be discussed elsewhere.
- Is it necessary to implement the UbySemanticLabelProvider as a UIMA resource,
i.e. subclassing Resource_ImplBase in uimaFIT? The FrequencyCountProvider seems
not to be implemented this way.
- I definitely need an annotation type such as SemanticLabel or
SemanticCategory with two features, namely
type (type of the semantic label/category) and
value (value of the semantic label/category).
SemanticLabel might sound too UBY specific. However, the type would be very
general:
Examples:
type=semanticField, value=location, person, ...
type=domain, value=Computer, Education, Chemistry, ...
I tried to motivate that already in this discussion:
https://groups.google.com/forum/#!searchin/dkpro-core-developers/uby/dkpro-core-developers/_eCGNb8bUdE/gvV3loucYpAJ
but within this discussion, a kind of misunderstanding occurred.
The new annotation type I need would be quite general and not UBY-specific and
not at all related to the Types which are already available for Named Entities.
A UbySemanticLabelAnnotator will annotate the following word classes with a
semantic category or label: common nouns, main verbs, adjectives.
It will not annotate any proper nouns.
I could also introduce such an annotation type in Uby. But that might be a
first step to a parallel type system.
Best
Judith
Original comment by eckle.kohler
on 28 Jul 2013 at 8:00
Regarding a new annotation type for semantic field information from WordNet:
This kind of lexical information is actually well established in papers that
use lexical resources for IE or Text Classification.
However, it goes by different names in the literature:
- WordNet lexicographer file names (the very literal name of these tags)
- supersenses, supersense tagging
- semantic fields
I searched on the ACL anthology workbench to get some evidence:
http://aclasb.dfki.de/#txt~p|WordNet%20supersense* (17 hits)
http://aclasb.dfki.de/#txt~p|WordNet%20semantic%20field*doc~W04-0813*
They use semantic field features as well:
Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik
Goyal, Huying Li, Whitney Sanders and Eduard Hovy: Identifying Metaphorical
Word Use with Tree Kernels. NAACL HLT Meta4NLP Workshop, 2013.
I used this annotation too (extensively) in recent research (with good results).
So a type SemanticField with a "value" feature might be something worth
considering.
Judith
Original comment by eckle.kohler
on 31 Jul 2013 at 7:34
Here is my plan:
- create a new package dictionaryannotator.semantictagging in the module
dictionaryannotator-asl
- add to this new package: an Interface SemanticTagProvider, a UIMA resource
SimpleSemanticTagProvider and an annotator SimpleSemanticTagAnnotator that uses
a key-value map as resource (retrieved from a file). The annotator will use the
Named entity type for now or another generic one.
- add test cases for the SimpleSemanticTagAnnotator
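A minimal sketch of the file-backed key-value lookup mentioned in the plan above. The file format is not specified in this thread; the sketch assumes one "lemma<TAB>tag" pair per line, and the class name KeyValueTagMap and its load method are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical loader for a lemma-to-tag map (assumed format: lemma, tab, tag). */
class KeyValueTagMap {
    /** Reads all well-formed lines; blank lines and lines without a key and a tab are skipped. */
    static Map<String, String> load(Reader source) {
        Map<String, String> tags = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    continue; // no tab at all, or nothing before the tab
                }
                tags.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }
}
```

An annotator could then look up each token's lemma in the returned map and create an annotation only when a tag is found.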
The other side of the interface will go to UBY:
- create a new module uby.core-asl
- add resources that inherit from Resource_ImplBase and implement the
SemanticTagProvider: a UbySemanticFieldProvider, UbySemanticFrameProvider,
UbyDomainProvider
- add the corresponding annotators that annotate tokens (phrases will be
considered later) with these tags
(I will use existing annotation types for now)
any objections?
Original comment by eckle.kohler
on 2 Aug 2013 at 1:31
For the first shot, I'd suggest keeping all of the stuff in one module, either
on the Uby or on the DKPro Core side. I'd suggest dumping it into the
dictionaryannotator module right now. Moving code around to better places
and/or renaming can be done when it works.
Original comment by richard.eckart
on 2 Aug 2013 at 1:35
I finished the first round and implemented
- SemanticTagProvider (Interface)
- NounSemanticFieldResource
- NounSemanticFieldAnnotator
and a test class for the annotator:
- NounSemanticFieldAnnotatorTest along with a tiny test resource
nounSemanticFieldMapTest.txt
In the test class I use the AssertAnnotations.assertNamedEntity convenience
method from testing-asl. However, my test only turned green when I added a
modified version of assertNamedEntity without the parameter aExpectedMapped.
In my case, there is no mapping between original and DKPro-Core NE values/types.
The method I added looks like this:
public static void assertNamedEntity(String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
Isn't there a way to use the original method
assertNamedEntity(String[] aExpectedMapped, String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
in a way that does not assume a mapping? I tried several versions with
aExpectedMapped and aExpectedOriginal set to the same String[], but it did not
work.
Otherwise, can I add the
public static void assertNamedEntity(String[] aExpectedOriginal,
Collection<NamedEntity> aActual)
to AssertAnnotations?
Judith
Original comment by eckle.kohler
on 4 Aug 2013 at 11:48
Did you try passing "null" as aExpectedMapped? Looking at the method, it
should ignore that argument if it is null.
Original comment by richard.eckart
on 4 Aug 2013 at 11:52
yes, I did and it does not work:
AssertAnnotations.assertNamedEntity(null,documentNounSemanticFields,
select(aJCas, NamedEntity.class));
yields
java.lang.NullPointerException
at java.util.Arrays$ArrayList.<init>(Arrays.java:2842)
at java.util.Arrays.asList(Arrays.java:2828)
at de.tudarmstadt.ukp.dkpro.core.testing.AssertAnnotations.assertNamedEntity(AssertAnnotations.java:199)
at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.runAnnotatorTest(NounSemanticFieldAnnotatorTest.java:109)
at de.tudarmstadt.ukp.dkpro.core.dictionaryannotator.semantictagging.NounSemanticFieldAnnotatorTest.testGermanSeparatedParticles(NounSemanticFieldAnnotatorTest.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Original comment by eckle.kohler
on 4 Aug 2013 at 9:30
I've fixed the NPE in assertNamedEntity for your case.
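The change itself is not shown in this thread. As an illustration only, a null-tolerant comparison of the kind discussed above (skip the check when the expected array is null instead of passing it to Arrays.asList) might look like the sketch below. NullTolerantAssert and assertValues are hypothetical names, not the actual AssertAnnotations code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper showing a null-tolerant expected/actual comparison. */
class NullTolerantAssert {
    /** Compares expected and actual values in order; a null expected array means "do not check". */
    static void assertValues(String[] expected, List<String> actual) {
        if (expected == null) {
            return; // nothing to compare, and Arrays.asList is never called with null
        }
        List<String> expectedList = Arrays.asList(expected);
        if (!expectedList.equals(actual)) {
            throw new AssertionError("expected " + expectedList + " but was " + actual);
        }
    }
}
```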
Original comment by richard.eckart
on 5 Aug 2013 at 8:47
Thanks for fixing the assertNamedEntity, Richard.
I have a question regarding the key/value resource file that contains the noun
lemmas and their WordNet semantic field. Where should this resource go? Are
there any naming conventions for such files?
The size of the file is 2.3 MB.
Original comment by eckle.kohler
on 5 Aug 2013 at 8:05
I thought the idea was to access the Uby database directly?
Otherwise, I suppose this would be a resource to be packaged as a JAR file and
to go into the Maven repository.
Original comment by richard.eckart
on 5 Aug 2013 at 8:15
>> I thought the idea was to access the Uby database directly?
right, this is the idea.
The file resource with the WordNet semantic fields just turned out to be very
useful and broadly applicable, so I extracted this information into a file for
efficiency reasons.
I also thought other people might be interested in using it, because it
does not require installing a database.
Now I will implement 2 UBY specific pairs of resources and annotators:
- UbySemanticPredicateResource and UbySemanticPredicateAnnotator (will use the
type SemanticPredicate)
- UbyDomainLabelResource and UbyDomainLabelAnnotator (will use the type field
from api.structure)
These will access the UBY DB directly and also exploit the sense links in
particular ways.
Original comment by eckle.kohler
on 6 Aug 2013 at 3:36
So currently, we have these build.xml files which download resources from their
original websites, package them, and upload them to our Maven repository. If
there is no "original website" for a resource, e.g. for your list, we so far
host them in the downloads section of the DKPro Core ASL google project (which
will go away soon, so some different hosting location will be required).
Original comment by richard.eckart
on 6 Aug 2013 at 8:44
For the UBY specific resources I need to create a mapping between
- Core POS tags and UBY POS tags
- Core language information (ISO 2-letter code) and UBY language information
(ISO 3-letter code)
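For the language part of this mapping, the JDK can already convert a two-letter code into a three-letter one. The sketch below assumes that the three-letter codes returned by java.util.Locale (for example "deu" rather than "ger" for German) are the ones UBY expects, which is not confirmed in this thread; LanguageCodes and toIso3 are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical helper for converting language codes via the JDK's Locale data. */
class LanguageCodes {
    /** Converts an ISO 639-1 two-letter code into the three-letter code the JDK knows for it. */
    static String toIso3(String iso2) {
        return Locale.forLanguageTag(iso2).getISO3Language();
    }
}
```

If UBY turns out to use different identifiers, an explicit lookup table would be the safer choice.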
Is it sensible to assume that for all the POS taggers integrated in DKPro-Core
(English and German), a mapping exists that maps the original POS tags to Core
POS types?
Original comment by eckle.kohler
on 10 Aug 2013 at 5:12
German POS models usually use STTS and English POS models usually use PTB. Both
are mapped.
Are the UBY POS tags language specific?
Original comment by richard.eckart
on 10 Aug 2013 at 5:15
>> German POS models usually use STTS and English POS models usually use PTB.
Both are mapped.
fine.
>> Are the UBY POS tags language specific?
No, they are designed to be language-independent.
But a Uby-specific resource that implements the getSemanticTag method needs POS
and lemma information to access the lexical entry.
And the language information to pre-select the Uby lexicon to use.
This is important in order to throw appropriate exceptions that inform the user
if e.g. the German lexicon GermaNet is missing in UBY.
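The lexicon pre-selection with an informative exception could be sketched like this. The pairing of German with GermaNet comes from the comment above; the pairing of English with WordNet, the class LexiconSelector, and the choice of exception types are assumptions for illustration and do not reflect the actual UBY API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical selector that picks a lexicon by language and fails with a clear message. */
class LexiconSelector {
    private final Map<String, String> lexiconByLanguage = new HashMap<>();
    private final Set<String> availableLexicons;

    /** @param availableLexicons names of the lexicons actually present in the database */
    LexiconSelector(Set<String> availableLexicons) {
        this.availableLexicons = availableLexicons;
        lexiconByLanguage.put("de", "GermaNet");
        lexiconByLanguage.put("en", "WordNet"); // assumed pairing
    }

    /** Returns the lexicon name for a language, or throws if it is unmapped or not installed. */
    String select(String language) {
        String lexicon = lexiconByLanguage.get(language);
        if (lexicon == null) {
            throw new IllegalArgumentException("No lexicon configured for language '" + language + "'");
        }
        if (!availableLexicons.contains(lexicon)) {
            throw new IllegalStateException(
                    "Lexicon " + lexicon + " (needed for language '" + language + "') is missing from the database");
        }
        return lexicon;
    }
}
```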
Original comment by eckle.kohler
on 10 Aug 2013 at 5:28
Issue 169. Committed UbySemanticFieldResource, UbySemanticFieldAnnotator and
UbyResourceUtils.
The test class UbySemanticFieldAnnotatorTest runs successfully against a real
(MySQL) DB; because it depends on that DB, the test method is marked as ignored.
A suitable test case for an in-memory UBY DB should be added.
Original comment by eckle.kohler
on 12 Aug 2013 at 8:40
A test case for an in-memory UBY DB was added.
See http://code.google.com/p/dkpro-core-asl/source/detail?r=1791
Original comment by eckle.kohler
on 18 Aug 2013 at 1:42
Original comment by richard.eckart
on 12 Sep 2013 at 7:59
I think the NounSemanticFieldAnnotator and the NounSemanticFieldAnnotatorTest
can be removed.
Additional parameters that could be added to the SemanticFieldAnnotator:
- maybe language (?)
- token vs. phrase annotation
Original comment by eckle.kohler
on 14 Sep 2013 at 8:40
Original comment by richard.eckart
on 17 Sep 2013 at 2:42
Original comment by richard.eckart
on 26 Mar 2014 at 10:51
I believe we do now have implementations of the ideas presented here on the
sides of DKPro Core in the dictionaryannotator module and on the side of Uby in
the form of resources that can be used with the dictionaryannotator code,
right? If so, this could be resolved.
Original comment by richard.eckart
on 26 May 2014 at 10:17
Separate issues could be opened for specific extensions, e.g. for passing the
language through.
Original comment by richard.eckart
on 26 May 2014 at 10:18
>>I believe we do now have implementations of the ideas presented here on the
sides of DKPro Core in the dictionaryannotator module and on the side of Uby in
the form of resources that can be used with the dictionaryannotator code, right?
Actually, this issue should be closed as won't fix.
Another issue could be opened titled "Tag text with information from
wordlists". And this issue can be marked as resolved.
The resource AND annotators that tag text with information from Uby have been
moved to Uby. The reason for this was the fact that Uby is not yet on Maven
Central.
>> Separate issues could be opened for specific extensions, e.g. for passing
the language through.
Right.
Another extension would be to tag not only tokens, but also noun chunks.
I have already implemented that, but I would need help setting up the test
case, because last time I could not find out how chunks are composed/built in a
test case.
Original comment by eckle.kohler
on 27 May 2014 at 6:50
Renaming and closing as fixed.
Original comment by richard.eckart
on 27 May 2014 at 8:08
Original issue reported on code.google.com by
torsten....@gmail.com
on 26 Jun 2013 at 10:46