statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

German lemmas incorrect #20

Closed uniquer4ven closed 7 years ago

uniquer4ven commented 7 years ago

Hi, I just tried spaCy with R, applying it to a German text (after initializing it for the German language). Everything went fine, including the POS tagging, which I guess would not work if the initialization had failed. BUT: all the lemmas are identical to the surface words; the declension is always as in the original text. Do you have any solutions?

katzeHut<-c("Die Sonne schien nicht.", "Es war zu nass um zu spielen.", "Also saßen wir im Hause den ganzen bitterkalten, nassen Tag lang.")
obj <- run_annotators(katzeHut, as_strings = TRUE)
get_token(obj)
# A tibble: 26 × 8
      id   sid   tid   word  lemma  upos    pos   cid
   <int> <int> <int>  <chr>  <chr> <chr>  <chr> <int>
1      1     1     1    Die    die   DET    ART     0
2      1     1     2  Sonne  sonne  NOUN     NN     4
3      1     1     3 schien schien  VERB  VVFIN    10
4      1     1     4  nicht  nicht  PART PTKNEG    17
5      1     1     5      .      . PUNCT     $.    22
6      2     1     1     Es     es  PRON   PPER     0
7      2     1     2    war    war   AUX  VAFIN     3
8      2     1     3     zu     zu  PART   PTKA     7
9      2     1     4   nass   nass   ADJ   ADJD    10
10     2     1     5     um     um   ADP   APPR    15
# ... with 16 more rows
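The pattern in the table above can be checked mechanically: every entry in the lemma column is simply the lowercased surface form, which is exactly what you get when no lemmatizer ran at all. A minimal sketch of that check (plain Python, no spaCy required; the helper name is made up for illustration):

```python
def lemmas_untouched(pairs):
    """True if every (word, lemma) pair has lemma == lowercased word,
    i.e. no real lemmatisation took place."""
    return all(lemma == word.lower() for word, lemma in pairs)

# Rows taken from the tibble above: lemma is always the lowercased token.
rows = [("Die", "die"), ("Sonne", "sonne"), ("schien", "schien"),
        ("nicht", "nicht"), ("Es", "es"), ("war", "war")]
print(lemmas_untouched(rows))  # True: nothing was lemmatised
```

A proper lemmatizer would map "schien" to "scheinen" and "war" to "sein", so this check would return False on correctly lemmatised output.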
statsmaths commented 7 years ago

This seems to be an issue with spaCy itself, as the following Python code does not do lemmatisation either:

import spacy

de = spacy.load('de')                  # initialise the German model
doc = de(u'Die Sonne schien nicht.')
x = next(doc.sents)                    # first sentence

for word in x:
    print(word.lemma_)                 # prints the surface form, not a lemma

Have you tried / had this work on the Python side?

uniquer4ven commented 7 years ago

I did not try it, but I guess if you cannot get it working, I won't either. I am not sure how to proceed from here; will anyone act on this problem?

statsmaths commented 7 years ago

Well, this repository is just the R wrapper around spaCy. For core work on spaCy itself, you should open an issue on the spaCy repository directly. They are rapidly working on a number of updates, so my best guess is that this will be taken care of as an enhancement sooner rather than later.
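Until the upstream German model gains a real lemmatizer, one possible stopgap (not cleanNLP functionality; the table below is a toy example, not a real lexical resource) is to post-process the token table with a user-supplied form-to-lemma lookup, the same lookup-table approach spaCy later adopted for German:

```python
# Toy lookup table: real use would load a full German form -> lemma resource.
LEMMA_LOOKUP = {
    "schien": "scheinen",   # preterite of "scheinen"
    "saßen": "sitzen",      # preterite plural of "sitzen"
    "war": "sein",
    "nassen": "nass",
}

def lookup_lemma(word):
    """Return the table lemma, falling back to the lowercased surface form."""
    return LEMMA_LOOKUP.get(word.lower(), word.lower())

print(lookup_lemma("schien"))  # scheinen
print(lookup_lemma("Sonne"))   # sonne (not in the table, falls back)
```

The same idea could be applied on the R side with a named vector over the lemma column returned by get_token(); the coverage is only as good as the lookup resource you supply.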