wartaal / HanTa

The Hanover Tagger - A simple approach to lemmatization and POS-tagging of German morphology based on heuristics and hidden markov models
GNU Lesser General Public License v3.0

Key Error while running with function tag_sent_viterbi(self, sent, casesensitive) #2

Closed thaitrinh closed 3 years ago

thaitrinh commented 3 years ago

Hello,

I am getting this KeyError while using the HanTa package:

~\Anaconda3\envs\TestHanTa\lib\site-packages\HanTa\HanoverTagger.py in tag_sent_viterbi(self, sent, casesensitive)
    229                 if lp0 < rowbound:
    230                     continue
--> 231                 lp_t = self.LP_trans_word[prev].items()
    232                 for c, lp_tc in lp_t:
    233                     if c not in wprobs and len(wprobs) > 0:

KeyError: State(p2='$.', p1='<END>')

Has anyone experienced a similar issue? If so, what was your solution? Thank you very much!

wartaal commented 3 years ago

Dear Thaitrinh,

With which values are you calling this function, and why are you calling it? It is intended for internal use, not to be called by applications.

Best, Christian

thaitrinh commented 3 years ago

Hi Christian,

Thank you very much for your reply! I actually call the function tagger.tag_sent(my_sentence). Below is the error in more detail:

from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
...
---> 12     res = [x[1] for x in tagger.tag_sent(sent)]
     13     return res
     14 

~\Anaconda3\envs\va\lib\site-packages\HanTa\HanoverTagger.py in tag_sent(self, sent, taglevel, casesensitive)
    386 
    387     def tag_sent(self, sent, taglevel=1, casesensitive = True):
--> 388         tags = self.tag_sent_viterbi(sent,casesensitive)
    389         if taglevel == 0:
    390             return tags

~\Anaconda3\envs\va\lib\site-packages\HanTa\HanoverTagger.py in tag_sent_viterbi(self, sent, casesensitive)
    229                 if lp0 < rowbound:
    230                     continue
--> 231                 lp_t = self.LP_trans_word[prev].items()
    232                 for c, lp_tc in lp_t:
    233                     if c not in wprobs and len(wprobs) > 0:

KeyError: State(p2='$.', p1='<END>')

So I guess it has something to do with the LP_trans_word dictionary, right?

Thanks and best regards, Thai

wartaal commented 3 years ago

This does indeed seem to be a real error, but I don't understand it yet. Could you also give the sentence, or does the error occur with every sentence? I have to be able to reproduce the error before I can search for a solution.

DaryLee commented 3 years ago

Sadly, I am getting the same error: State(p2='NN', p1='')

I have cleaned text from a CSV file. I tokenize it into sentences within each cell, and the tagger breaks on the following sentence: "Hier sind auch die negative Reviews aus dem Anhang Filmkritik möglich (z.B. auch noch nicht bewertete Filme)."

With a regex I extract just the words and split by whitespace.

wartaal commented 3 years ago

Hi DaryLee,

Thanks for the example. However, I still could not reproduce the error. I guess there is some strange invisible character in the text, or there is a problem with the text encoding. I tried the following:

from  HanTa import HanoverTagger as ht
import nltk
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize('Hier sind auch die negative Reviews aus dem Anhang Filmkritik möglich (z.B. auch noch nicht bewertete Filme).')

Now words has the following value:

['Hier',
 'sind',
 'auch',
 'die',
 'negative',
 'Reviews',
 'aus',
 'dem',
 'Anhang',
 'Filmkritik',
 'möglich',
 '(',
 'z.B',
 '.',
 'auch',
 'noch',
 'nicht',
 'bewertete',
 'Filme',
 ')',
 '.']

Now I run:

tagger.tag_sent(words)

which gives me:

[('Hier', 'hier', 'ADV'),
 ('sind', 'sein', 'VAFIN'),
 ('auch', 'auch', 'ADV'),
 ('die', 'die', 'ART'),
 ('negative', 'negativ', 'ADJA'),
 ('Reviews', 'Review', 'NE'),
 ('aus', 'aus', 'APPR'),
 ('dem', 'dem', 'ART'),
 ('Anhang', 'Anhang', 'NN'),
 ('Filmkritik', 'Filmkritik', 'NN'),
 ('möglich', 'möglich', 'ADJD'),
 ('(', '--', '$('),
 ('z.B', 'z.b', 'ADJD'),
 ('.', '--', '$.'),
 ('auch', 'auch', 'ADV'),
 ('noch', 'noch', 'ADV'),
 ('nicht', 'nicht', 'PTKNEG'),
 ('bewertete', 'bewertet', 'ADJA'),
 ('Filme', 'Film', 'NN'),
 (')', '--', '$('),
 ('.', '--', '$.')]

This is linguistically not entirely correct, but I don't get a Python exception.
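
One way to check the guess about invisible characters is to scan the raw text before tokenizing it. The following is only a sketch: the helper name suspicious_chars and the choice of Unicode categories to flag are assumptions made for illustration and are not part of HanTa.

```python
import unicodedata


def suspicious_chars(text):
    """List (position, code point, Unicode name) for characters that are easy to miss by eye."""
    hits = []
    for pos, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Cc = control characters, Cf = format characters (e.g. zero-width space),
        # Zs = space separators; a plain ASCII space is considered harmless here.
        if cat in ("Cc", "Cf") or (cat == "Zs" and ch != " "):
            hits.append((pos, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits


# Invented sample containing a no-break space and a zero-width space.
sample = "Anhang\u00a0Film\u200bkritik"
for hit in suspicious_chars(sample):
    print(hit)
```

An empty result for the failing sentence would suggest that the problem lies elsewhere, for example in the tokenization step.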

thaitrinh commented 3 years ago

Hi Christian, sorry for my late reply!

I used try/except to print out the sentences that cause the error. However, just as you wrote in your answer to DaryLee, if I re-apply the tag_sent function to the printed-out sentences, I don't get any Python exceptions. So I still don't understand it.

I will take a closer look once I have more time and let you know.

Thank you and best wishes!
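
Printed sentences can lose exactly the detail that triggers the error, because invisible characters and the token boundaries do not survive a print and copy-paste round trip. A sketch of the try/except approach described above that records repr() of the exact input instead; collect_failures and the stand-in fake_tag_sent are hypothetical names, and with HanTa one would pass tagger.tag_sent as tag_fn:

```python
def collect_failures(tag_fn, sentences):
    """Return (index, repr of input, repr of error) for every input on which tag_fn raises KeyError."""
    failures = []
    for idx, sent in enumerate(sentences):
        try:
            tag_fn(sent)
        except KeyError as err:
            # repr() escapes non-printable characters and keeps the token list intact,
            # so the recorded value can be pasted back into Python unchanged.
            failures.append((idx, repr(sent), repr(err)))
    return failures


def fake_tag_sent(tokens):
    """Stand-in for a real tagger: fails on tokens containing a zero-width space."""
    for tok in tokens:
        if "\u200b" in tok:
            raise KeyError(tok)
    return [(tok, tok.lower(), "X") for tok in tokens]


sentences = [["Hier", "sind", "Filme", "."], ["Film\u200bkritik", "."]]
for failure in collect_failures(fake_tag_sent, sentences):
    print(failure)
```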

wartaal commented 3 years ago

I could now reproduce the error with a very strange sentence as input, consisting of an extremely long mixture of words and non-words.

I could solve the problem for this sentence, but more testing is needed to see whether the solution always works and whether it causes other errors. I guess there are still some similar situations that will now lead to a different error.

If you would like to experiment, here is my new code for the problematic method. I will, however, do some testing next week. I added two lines:

def tag_sent_viterbi(self, sent, casesensitive = True):
        lowerbound = -1e6
        table = []
        backpointer = []

        for i in range(len(sent)):
            w = sent[i]

            if i == 0:
                cs = False
            elif casesensitive:
                cs = True
            wprobs = dict(self.tag_word(w,casesensitive=cs,conditional=True))
            if len(wprobs) == 1 and 'UNKNOWN' in wprobs: #This should not occur but can result from wrong settings
                wprobs = {}
            row = {}
            backpointer.append({})
            if i == 0:
                prevrow = {State(p2=None, p1='<Start>'): 0.0}
            else:
                prevrow = table[i - 1]
            # Only continue with 5 top states
            if len(prevrow) > 5:
                rowbound = sorted(prevrow.values(), reverse=True)[5] - 1
            else:
                rowbound = lowerbound
            for prev in prevrow:
                lp0 = prevrow[prev]
                if lp0 < rowbound:
                    continue
                lp_t = self.LP_trans_word[prev].items()
                for c, lp_tc in lp_t:
                    if c not in wprobs and len(wprobs) > 0:
                        continue
                    if c == '<END>': #2020-11-11 We are not in the last row, so adding state <END> makes no sense
                        continue
                    if len(wprobs) ==  0: #If the word is unknown anything goes
                        lpwc = 0
                    else:
                        lpwc = wprobs[c]
                    lp = lp0 + lp_tc + lpwc
                    c2 = prev.p1
                    newstate = State(p2=c2, p1=c)
                    if lp > row.get(newstate, lowerbound):
                        row[newstate] = lp
                        backpointer[i][newstate] = prev
            table.append(row)
        # last row
        prevrow = table[-1]
        row = {}
        backpointer.append({})
        for prev in prevrow:
            lp0 = prevrow[prev]
            lp_t = dict(self.LP_trans_word[prev])
            lp = lp0 + lp_t.get('<END>', -math.inf)
            if lp > row.get('<END>', -math.inf):
                row['<END>'] = lp
                backpointer[-1]['<END>'] = prev
        table.append(row)

        if self._debug:
            pprint.pprint(table)

        tags = []
        state = '<END>'
        for i in range(len(backpointer) - 1, 0, -1):
            state = backpointer[i][state]
            tags.append(state.p1)

        return tags[::-1]

wartaal commented 3 years ago

With the update, the problem seems to be solved.
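
One plausible reading of the traceback and the fix above: when a word is unknown, any tag is allowed, so a state ending in '<END>' can be created in the middle of the sentence. That state has no entry in the transition table, and the lookup in the next row fails. The toy sketch below reproduces this mechanism. The table LP_TRANS and its values are invented for illustration and are not taken from the HanTa model, and expand and run are hypothetical helpers.

```python
from collections import namedtuple

State = namedtuple("State", ["p2", "p1"])

# Toy trigram transitions: State(last two tags) -> {next tag: log probability}.
# All values are invented for illustration only.
LP_TRANS = {
    State(None, "<Start>"): {"NN": -0.5, "$.": -1.0},
    State("<Start>", "NN"): {"$.": -0.3, "NN": -1.5},
    State("<Start>", "$."): {"<END>": -0.1, "NN": -2.0},
    State("NN", "$."): {"<END>": -0.1, "NN": -2.0},
    State("NN", "NN"): {"$.": -0.4, "NN": -1.2},
    State("$.", "NN"): {"$.": -0.4, "NN": -1.2},
}


def expand(prevrow, skip_end):
    """Compute one Viterbi row for an unknown word, i.e. every tag is allowed."""
    row = {}
    for prev, lp0 in prevrow.items():
        for c, lp_tc in LP_TRANS[prev].items():  # KeyError if prev ends in '<END>'
            if skip_end and c == "<END>":
                continue  # the added check: no '<END>' state before the last row
            newstate = State(p2=prev.p1, p1=c)
            lp = lp0 + lp_tc
            if lp > row.get(newstate, float("-inf")):
                row[newstate] = lp
    return row


def run(n_words, skip_end):
    """Expand n_words rows, starting from the initial state."""
    row = {State(None, "<Start>"): 0.0}
    for _ in range(n_words):
        row = expand(row, skip_end)
    return row


print(sorted(run(3, skip_end=True)))  # completes without an exception
try:
    run(3, skip_end=False)
except KeyError as err:
    print("KeyError:", err.args[0])
```

Without the check, the third row tries to look up a state whose p1 is '<END>', mirroring the KeyError reported at the top of this issue.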