wartaal / HanTa

The Hanover Tagger - A simple approach to lemmatization and POS-tagging of German morphology based on heuristics and hidden Markov models
GNU Lesser General Public License v3.0

Word-splitting working for created test data, but not for HanTa analyzer? #1

Closed jfiala closed 4 years ago

jfiala commented 5 years ago

I added a few phrases for Anlagenbedienung/Anlagentechniker as a second corpus infile.

The words then show up correctly in the test data (labeledmorph_ger.csv):

4   Anlagenbedienung    Anlagenbedienung    NN  [('anlage', 'NN'), ('n', 'FUGE'), ('bedienung', 'NN'), ('', 'END_NN')]  ()
5   Anlagentechniker    Anlagentechniker    NN  [('anlage', 'NN'), ('n', 'FUGE'), ('techniker', 'NN'), ('', 'END_NN')]  ()
6   Anlagentechnikerin  Anlagentechniker    NN  [('anlage', 'NN'), ('n', 'FUGE'), ('techniker', 'NN'), ('in', 'SUF_NN'), ('', 'END_NN')]    ()

However, when running the tagger analyzer:

print(tagger.analyze('Anlagenbedienung'))
print(tagger.analyze('Anlagenbedienung', taglevel=0))
print(tagger.analyze('Anlagenbedienung', taglevel=1))
print(tagger.analyze('Anlagenbedienung', taglevel=2))
print(tagger.analyze('Anlagenbedienung', taglevel=3))

print(tagger.analyze('Anlagentechnik'))
print(tagger.analyze('Anlagentechnik', taglevel=0))
print(tagger.analyze('Anlagentechnik', taglevel=1))
print(tagger.analyze('Anlagentechnik', taglevel=2))
print(tagger.analyze('Anlagentechnik', taglevel=3))

print(tagger.analyze('Anlagentechniker'))
print(tagger.analyze('Anlagentechniker', taglevel=0))
print(tagger.analyze('Anlagentechniker', taglevel=1))
print(tagger.analyze('Anlagentechniker', taglevel=2))
print(tagger.analyze('Anlagentechniker', taglevel=3))

print(tagger.analyze('Anlagentechnikerin'))
print(tagger.analyze('Anlagentechnikerin', taglevel=0))
print(tagger.analyze('Anlagentechnikerin', taglevel=1))
print(tagger.analyze('Anlagentechnikerin', taglevel=2))
print(tagger.analyze('Anlagentechnikerin', taglevel=3))

It gives:

('Anlagenbedienung', 'NN')
NN
('Anlagenbedienung', 'NN')
('anlagenbedienung', 'NN')
('anlagenbedienung', [('anlagenbedienung', 'NN')], 'NN')
('Anlagentechnik', 'NN')
NN
('Anlagentechnik', 'NN')
('anlage+n+technik', 'NN')
('anlagentechnik', [('anlage', 'NN'), ('n', 'FUGE'), ('technik', 'NN')], 'NN')
('Anlagentechniker', 'NN')
NN
('Anlagentechniker', 'NN')
('anlage+n+techniker', 'NN')
('anlagentechniker', [('anlage', 'NN'), ('n', 'FUGE'), ('techniker', 'NN')], 'NN')
('Anlagentechnikerin', 'NN')
NN
('Anlagentechnikerin', 'NN')
('anlagentechnikerin', 'NN')
('anlagentechnikerin', [('anlagentechnikerin', 'NN')], 'NN')

So Anlagenbedienung and Anlagentechnikerin are not split correctly.

Can you give any hints as to why this happens, or is it possible to "debug" the model? Or should I tune some thresholds to adapt the behaviour?

Thx & best regards, Johannes

wartaal commented 5 years ago

Hello Johannes,

This behaviour is more or less as intended. The analysis can partly be changed by modifying the training data. For a long noun there are three possibilities for how it can be analyzed:

  1. The noun is known from the training data. No decomposition is made.
  2. The noun is unknown but can be decomposed into a likely sequence of nouns, gluing elements and possibly a suffix.
  3. No likely decomposition is found. The algorithm just guesses that the word is a noun.

In both cases you mention, option 3 applies.
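You can see which situation you are in from the taglevel-3 output: a split (whether known from the training data or guessed) shows up as a morpheme list with several elements, while case 3 shows up as a single-element list. A minimal sketch, assuming the pretrained model file morphmodel_ger.pgz from this repository (substitute your own model file); the helper name which_case is made up for illustration:

from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')

def which_case(word):
    # taglevel=3 returns (lemma, morpheme list, POS tag)
    lemma, morphemes, pos = tagger.analyze(word, taglevel=3)
    if len(morphemes) > 1:
        # Case 1 or 2: a decomposition is available.
        return 'decomposed: ' + '+'.join(m for m, t in morphemes)
    # Case 3: the whole word was guessed to be a single noun.
    return 'not decomposed (case 3)'

print(which_case('Anlagentechnik'))    # decomposed: anlage+n+technik
print(which_case('Anlagenbedienung'))  # not decomposed (case 3)

The expected outputs in the comments reflect the results shown earlier in this thread.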

I didn't care about the splitting of compounds but focused only on getting the correct POS and the correct lemma. To avoid case 1 we should put more effort into compound analysis when generating the training data. For case 3 I am not sure what the best solution would be. Maybe we should try a reanalysis if case 3 occurs and an output at level 2 or 3 is required.

jfiala commented 5 years ago

Hi Christian,

The strange thing is that the nouns are part of the training data (I added a separate corpus, though with only a few lines of occurrence) and they are split correctly in the training data file (labeledmorph_ger.csv).

So it should fall under case 1 or 2, not 3 (as the likely decomposition has already been guessed in labeledmorph_ger.csv). The question is what it would take for the words to end up in case 2, given that they are already split correctly in the training data.

Thx + Best regards, Johannes

wartaal commented 5 years ago

Hi Johannes,

Do I understand you correctly that you created training data and trained your own model? Great!

At least you could try to set observedValues=False, but I am not sure that this will give the effect you want. Could you send me the additional lines for the training data? Then I would first reproduce your finding and then see whether there is an easy solution (though the goal of the project was lemmatization, not compound splitting; but I admit that it would be nice if this worked as well, at least in easy cases).
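For reference, that experiment could look like the untested sketch below. It assumes observedValues is toggled as an attribute on the tagger, in the same way as the strict flag further down in this thread; whether it is actually an attribute or a keyword argument is not stated here, so treat that as an assumption.

from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')  # or your own model file

word = 'Anlagenbedienung'
print(tagger.analyze(word, taglevel=3))  # baseline analysis

tagger.observedValues = False  # assumption: toggled as an attribute
print(tagger.analyze(word, taglevel=3))  # compare against the baseline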

At the moment this is just a one-man project, and it may take a few days before I find time to look at issues posted here.

Thanks for testing and best wishes

Christian

wartaal commented 5 years ago

Sorry, now I see that you already posted the additional lines. I will have a look at them.

wartaal commented 5 years ago

The main problem is that, for the two words that are not split, the reading as one large unknown noun is more likely than the reading as a compound. Actually, this is a problem in other cases as well (see the paper on HanTa: https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_10.pdf). The cache doesn't cause any problems here. This problem is not so easy to solve: making unknown words less likely will cause problems elsewhere.

Another issue that might also play a role here is that the word "Technikerin" occurs once in the training data and is not analyzed as Techniker+in, probably because not enough evidence was found for Techniker being a correct word.

How did I find this out? There is a possibility to forbid guessing the POS of unknown words.

Let us first see the probabilities for the POS in the normal mode:

print(tagger.tag_word('Anlagenbedienung'))
print(tagger.tag_word('Anlagentechnik'))
print(tagger.tag_word('Anlagentechniker'))
print(tagger.tag_word('Anlagentechnikerin'))

The result is:

[('NN', -21.655552285443104)]
[('NN', -21.44377361905479)]
[('NN', -23.968152171848878), ('NE', -26.85250643276605)]
[('NN', -24.809275254826144), ('NE', -28.673124051669205)]

Now we set

tagger.strict = True

and execute the same lines. Now we get:

[('NN', -24.12283361905479)]
[('NN', -21.44377361905479)]
[('NN', -23.968152171848878)]
[('NN', -26.068558253865785)]

So we see that the words become less likely if we don't assume the existence of unknown words.
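To put a number on that: assuming these scores are natural logarithms of probabilities (the log base is not stated here, so this is an assumption), the gap between the two readings of Anlagenbedienung translates into a likelihood ratio:

import math

# Scores for 'Anlagenbedienung' from the outputs above:
normal = -21.655552285443104  # unknown-word reading allowed
strict = -24.12283361905479   # compound reading enforced (strict mode)

# Under the natural-log assumption, the unknown-word reading is
# exp(normal - strict) times as likely as the compound reading.
print(math.exp(normal - strict))  # ~11.8

So the unknown-word reading is roughly twelve times as likely as the compound reading, which is why the word is not split in the normal mode.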

The analyses we get now are:

('anlagenbedienung', [('anlage', 'NN'), ('n', 'FUGE'), ('bedienung', 'NN')], 'NN')
('anlagentechnik', [('anlage', 'NN'), ('n', 'FUGE'), ('technik', 'NN')], 'NN')
('anlagentechniker', [('anlage', 'NN'), ('n', 'FUGE'), ('techniker', 'NN')], 'NN')
('anlagentechnikerin', [('anlage', 'NN'), ('n', 'FUGE'), ('technikerin', 'NN')], 'NN')

You should not use the strict mode for analyzing text!! It will cause an exception if you try to analyze an unknown word! The mode is used only for building the model. Frequent words are always cached in the model. If these words are analyzed during training, the computation is sped up a bit by setting strict=True.

Another way to find out what is going on is to set

tagger._debug = True

This will give you the complete trellis diagram for each word.
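If you only need the diagram for a single word, you can toggle the flag around one call; a minimal sketch, assuming the trellis is simply printed when the word is analyzed:

tagger._debug = True
tagger.analyze('Anlagenbedienung', taglevel=3)  # trellis is printed as a side effect
tagger._debug = False  # switch the debug output off again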

Maybe a solution would be to enable the strict mode for external use and to return no analysis instead of raising an exception when no analysis is found.
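Until then, that behaviour can be approximated from the outside. A sketch, assuming only what is stated above, namely that strict mode raises an exception for unknown words; the exact exception type is not documented here, hence the broad except clause:

def analyze_strict_or_none(tagger, word, taglevel=3):
    # Force the compound reading and return None instead of raising
    # if no analysis is found.
    tagger.strict = True
    try:
        return tagger.analyze(word, taglevel=taglevel)
    except Exception:
        return None
    finally:
        tagger.strict = False  # always restore the normal mode

print(analyze_strict_or_none(tagger, 'Anlagenbedienung'))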

jfiala commented 5 years ago

Hi Christian,

Thank you for your in-depth analysis. Interesting that such an "easy" problem can cause such headaches :). I'll have a look at that and let you know then.

Best regards, Johannes