DavidNemeskey opened this issue 7 years ago (status: Open)
The config file you provided does not use emToken. Please, make clear which module causes the problem (emMorph+emLem? emTag?), and change the issue accordingly.
You are right, I meant emLem (or whatever `hu.nytud.gate.morph.HFSTMorphAndLemma` is called). As I said before, the third step in the config file (emTag) is not necessary to reproduce the problem, but perhaps it gives more information. So it is either the first or the second step; I guessed the second (emLem), because when only the first step is run, the output seems OK (aside from #9).
OK, as I suspected, the problem is the `.<newline>.` part. `HungarianTokenizerSentenceSplitter` returns it as a single `..` token. However, `HFSTMorphAndLemma` takes the string from the original text, not from the tokenizer, and so it passes the newline on to HFST, which perceives it as two words.
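A minimal sketch of the mismatch described above (the data structures and the sample input are illustrative assumptions, not the actual GATE API): the tokenizer merges `.<newline>.` into one `..` token, but re-reading the token's span from the original text yields the newline back, so HFST sees two words.

```python
# Hypothetical token representation (offsets into the document plus the
# normalized string the tokenizer computed) -- NOT the real GATE API.
text = "Vege.\n.Uj mondat."                 # assumed input containing .<newline>.
token = {"start": 4, "end": 7, "string": ".."}

from_tokenizer = token["string"]                    # '..'   -> one word
from_original = text[token["start"]:token["end"]]   # '.\n.' -> two words for HFST

print(repr(from_tokenizer))  # '..'
print(repr(from_original))   # '.\n.'
```

The two values differ exactly in the embedded newline, which is the character HFST interprets as a word boundary.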
There could be two solutions to this issue:

1. `HungarianTokenizerSentenceSplitter` should not join token (fragment)s separated by newline(s);
2. `HFSTMorphAndLemma` should consume the token returned by the tokenizer, and not look at the original text.

I vote for the second option.
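A hedged sketch of what the second option could look like (function and feature names are illustrative assumptions, not the real GATE API): the analyser prefers the tokenizer's `string` feature and only falls back to slicing the document text by offsets, which may still contain newlines.

```python
# Illustrative sketch of option 2 -- not the actual HFSTMorphAndLemma code.
def word_for_analysis(doc_text, token):
    """Return the word form to send to the morphological analyser."""
    if "string" in token:
        return token["string"]                     # tokenizer-normalized form
    return doc_text[token["start"]:token["end"]]   # raw-span fallback

doc = "foo.\n.bar"
tok = {"start": 3, "end": 6, "string": ".."}
print(word_for_analysis(doc, tok))                      # '..'
print(word_for_analysis(doc, {"start": 3, "end": 6}))   # fallback: '.\n.'
```

With this, the analyser and the tokenizer always agree on what counts as one token, regardless of whitespace in the underlying document.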
BTW, the source of the HFST wrappers should be made available as well.
@sassbalint Opinions? Could you also assign the people responsible for these components to this issue? Thx.
Created a pull request (#12) in which `HFSTMorphAndLemma` takes the token from the tokenizer (the `string` feature). However, at least the next module (PurePos) also reads the original text, so this is still not a final solution. It would be really good to have an agreement about this.
Ping.
When using the MagyarLanc tokenizer with emMorph+emLem, some input can get the output messed up. The `anas` feature of the next word will be empty, and its analysis will be "shifted" to the word after that. This shift affects the whole text, but it skips space tokens, which remain correct. The shift adds up: i.e. if there is another problematic token down the line, the analyses of the words after it will be shifted by two positions, and so on. What's more, the shifted analyses stay in the memory of the GATE server and will affect any subsequent text passed to the server.

Because GitHub and attachments don't go well together, I list the information needed to reproduce this problem below. The third step in the pipeline is not required, but I added it so that the lemma is exposed as well.

As I see it, the problem is caused by the `.<newline>.` part, which the tokenizer tokenizes into a single `..` token (BTW, the downloadable version of ML3 returns two separate tokens in this case). However, emLem (I guess) returns the lemma with the newline in it. Whether this is related to the problem or not, I don't know.

Input:
Configuration:
Relevant parts of the output: