morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License
186 stars 44 forks source link

FSATraversal may return NOT_FOUND instead of AUTOMATON_HAS_PREFIX #92

Closed stevendolg closed 6 years ago

stevendolg commented 7 years ago

Hello,

I came across some behavior I find unexpected, but I am unsure if it is indeed unintended or not. Here's a unit test I created using release 2.1.3 of morfologik-fsa and morfologik-fsa-builders:

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

import morfologik.fsa.FSA;
import morfologik.fsa.FSATraversal;
import morfologik.fsa.MatchResult;
import morfologik.fsa.builders.FSABuilder;

public class MorfologikTest {

    @Test
    public void expected() throws Exception {
        List<byte[]> inputs = new ArrayList<>();
        inputs.add("a".getBytes("UTF-8"));

        FSA fsa = FSABuilder.build(inputs);

        FSATraversal fsaTraversal = new FSATraversal(fsa);
        MatchResult match = fsaTraversal.match("ax".getBytes("UTF-8"));

        assertEquals(MatchResult.AUTOMATON_HAS_PREFIX, match.kind);
    }

    @Test
    public void unexpected() throws Exception {
        List<byte[]> inputs = new ArrayList<>();
        inputs.add("a".getBytes("UTF-8"));
        inputs.add("ab".getBytes("UTF-8"));

        FSA fsa = FSABuilder.build(inputs);

        FSATraversal fsaTraversal = new FSATraversal(fsa);
        MatchResult match = fsaTraversal.match("ax".getBytes("UTF-8"));

        assertEquals(MatchResult.AUTOMATON_HAS_PREFIX, match.kind);
    }
}

The test "expected" is successful ("green"), but the test "unexpected" fails ("red"), because match.kind is "NO_MATCH".

Is "NO_MATCH" the intended result or is there something wrong with my test?

dweiss commented 7 years ago

The way it works is fine -- 'ax' returns no match because there was no such string in the dictionary. Prefix match is returned if your input string exists fully in the automaton, but does not correspond to any automaton string (think 'abc' encoded in the automaton and 'ab' query).

You can always traverse the automaton yourself -- FSATraversal are merely utilities, copy over the code and customize to your needs.

stevendolg commented 7 years ago

Sorry, maybe there's a misunderstanding.

In both cases the query "ax" is not accepted by the automaton, but its prefix "a" is. While in the first test the matching result is "AUTOMATON_HAS_PREFIX", in the second test it is "NO_MATCH". Shouldn't both cases produce the same matching result, since "a" is a prefix of "ax" and "a" is accepted by the automaton?

dweiss commented 7 years ago

In both of these cases no-match is the right result. I explained above what automaton_has_prefix means in this context, but look at the code and you'll see the semantics of these enums.

stevendolg commented 7 years ago

Well in the first case the actual result is MatchResult.AUTOMATON_HAS_PREFIX

dweiss commented 7 years ago

Darn, apologies Steven -- you're right, something is wrong here, I'll dig.

stevendolg commented 7 years ago

Much appreciated and no need to apologize!

dweiss commented 7 years ago

I looked at the code and to sure it looks wrong... it returns AUTOMATON_HAS_PREFIX only when the arc in the automaton is terminal (that is: there was no longer sequence encoded in the automaton and the processing needs to end).

I don't see it used anywhere in the code (other than a few irrelevant tests) and I wonder whether fixing this to actually work as it should is better than just removing the whole constant altogether... For larger automata there will nearly always be some kind of matching prefix (even if it's a single-letter one)... people who really need it can code manual traversal and those who do lookups will typically just need NOT_FOUND.

What is your scenario? How did you come across it?

stevendolg commented 7 years ago

I was trying to find the longest accepted string in an untokenized input sequence.

E.g.

This worked very well with a small test sample of license plate regions since I got a match of kind "AUTOMATON_HAS_PREFIX" and the first unmatched index in the input sequence. However when I completed the list of regions and added "Wels Umgebung", "Wien Umgebung", "Salzburg Umgebung" this suddenly stopped working and I got "NO_MATCH" instead.

I completely agree that a custom implementation of FSATraversal will do the trick and I think that's a valid solution. I was merely curious to see if that behavior was intentional or not, since I found it to be inconsistent.

dweiss commented 7 years ago

Thanks for filing the bug report and sorry again for hasty response -- this code hasn't been touched for years and it's funny things like that go unnoticed for so long (means nobody actually used this stuff!).

I think what I'll do is I'll fix the AUTOMATON_HAS_PREFIX to actually work as expected and correct the JavaDocs on that class. I'll need to review the use cases in existing code (not just in morfologik-stemming, but in other places as well) and I'm currently on short holidays, so I'll go back to it next week at the earliest. If you can temporarily copy/paste the traversal routine into your own code you'll have full control over how it works.

stevendolg commented 7 years ago

I can definitely do the traversal in our own code and since I'm still in the experimentation phase I'm not in a hurry either.

Thank you for your time and your work!

dweiss commented 6 years ago

Fixed and released, thanks Steven.