gliwka / hyperscan-java

Match tens of thousands of regular expressions within milliseconds - Java bindings for Intel's hyperscan 5
BSD 3-Clause "New" or "Revised" License
177 stars 45 forks source link

Hyperscan java wrapper throws an exception when one scratch space is used with multiple databases #220

Open apismensky opened 10 months ago

apismensky commented 10 months ago

This article https://intel.github.io/hyperscan/dev-reference/performance.html#allocate-one-scratch-space-per-scanning-context says that one scratch space can be used with multiple databases. But this snipped of code produces an exception:

public class BugReproduce {
    public static void main(String[] args) throws Exception {
        try (
                Database dbA = Database.compile(new Expression("a"));
                Database dbB = Database.compile(new Expression("b"));
                Scanner scanner = new Scanner()
        ) {
            scanner.allocScratch(dbA);
            List<Match> matchesA = scanner.scan(dbA, "a"); // works
            showMatches(matchesA);
            scanner.allocScratch(dbB);
            matchesA = scanner.scan(dbA, "a"); // Exception - An invalid parameter has been passed. Is scratch allocated?
            showMatches(matchesA);
        }
    }
    private static void showMatches(List<Match> matches) {
        if (!matches.isEmpty()) matches.forEach(m -> System.out.println("Match start: " + m.getStartPosition() + ", end: " + m.getEndPosition()));
        else System.out.println("No matches in hyperscan");
    }
}

Most probably the bug is in the java wrapper, cause the following C++ code works fine (no error):

TEST(order, alexey2) {
    // patterns for dbA
    vector<pattern> patternsA;
    patternsA.push_back(pattern("a", 0 , 1));
    // compile dbA
    hs_database_t *dbA = buildDB(patternsA, HS_MODE_NOSTREAM);
    ASSERT_NE(nullptr, dbA);
    // Allocate scratch space
    hs_scratch_t *scratch = nullptr;
    hs_error_t err = hs_alloc_scratch(dbA, &scratch);
    ASSERT_EQ(HS_SUCCESS, err);
    // Scan string dataA against dbA
    const char *dataA = "a";
    CallBackContext cA;
    err = hs_scan(dbA, dataA, strlen(dataA), 0, scratch, record_cb,(void *)&cA);
    ASSERT_EQ(HS_SUCCESS, err);
    // Verify one match
    EXPECT_EQ(1, countMatchesById(cA.matches, 1));
    // Patterns for dbB
    vector<pattern> patternsB;
    patternsB.push_back(pattern("b", 0 , 1));
    // compile dbB
    hs_database_t *dbB = buildDB(patternsB, HS_MODE_NOSTREAM);
    ASSERT_NE(nullptr, dbB);
    // Use the same scratch for dbB
    err = hs_alloc_scratch(dbB, &scratch);
    ASSERT_EQ(HS_SUCCESS, err);
    // Scan string dataB against dbB
    const char *dataB = "b";
    CallBackContext cB;
    err = hs_scan(dbB, dataB, strlen(dataB), 0, scratch, record_cb,(void *)&cB);
    ASSERT_EQ(HS_SUCCESS, err);
    // Verify one match
    EXPECT_EQ(1, countMatchesById(cB.matches, 1));
    // Cleanup
    err = hs_free_scratch(scratch);
    ASSERT_EQ(HS_SUCCESS, err);
    hs_free_database(dbA);
    hs_free_database(dbB);
}
apismensky commented 10 months ago

The reason for the bug is that every call allocScratch calls scratch.registerDeallocator(); and this guy calls hs_scratch_t p = new hs_scratch_t(this); which effectively calls hs_scratch_t method twice for the same scanner object. We may fix that by checking if the deallocator instance is created, for example:

private static class NativeScratch extends hs_scratch_t {
        void registerDeallocator() {
            if (deallocator() != null) {
                hs_scratch_t p = new hs_scratch_t(this);
                deallocator(() -> hs_free_scratch(p));
            }
        }

@gliwka - lmk I can create a PR for that, but wanted to know what you think?

gliwka commented 10 months ago

@apismensky Thanks for the analysis. PR is welcome :-)

cspurk commented 1 week ago

Thanks for providing the fix @apismensky and for merging it @gliwka! Would it be possible to make a new release to easily make the fix available (and close this issue)?