seiflotfy / cuckoofilter

Cuckoo Filter: Practically Better Than Bloom
MIT License
1.14k stars 111 forks source link

Really high false negative rate in ScalableCuckooFilter #50

Open duongcongtoai opened 5 days ago

duongcongtoai commented 5 days ago

The feature of having a scalable cuckooFilter is really convenient but for my usecase the application cannot tolerate false negative. To reproduce

func Test_ScalableCuckooFilter_FalseNegative(t *testing.T) {
    filter := NewScalableCuckooFilter()

    // filter := NewFilter(100000)
    exist := []string{}
    removed := []string{}
    for i := 0; i < 15000; i++ {
        id := fmt.Sprintf("%d-", i) + uuid.NewString()
        exist = append(exist, id)
        removed = append(removed, id+"to-removed")
    }
    for i := 0; i < len(exist); i++ {
        filter.Insert([]byte(exist[i]))
    }
    for i := 0; i < len(exist); i++ {
        filter.Insert([]byte(removed[i]))
    }

    for i := 0; i < len(exist); i++ {
        filter.Delete([]byte(removed[i]))
    }
    for i := 0; i < len(exist); i++ {
        filter.Insert([]byte(removed[i]))
    }
    falseNegatives := []int{}
    for i := 0; i < len(exist); i++ {
        exist := filter.Lookup([]byte(exist[i]))
        if !exist {
            falseNegatives = append(falseNegatives, i)
        }
    }
    assert.Equal(t, 0, len(falseNegatives))
}

And the result

    scalable_cuckoofilter_test.go:43: 
            Error Trace:    scalable_cuckoofilter_test.go:43
            Error:          Not equal: 
                            expected: 0
                            actual  : 160
            Test:           Test_ScalableCuckooFilter_FalseNegative
--- FAIL: Test_ScalableCuckooFilter_FalseNegative (0.02s)
duongcongtoai commented 5 days ago

I think this happens because during Delete, it tries to remove the element from all the children, and accidentally remove the footprint of other elements in these children

    for _, filter := range sf.filters {
        if filter.Delete(data) {
            return true
        }
    }
duongcongtoai commented 2 days ago

https://cis.temple.edu/~wu/research/publications/Publication_files/Chen_ICNP_2017.pdf

According to the section "ReliableDelete" the fingerprint computation across children filter should be the same, but this is not the case

func getIndexAndFingerprint(data []byte, bucketPow uint) (uint, fingerprint) {

Each time a new filter is created, bucketPow changes

duongcongtoai commented 2 days ago
func configure(sfilter *ScalableCuckooFilter) {
    if sfilter.loadFactor == 0 {
        sfilter.loadFactor = DefaultLoadFactor
    }
    if sfilter.scaleFactor == nil {
        sfilter.scaleFactor = func(currentSize uint) uint {
            return currentSize * bucketSize * 2
        }
    }
    if sfilter.filters == nil {
        initFilter := NewFilter(DefaultCapacity)
        sfilter.filters = []*Filter{initFilter}
    }
}

If i make the scaleFactor to remain the same size as the previous filter, the test mentioned in this issue passes

duongcongtoai commented 2 days ago

But after adding this, the false positive rate also increases, which is also mentioned by the paper (false positive correlates with the count of the children bloom filter and fingerprint size)

I'm also opening a PR to this fork: https://github.com/panmari/cuckoofilter, which has large fingerprint size and provide more reliable FP rate