thoth-station / storages

Storage and database adapters for project Thoth
https://thoth-station.github.io/
GNU General Public License v3.0

A bug in the fuzzy hashing caused by bit width #2098

Closed · fridex closed this issue 3 years ago

fridex commented 3 years ago

Describe the bug

The input to the fuzzy hashing algorithm is not represented on the binary level as expected; see:

https://github.com/thoth-station/storages/blob/989b625cfd788da1b950cd458763598655234b2d/thoth/storages/graph/postgres.py#L3729-L3733

Python integers have a variable bit length, so different values contribute different amounts of data to the hash input. As a result, the fuzzy hashing does not work properly. Each integer should be encoded at a constant width (with zero padding) so that the fuzzy hashing can recognize differences across inputs of different sizes.

To Reproduce

>>> a = 1
>>> a.bit_length()
1
>>> a = 100
>>> a.bit_length()
7

Expected behavior

The hashing should be done on fixed-size integers.
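
As an illustration, zero padding every integer to a constant byte width makes values of any magnitude occupy the same space in the hash input (a minimal sketch; the 8-byte width is an assumption, not a value taken from the code):

```python
# Encode each integer at a constant byte width so that 1 and 100
# contribute the same number of bytes to the fuzzy-hash input.
# int.to_bytes raises OverflowError if the value does not fit the width.
WIDTH = 8  # assumed upper bound: 8 bytes covers any 64-bit id

assert len((1).to_bytes(WIDTH, byteorder="big")) == 8
assert len((100).to_bytes(WIDTH, byteorder="big")) == 8
```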

goern commented 3 years ago

@fridex what do we need to do about this?

fridex commented 3 years ago

> @fridex what do we need to do about this?

Use a fixed width for integers when hashing a list of hashes. Otherwise, fuzzy hashing will not produce reasonable results.
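
To illustrate the failure mode (a contrived example, not the actual storages code): without a fixed width, different lists of ids can collapse to the same hash input.

```python
# Variable-width encoding is ambiguous: [1, 100] and [11, 0, 0]
# both concatenate to the identical string "1100".
assert "".join(str(i) for i in [1, 100]) == "".join(str(i) for i in [11, 0, 0])

# Zero padding each integer to a fixed width removes the ambiguity.
assert "".join(f"{i:08d}" for i in [1, 100]) != "".join(f"{i:08d}" for i in [11, 0, 0])
```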

goern commented 3 years ago

Yes, and do we know all the places where we need to fix the width? Is that something a unit test should cover?

fridex commented 3 years ago

> Yes, and do we know all the places where we need to fix the width? Is that something a unit test should cover?

Yes, I think so.

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

fridex commented 3 years ago

/remove-lifecycle stale

KPostOffice commented 3 years ago

To solve this, we need to extend the _fuzzy_hash method to accept a width and come up with good numbers based on a reasonable MAX for each id field, and we will also need to update the hashes in the database, right?

fridex commented 3 years ago

> To solve this, we need to extend the _fuzzy_hash method to accept a width and come up with good numbers based on a reasonable MAX for each id field, and we will also need to update the hashes in the database, right?

Yes, that's correct. 👍🏻 The old entries can stay as they are. Since we have datetime/version info and old stacks are not that relevant to us, we can filter them out (so as not to spend too much time on this work).
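
A sketch of what the extension could look like; the signature and body here are assumptions (only the _fuzzy_hash name comes from the discussion above), and hashlib stands in for whatever fuzzy-hashing routine the code actually uses:

```python
import hashlib  # placeholder only; the real code uses a fuzzy hash


def _fuzzy_hash(ids: list[int], width: int = 8) -> str:
    """Hash a list of ids, encoding each one at a constant byte width.

    ``width`` should be derived from a reasonable MAX for the id field;
    8 bytes covers any 64-bit id.
    """
    data = b"".join(i.to_bytes(width, byteorder="big") for i in ids)
    return hashlib.sha256(data).hexdigest()
```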

goern commented 3 years ago

@fridex @KPostOffice can one of you take this and fix it?

fridex commented 3 years ago

Let's keep this at low priority. In the worst case, we keep bad hashes for the initial set of stacks; those hashes can be recomputed later once a fix is done. Also, we do not have any use of these hashes planned as of now.

/priority important-longterm

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

goern commented 3 years ago

/remove-lifecycle rotten
/lifecycle frozen

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/storages/issues/2098#issuecomment-904960590):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fridex commented 3 years ago

/reopen
/triage accepted
/remove-lifecycle rotten

sesheta commented 3 years ago

@fridex: Reopened this issue.

In response to [this](https://github.com/thoth-station/storages/issues/2098#issuecomment-905383577):

> /reopen
> /triage accepted
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/storages/issues/2098#issuecomment-926553703): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.