Closed fridex closed 3 years ago
@fridex what do we need to do about this?
Use a fixed width for integers when hashing a list of hashes. Otherwise fuzzy hashing will not produce reasonable results.
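To illustrate the problem described above, here is a minimal sketch (the helper name `join_ids` is hypothetical, not from the repository): with variable-width integers, two different id lists can concatenate to the same hash input, while zero-padding to a fixed width keeps them distinct.

```python
def join_ids(ids, width=None):
    """Concatenate integer ids into the string that would be hashed.

    With width=None the integers keep their natural (variable) width,
    which is the buggy behavior; with a fixed width each integer is
    zero-padded so the byte layout is stable.
    """
    if width is None:
        return "".join(str(i) for i in ids)  # variable width (buggy)
    return "".join(str(i).zfill(width) for i in ids)  # fixed width


# Variable width: [1, 23] and [12, 3] both become "123" -> same hash input.
assert join_ids([1, 23]) == join_ids([12, 3])
# Fixed width: "00010023" vs "00120003" -> distinct hash inputs.
assert join_ids([1, 23], width=4) != join_ids([12, 3], width=4)
```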
Yeah, and do we know all the places where we need to fix the width? Is that something that a unit test should cover?
Yes, I think so.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
To solve this we need to extend the _fuzzy_hash method to accept a width and come up with good numbers based on a reasonable MAX for each id field, and we will also need to update the hashes in the database, right?
Yes, that's correct. 👍🏻 The old entries can stay as they are. As we have datetime/version info and old stacks are not that relevant to us, we can filter them out (so as not to spend too much time on this work).
@fridex @KPostOffice can one of you take this and fix it?
Let's have this low priority. In the worst case, we will have bad hashes for an initial set of stacks kept. These hashes can be recomputed later on once a fix is done. Also, we do not have any use of these hashes planned as of now.
/priority important-longterm
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@sesheta: Closing this issue.
/reopen
/triage accepted
/remove-lifecycle rotten
@fridex: Reopened this issue.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
@sesheta: Closing this issue.
Describe the bug
The input to the fuzzy hashing algorithm does not behave on the binary level as expected, see:
https://github.com/thoth-station/storages/blob/989b625cfd788da1b950cd458763598655234b2d/thoth/storages/graph/postgres.py#L3729-L3733
Integers of different magnitudes are encoded with different widths, so the size allocated per integer varies and the input layout shifts, which prevents the fuzzy hashing from working properly. Each integer should be assigned a constant width (with zero padding) so that fuzzy hashing can recognize differences across inputs of different sizes.
To Reproduce
Expected behavior
The hashing should be done on fixed-size integers.
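One way to get truly fixed-size integers at the binary level (a sketch, assuming the ids fit in 64 bits; `encode_ids` is a hypothetical helper, not from the repository) is to pack each id into a constant number of bytes before hashing:

```python
import struct


def encode_ids(ids):
    """Pack each integer id into 8 big-endian bytes.

    Every id occupies exactly the same number of bytes regardless of
    its magnitude, so the hash input layout never shifts.
    """
    return b"".join(struct.pack(">Q", i) for i in ids)


# Both a small and a large id occupy exactly 8 bytes.
assert len(encode_ids([1])) == len(encode_ids([10**9])) == 8
# Lists that would collide under variable-width decimal concatenation stay distinct.
assert encode_ids([1, 23]) != encode_ids([12, 3])
```

Compared with decimal zero-padding, binary packing avoids having to pick a per-field width from a MAX estimate, at the cost of assuming every id fits in the chosen integer size.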