Consider increasing number of bits for solver document id and their names

fridex commented 3 years ago

Is your feature request related to a problem? Please describe.

As of now, we use 32 bits to generate a unique identifier for solver documents:

https://github.com/thoth-station/common/blob/da1fb803bb3b131bbbf21cc393250c306f05cf81/thoth/common/openshift.py#L1052-L1054

This can lead to collisions when generating solver ids. The collision might result in overwriting documents stored on ceph.

Describe the solution you'd like

Consider increasing the number of bits for solver document names respecting the probability for collisions and the number of packages we want to have solved. See https://en.wikipedia.org/wiki/Birthday_attack#Mathematics as a table computing the probability of collisions.

Note we do not distinguish solver type (operating system, python version) in the solver name itself. This means that all the solvers share "namespace" in which the solver name should be generated. It might be a good idea to encode this in the solver name long term if we want to include different solvers in the system so they do not collide.

Additional context

As of now, we have ~2.2*10^5 package releases solved for ubi8-py36, the probability of a hash collision is more than 75% for 32 bits (see the linked wikipedia table) just for one solver. This probability grows with more solvers.

Note number of releases on PyPI (~2.2*10^6). We could include this number (and its growth) when designing the solver name.

Note also that the name of the solver is used in the workflow itself. Increasing the number of characters in the name can lead to kubernetes/openshift limitations.

Note this mechanism is used also in other parts of the project. Thus we could rething the approach in other places as well (adviser, ...).

fridex commented 3 years ago

To avoid overwriting docs on Ceph, we could also:

definitely increase the number of documents we can create considering the hash collision probability
make sure the generated document id does not collide with an already existing one (before scheduling)
or... make sure the generated id does not collide with an existing one on the workflow start (after scheduling)
use unique constraints in the database on document id in all tables so we do not accidentally sync different document ids

goern commented 3 years ago

sounds to me like we should go for 1 and 2. 3 and 4 seem to implement a solution at the wrong control point, if the workflow figures out that an id collides, what shall it do? fail? regenerate the id? ... ?

fridex commented 3 years ago

3 and 4 seem to implement a solution at the wrong control point, if the workflow figures out that an id collides, what shall it do? fail? regenerate the id? ... ?

would generate the document id during the workflow itself (as a very first step). The problem here is that we use document id as a request identifier as well which makes it chicken-egg problem - we cannot schedule a workflow without knowing document id as well. In case of 3. we could have a translation table mapping request id to document id. In that case, request id does not need to have too many bits to distinguish requests as they can be valid for a shorter period of time, and hence they can fit to pod name/workflow name. Specifically, kubernetes pod name can be a limitation we can hit if we stick just with 1. and 2 (as kubernetes does not need to provide enough characters to encode 64bits and more). Hopefully, I'm clear here.
is just to make sure we do not sync these docs accidentally and assign wrong values to packages (e.g. solver results for ubi8-py38 overwrite ubi8-py36 results on collision) - I think we already do it, but double check valuable here.

goern commented 3 years ago

sounds to me like "lets separate document name/id from pod/workflow/.... name", the latter seem to be implementation details of the workflow, do not need to carry and document type (adsiver, solver, ...) or document id information at all?!

it also sounds like we should rethink if we can use https://docs.python.org/3/library/uuid.html#uuid.uuid5 ... the namespace could include document type, python package index. .... ?!

thoth-station / common

Consider increasing number of bits for solver document id and their names #1058