tower120 / hi_sparse_bitset

Hierarchical sparse bitset
43 stars 1 forks source link

PartialOrd/Ord implementation possibility #35

Open 3Hren opened 5 months ago

3Hren commented 5 months ago

Hi again!

There is the final step to take full advantage of your great library.

Sometimes it is necessary to deduplicate bitsets or to build a mapping between bitsets and some values. There are two standard approaches: using HashMap or BTreeMap.

The first one requires to calculate bitset's hash, which, in case of large bitsets, can be painful. However, for BTreeMap it seems that a multilevel approach to comparing two bitsets will result in a very fast comparison, also lazy bitsets seems to perfectly fit in this case. Since the PartialEq is already implemented it seems that it will no harm to also implement PartialOrd/Ord. Or is there something that prevents this from being implemented?

tower120 commented 5 months ago

If you want to "deduplicate bitsets", why not substract? S1 - S2 will give you S1 without S2 elements.

If you want to map indices to values, you need https://github.com/tower120/hi_sparse_array . It is somewhat like hi_sparse_bitset, but last level points to the actual data. It is still in research phase.

And what is the meaning of sets order? When S1>S2?

3Hren commented 5 months ago

By deduplicating/mapping I meant that suppose we have the following mapping:

{
  0b10101 -> 0,
  0b10101 -> 1,
  0b10001 -> 2,
  ...
  0b10001 -> 100500,
}

I would like to transform it in a such way:

{
  0b10101 -> [0, 1],
  0b10001 -> [2, 100500],
  ...
}

Using BTreeMap this is trivial.

By order I mean that in fact each bitset can be represented as a very large unsigned integer. So why not allow to comparing them?

tower120 commented 5 months ago

Looks like you need hi_sparse_array's multi_merge. There is a version of it now in repository, but I rewriting it to new more space efficient data structure.

Ord is probably possible. But looks like what you want to do will be inefficient (since you'll need to process each bitset one-by-one). Again, looks like you need merge. And I don't clearly see how to implement Ord... Especially for non TRUSTED_HIERARCHY. It's not trivial algorithm (maybe I'm wrong).

tower120 commented 5 months ago

Well, if you "just" need Ord, and don't care about speed - you probably can compare block_iter() between two sets. (Don't forget to take block starts into count as well).

3Hren commented 5 months ago

Isn't it enough to just to compare levels one-by-one? First try level0 indices, if they are equal then level1, and finally, if again equal - block-by-block? I'm not familiar with TRUSTED_HIERARCHY internals, will take a look.

Well, if you "just" need Ord, and don't care about speed - you probably can compare block_iter() between two sets. (Don't forget to take block starts into count as well).

Of course I care, that's why I am here ;)

Anyway, thanks for the answer!

tower120 commented 5 months ago

For TRUSTED_HIERARCHY maybe that is enough (I can't see that far without practical implementation).

But if you work, for example, with lazy(!) result of intersection - raised bits in level bitmasks may lead to EMPTY data blocks. That case is called non-trusted-hierarchy. You can't trust hierarchy and just compare level bitmasks. From Eq implementation experience thou, there is a "special case" when you compare TRUSTED_HIERARCHY to !TRUSTED_HIERARCHY - then some optimization possible. Otherwise - (as far as I remember) it ends up with plain data block comparison one-by-one.

tower120 commented 5 months ago

It is just with multi_merge/fold_merge you basically OR bitmasks of all maps/arrays at once, and then just apply resolve function to intersected data items in the very last level. At least on paper that should be super-linearly faster then what you trying to do.

3Hren commented 5 months ago

Ah, I see it's more complicated than I thought, interesting nonetheless. Thanks for the clarification!