lemire / FastBitSet.js

Speed-optimized BitSet implementation for modern browsers and JavaScript engines
Apache License 2.0
157 stars 19 forks source link

Base64 import/export of a bitSet ? #22

Open glorieux-f opened 7 months ago

glorieux-f commented 7 months ago

Have you thought to implement an import/export of a BitSet as a base64 encoding string, so you can send or receive BitSets from server ? If you think it could be useful feature for your project, any suggestion for the implementation is welcome, I need this kind of thing for a project. I need around 200 sequential bits, so it makes a quite short simple string.

lemire commented 7 months ago

In practice...

bitSet.words

is an array of 32-bit integers.

So it is a simple matter of converting arrays of 32-bit integers to and from base64.

It ought to be fairly easy to do. I have never written a base64 codec in JavaScript but it is not very hard to do.

It may not be possible to use an existing solution, but coding it up is probably a good week-end project.

Pull request invited.

glorieux-f commented 7 months ago

Thanks for fast answer, and your pedagogy.

I thought that native function like atob() and btoa() should be faster than vanilla code, but according to this benchmark (that I have not reproduced), it may be faster to write our own functions. This is better for understanding, and also, control on some parameters (little or big endian ; safe chars in URL +/ -> -_). This time will be useful to ensure encoding/decoding server side (for this project it is PHP). Trying to write my own solution from scratch, I was mature enough to understand that people has a more tested code to acclimate.

It is 4:12 in the morning here (after some beers tonight), I’m going to bed happy to have found a working solution, but give me some sleep and tests before submit you a pull request.

glorieux-f commented 7 months ago

I’m implementing a BitSet for server side (in PHP), to work with the base64 build with this javascript client, and I have questions and comments about your js implementation.

lemire commented 7 months ago

resize(), is there a reason (performance?) to avoid sparse array and fill empty places with 0?

Because you want it to be backed by an array. If you want it to be backed by an object, you can just do let s = {}; t[1] = true;. There is no need for a library, but you might be disappointed in the performance or memory usage. It depends on your use case. For sure, if you are not using an array as the backend, it is not a bitset. It is something else.

I was surprised by your writing (1 << index) to modify a bit. « The right operand will be converted to an unsigned 32-bit integer and then taken modulo 32 »

You may need (1<<(index%32)) or (1<<(index%64)) depending on your programming language.

glorieux-f commented 7 months ago

(1<<(index%64))

I was using (index&63), do you recommend modulo ?

Because you want it to be backed by an array

Sorry to have not search enough before asking the question, you have already elaborate on this issue https://github.com/lemire/FastBitSet.js/issues/8. I’m concerned because of PHP “arrays”, sparse but memory expensive. My goal is not to be the best implementation in the world, but if possible, not the worst. A recommended package in PHP world is backed on an array of “bits”, yes, a package to have $bits[1] = true. I got a working prototype backed on a PHP array of int. For better memory control, it is possible in PHP to use strings, they are mutable. I’m afraid of the cost of copy and casting, benchmark will decide.

lemire commented 7 months ago

I was using (index&63), do you recommend modulo ?

If the index is non-negative, then the result should be the same. It is possible that index&63 gives faster results though uncertain.

glorieux-f commented 7 months ago

Some stats from PHP.

index&63 has no sensible effects on performances, compare to index%64.

It is difficult to have precise memory sizes in this language, the params are set here to verify expectations about sizes. Native boolean random array is faster, it appears with a size similar as native int array, but it could be very heavy if it is full. String is lighter and compete not so bad with arrays, but cast and copy of chars to int have a cost. Native int pseudo array in php maybe lighter than continuous if data are sparse.

what count time weight
BitNoop [0, 9999999] “no operation” bitset
writes 10000 0.01 s. 1.84 Kb
reads 100000 0.1 s. 1.88 Kb
BitBool [0, 9999999] sparse array of boolean
writes 10000 0.01 s. 645.7 Kb
reads 100000 0.12 s. 645.73 Kb
BitString [0, 9999999] string of bytes
writes 10000 0.04 s. 4.05 Mb
reads 100000 0.22 s. 4.05 Mb
BitIntSparse [0, 9999999] sparse array of ints
writes 10000 0.03 s. 694.55 Kb
reads 100000 0.13 s. 694.59 Kb
BitInt [0, 9999999] array of int, continuous
writes 10000 0.1 s. 10.12 Mb
reads 100000 0.18 s. 10.12 Mb

PHP has no typed arrays with fixed size, the best way for performances seems to pay for a bigger server.

glorieux-f commented 7 months ago

Real life usage. Communicate a set of ints to a server. The new short url

The previous ones (could be much more longer when only one item is excluded from a set).

Humans (or humanists, maybe not so common humans) can work with a corpus of 1000-2000 texts, and build personal sets among those items for a search. Usually, such apps do not allow persistent and communicable URLs, a set is available only for a session. This feature allow to communicate such partitions in 2000/6~336 chars, a descent URL, below the 2000 chars limit.

After experiment, it is not a good idea to pack base64 converters in FastBitSet (even if I done it in my code), because it is of a larger interest, especially to replace btoa(), deprecated in Node.js.

What could be nice for FastBitSet is import from, or export to: bytes ; because string of bytes is the common i/o for Base64 encoders. I’m not confident enough to propose a pull request, but each methods could be a oneliner:

// this.length = «the "logical size" of this BitSet: the index of the highest set bit in the BitSet plus one.» (@see ava.util.BitSet)
// internally, native js array is copied as a typed array, then, the Int32 are read as bytes 
const bytes = new Uint8Array(new Int32Array(this.words).buffer, 0, (this.length + 7) >> 3));
this.words = Array.from(new Int32Array(bytes.buffer));

Real life example where a form allow to build a Bitset, send to a server as a Base64 string. image