Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.

Roadmaps for Binary File Format implementation #1751

Closed: OmnipotentEntity closed this issue 1 year ago

OmnipotentEntity commented 6 years ago

As mentioned in #1740, there are a number of things that @gcp wants to address when rethinking how network weights are handled and implementing a binary format.

OmnipotentEntity commented 6 years ago

Re: the 16-bit floating-point format. Why not examine all of the weights and use the minimum number of exponent bits needed to cover the range? Then the rest can simply be significand. Finally, store the number of exponent bits in a nibble in the header.
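
For concreteness, here is a rough sketch of what that analysis could look like; it's an illustration only, not a format anyone has agreed on:

```python
import math
import numpy as np

def exponent_bits_needed(weights):
    """Estimate how many exponent bits a custom 16-bit float would need
    to cover the dynamic range of the given weights (zeros are ignored)."""
    w = np.abs(np.asarray(weights, dtype=np.float64))
    w = w[w > 0]
    exponents = np.frexp(w)[1]                 # binary exponent of each weight
    span = int(exponents.max() - exponents.min()) + 1
    return max(1, math.ceil(math.log2(span)))  # bits needed to index that exponent span

# e.g. weights spanning roughly 1e-4 .. 1e+1 need about 5 exponent bits,
# leaving 16 - 1 (sign) - 5 = 10 bits for the significand.
```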

gcp commented 6 years ago

Why not examine all of the weights and use the minimum number of exponent bits to cover the range needed? Then the rest can simply be significand?

Sounds like a good approach, though you will want to make sure the significand has at least, say, 7 bits of precision or so?

If you can't combine that with the exponent, you may have to round-to-zero for small values and clamp big ones.
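
As an illustration of that fallback (the two bounds here are assumed to come from whatever exponent/significand split was chosen, not from any actual Leela format):

```python
import numpy as np

def flush_and_clamp(weights, min_representable, max_representable):
    """Precondition weights for a narrow exponent range: values too small to
    represent are rounded to zero, values too large are clamped."""
    w = np.asarray(weights, dtype=np.float32)
    w = np.where(np.abs(w) < min_representable, 0.0, w)        # round-to-zero
    return np.clip(w, -max_representable, max_representable)   # clamp big ones
```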

odeint commented 6 years ago

Decide on a robust, extensible binary format

How about using HDF5 as a container format instead of rolling our own standard? It will be slightly less efficient than pure arrays just serialised and dumped (plus some minimal metadata), but there is a lot of infrastructure around HDF5 already, and it is very extensible.

gcp commented 6 years ago

How about using HDF5

No, the HDF5 libs are not a reasonable dependency to add to Leela for what little value they'd offer. Leela Chess uses protobuf in some places, but I'm not too fond of that either, for similar reasons.

Note that neither HDF5 nor protobuf would have helped with any of the points mentioned. (Aside from "extensible", where I'm not even sure what that means in this context. You're not going to extend anything if the processing code in the engine doesn't support it in the first place.)

OmnipotentEntity commented 6 years ago

I have previously created a JSON-based binary format, which I used in a game I published. This might be overkill, but it is an option.

By extensible, I meant that the file format can be extended to handle other situations (without having to change the file parsing code significantly, and without harming backwards compatibility), such as the ELF weight stuff the project has had to adapt to in the past.

jokkebk commented 6 years ago

Files are so large that one could conceivably put in the decoding code first, then just read that in, eval it and use the function to "decompress" the data format. "What could possibly go wrong" 😄

If we used separate "compress/decompress" scripts after downloading to produce a current-format weight file (slightly slow, but it would introduce no changes to the rest of the pipeline), we could even arrange a competition out of it: who'll make the smallest network files with reasonable packing/unpacking speed?

(just to clarify: latter is an actual idea, first paragraph is just attempt at humor)

gcp commented 6 years ago

I considered that (just wrapping the existing format with encoder/decoder scripts) because it's much more manageable than updating all the code that deals with networks.

But you can't solve the missing batchnorm layer that way, unfortunately.

marcocalignano commented 6 years ago

We do not need to reinvent the wheel, have a look at https://github.com/LLNL/zfp

gcp commented 6 years ago

Looks interesting, but I doubt our weight data has 2D spatial correlation, so it's not clear how well it would work.

odeint commented 6 years ago

We do not need to reinvent the wheel, have a look at https://github.com/LLNL/zfp

That looks really cool. Reading the paper and following the literature a bit, I also found this development: https://github.com/disheng222/SZ http://www.mcs.anl.gov/papers/P5437-1115.pdf

In the paper they explain that they try to exploit smoothness in the data (which we don't really have, at least not obviously), but in case none is found:

Effective Error-bounded Lossy Compression for Unpredictable Data. We optimize (also with O(N) time complexity) the error-bounded lossy compression for compressing the unpredictable data, by elaborately analyzing the binary data representation based on desired precision and data value ranges

That's a little bit unspecific, but it is explained later in the arXiv article (paraphrasing here):

There are three key steps to process.

  • First, we map all of the unpredictable data to a smaller range by subtracting the median value of the range from every value (denoted by med, i.e., med = (min_i ρ[i] + max_i ρ[i]) / 2). The newly generated data after the transformation are called normalized data, and are closer to zero than the original values. This step is motivated by the fact that the closer a number is to zero, the fewer mantissa bits are required to meet the specified precision. [...]
  • Second, we truncate the value by disregarding the insignificant mantissa part based on the user-required error bounds. [...]
  • At the last step, we perform the leading-zero based floating-point compression method to further reduce the storage size. In particular, we perform the XOR operation for the consecutive normalized values and compress each by using a leading zero count followed by the remaining significant bits. This may significantly reduce the storage size due to the fact that the leading bits of the IEEE 754 binary representations for the similar values are expected to be very similar to each other. [...]

So basically, they also truncate, but after normalising first. In the last step they still try to profit somewhat from correlations in the data, despite this being the section about unpredictable data.
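
To make the first two steps concrete, here is a small sketch of the idea under the simplifying assumptions of float32 data and a fixed number of retained mantissa bits (this is not the actual SZ code):

```python
import numpy as np

def normalize_and_truncate(values, keep_mantissa_bits=10):
    """Shift values toward zero by subtracting the median of the range, then
    zero out the low-order mantissa bits of the float32 representation."""
    v = np.asarray(values, dtype=np.float32)
    med = (v.min() + v.max()) / np.float32(2)   # med = (min + max) / 2
    normalized = v - med
    bits = normalized.view(np.uint32)
    drop = 23 - keep_mantissa_bits              # float32 has 23 mantissa bits
    truncated = ((bits >> drop) << drop).astype(np.uint32)
    return truncated.view(np.float32), med
```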

OmnipotentEntity commented 6 years ago

Following up on the JSON idea, because based on what I said before it might not be entirely clear what I meant, here's a text format version of what the current save file might look like:

{
  "meta" : {
    "fp_size" : 16,
    "exp_bits" : 7,
    "value_head_not_stm" : false
  },

  "weights" : {
    "blocks" : 20,
    "filters" : 256,
    "res" : {
      "conv_weights" : [[],[],[]],
      "conv_biases" : [[],[],[]],
      "batchnorm_means" : [[],[],[]],
      "batchnorm_stddevs" : [[],[],[]]
    },

    "policy_head" : {
      "conv_pol_w" : [],
      "conv_pol_b" : [],
      "bn_pol_w1" : [],
      "bn_pol_w2" : [],
      "ip_pol_w" : [],
      "ip_pol_b" : [],
      "conv_val_w" : [],
      "conv_val_b" : [],
      "bn_val_w1" : [],
      "bn_val_w2" : [],
      "ip1_val_w" : [],
      "ip1_val_b" : [],
      "ip2_val_w" : [],
      "ip2_val_b" : []
    }
  }
}

It's then serialized to binary (with a proper header) to save, and on restore deserialized from binary. Any missing field can have a default value assigned by the engine, so if the format of the nets changes, for instance by preprocessing the batchnorm, it's possible to simply add a field indicating that in the "meta" area of the save file. This way decisions on this and perhaps other things aren't blocked on the save-file format.
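
A minimal sketch of that default-filling behaviour on load, assuming a CBOR encoding via the cbor2 package (the field names just mirror the example above, and the default values are illustrative):

```python
import cbor2  # assumption: any binary JSON-like encoder would work the same way

DEFAULT_META = {"fp_size": 32, "exp_bits": 8, "value_head_not_stm": False}

def load_weight_file(path):
    """Read the document and fall back to defaults for any missing meta field,
    so older files keep working when new fields are introduced."""
    with open(path, "rb") as f:
        doc = cbor2.load(f)
    meta = {**DEFAULT_META, **doc.get("meta", {})}  # missing keys -> defaults
    return meta, doc["weights"]
```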

Remi-Coulom commented 6 years ago

Hi Leela people,

If you don't mind the self-promotion, let me tell you about joedb. I use it in my generic AlphaZero system to store the training sets as well as the network weights: https://www.remi-coulom.fr/joedb/intro.html

Joedb can manipulate vectors of numbers efficiently and conveniently: https://www.remi-coulom.fr/joedb/vectors.html#allocating-and-using-vectors

It has convenient features for automatic schema upgrade, if you ever need to add more fields or tables, or even make any other arbitrary change: https://www.remi-coulom.fr/joedb/schema_upgrade.html

At the moment, it is C++ only, though. So it is not a good choice if you have to manipulate data in python. But if your program is C++, joedb is a really convenient and efficient way to store binary data.

Storage is efficient (that is to say, storing a large vector of floats takes as little disk space as a raw dump of the vector + a few more bytes for the file format), and C++ code to conveniently manipulate the data is automatically generated.

I considered adding half-precision support, but decided to wait for C++ to support half. For the moment, half-precision numbers can be stored as 16-bit integers.

I'd be glad to answer questions if you wish to know more.

roy7 commented 6 years ago

Hi @Remi-Coulom! Pleasure to see you chiming in. :) Thanks for the info on joedb. Hopefully the more technical types will take a look or give it a try. @gcp is traveling this week.

gcp commented 6 years ago

The advantages don't seem that relevant to us (i.e. see the first post), and the limitations are showstoppers (we need Python support, flexible binary/float representation, etc.).

To be honest, I'd take something like the JSON over it, if only because that's easily manipulated and well supported by literally everything. But remember that one of the main reasons to adopt a binary format is parsing and startup time. There are fast parsers for JSON, for sure, but it still means parsing text again... Not sure how good JSONB support is.

OmnipotentEntity commented 6 years ago

There are, of course, a few competing "standards" for binary encoded JSON-like data, and there's also the option of "rolling our own." Both are fairly easy to do, because JSON is by design easy to parse and represent in memory.

Here are some of the standards with a feature list.

| Standard | C++ implementation | Python implementation | User-defined types | Packed binary type(s) | Notes |
|----------|--------------------|-----------------------|--------------------|-----------------------|-------|
| BSON | Yes (multiple) | Yes (multiple) | Sorta | Yes | Used by MongoDB (impls gutted from MongoDB code, may be low quality) |
| BJSON | No | No | No | Yes | Just a website |
| UBJSON | Yes | Yes | No | Yes | Packed binary is just a uint8 array |
| MsgPack | Yes (multiple) | Yes (multiple) | Yes | Yes | Python impl is CPython |
| Smile | Yes | Yes | No | Yes | Seems quite old |
| CBOR | Yes (multiple) | Yes (multiple) | Yes (with a way to globally register new ones) | Yes | Native half-float support, has an RFC, Python impl is CPython |

Of these, MsgPack and CBOR seem to be the best suited.

marcocalignano commented 6 years ago

I was playing around with binary files. For these experiments I took the network 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166, with these sizes:

173M (180744218) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166
 50M (51883084) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.gz

The first thing I tried was to just dump the 4-byte floats into a binary file in IEEE 754 notation, with no separators. I had to add a count at the beginning of each layer telling how many floats there are. Of course this way we have no precision loss. We save almost half the file size, but of course gzip is better at compressing text files than binary files.

92M (96086120) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754
52M (53482477) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.ieee754.gz
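
For reference, a minimal sketch of that raw dump (the little-endian layout and per-layer count prefix are my reading of the description, not an agreed format):

```python
import struct
import numpy as np

def dump_layer(f, layer_weights):
    """Write one layer: a 32-bit count followed by raw IEEE 754 float32 values."""
    w = np.asarray(layer_weights, dtype="<f4")
    f.write(struct.pack("<I", w.size))  # how many floats follow
    f.write(w.tobytes())
```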

After this I applied something I have already used in the past: by analyzing the weight values in a layer we can find one factor per layer that maps the actual floating-point values to the range 0 to 65535. Then we can round each number to an integer (what I actually did was take the integer part) and save it in a 16-bit value. At the beginning of the layer, along with the count for the weights, we have to save ONE floating-point value: this factor. Of course this way we have some loss, but I built a Python script (so it is possible to do this in Python too!) to calculate the maximal error, and it is low (less than 0.001).

46M (48044128) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor
38M (39641997) 33986b7f9456660c0877b1fc9b310fc2d4e9ba6aa9cee5e5d242bd7b2fb1b166.factor.gz
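
One plausible reading of that per-layer factor scheme, sketched in Python (the exact mapping, and what gets stored besides the factor, is my guess):

```python
import numpy as np

def quantize_layer(weights):
    """Map a layer onto 0..65535 with a single scale factor (plus the layer
    minimum so the mapping can be inverted); lossy, error at most one step."""
    w = np.asarray(weights, dtype=np.float32)
    lo, hi = float(w.min()), float(w.max())
    factor = (hi - lo) / 65535.0 if hi > lo else 1.0
    codes = np.clip(np.floor((w - lo) / factor), 0, 65535).astype(np.uint16)
    return lo, factor, codes

def dequantize_layer(lo, factor, codes):
    return codes.astype(np.float32) * np.float32(factor) + np.float32(lo)
```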

I guess that to know whether the errors really are low we need to run a validation match between the two networks, original and compressed. I just wanted to know what you think of these experiments.

alreadydone commented 6 years ago

analyzing the weight values in a layer we can find one factor per layer that maps the actual floating-point values to the range 0 to 65535.

I think this is what we do to quantize the weights if we are to use 'INT16' inference.

OmnipotentEntity commented 6 years ago

Are there any objections to the CBOR format with a custom FP type that contains a user defined precision, and with the logical structure described upthread?

If not, then I'll slowly start working on this.

evdwerf commented 6 years ago

.... But remember that one of the main reasons to take a binary format is parsing and startup time.

To reduce startup time, perhaps just load the network in a separate thread? If the gtp connection doesn't block until the network is needed, users would probably perceive this as an improvement in startup time.

roy7 commented 6 years ago

Speaking of which, if the client didn't block until the network is needed, that'd allow time-left settings to be processed etc. before the genmove arrives. An issue I sometimes have with slow startup: if a player is using a time control like Fischer or byo-yomi with no main time and a short time period, it's possible for the bot to time out on the first move of the game, since the game clock starts and time_left is sent to the bot before the network has been loaded (the bot is started when the first move is needed). If startup takes 10 seconds, that's 10 seconds before 'time_left' reaches the bot, but by then the real clock is 10 seconds shorter and it might think too long.

If network loading is in a separate thread and commands like time_left can be processed immediately while network is loading, that'd make the bot's understanding of time left accurate, right? Could be a real improvement for bot operators. ;)

gcp commented 6 years ago

If startup takes 10 seconds, that's 10 seconds before 'time_left' reaches the bot, but by then the real clock is 10 seconds shorter and it might think too long.

The GUI should ping the bot (with sequenced GTP commands) to see if it is finished starting up, IMHO.

gcp commented 6 years ago

If network loading is in a separate thread and commands like time_left can be processed immediately while network is loading, that'd make the bot's understanding of time left accurate, right?

That's true, but I do really think you'll run into other problems unless you block until startup finishes. Like the bot forfeiting on time in very fast time controls.

users would probably perceive this as an improvement in startup time

But without a network loaded, you won't be able to interact much :-/ Maybe it's useful if you're going to send a sequence of play commands, I guess.

evdwerf commented 6 years ago

The typical sequence of GTP commands in GridMaster before starting a game is something like:

protocol_version list_commands name version boardsize ... clear_board komi ... kgs-rules time_settings

then after some time for the user to select the first move (or start the clock) we get:

play ...

and then finally where we need the network:

genmove w

So we have on the order of 10 GTP commands that could be answered without having the network (fully) loaded. I would already be happy if only the first four (protocol_version, list_commands, name, version) were answered instantaneously (GridMaster uses those to determine that the engine is alive and understands GTP -- this is also why on most older devices engine installs need the slow-timeouts setting), but it's only a minor inconvenience. The compressed networks are a big win.

gcp commented 6 years ago

That makes sense, yes. Doing the load in a separate thread and blocking on it finishing just before we use it probably isn't too complicated, but it's going to get messy, mostly due to the debug information that is printed out while the network is loaded and benchmarked.

evdwerf commented 5 years ago

I don't think it needs to get messy. Most of the initial gtp commands don't take noticeable time, so the program could just wait for a command that requires the network or the gtp input stream to go idle (and by then, if you really want to, you could still do it in a mode that blocks the main thread). Getting network loading and OpenCL tuning out of the initial startup phase would be a big win IMO.
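
A small sketch of that pattern, with names invented for illustration (this is not Leela's actual code or API):

```python
import threading

class LazyNetwork:
    """Start loading the weights in the background at startup and only block
    when a command that actually needs the network (e.g. genmove) arrives."""
    def __init__(self, path, loader):
        self._result = {}
        self._thread = threading.Thread(
            target=lambda: self._result.update(net=loader(path)), daemon=True)
        self._thread.start()

    def get(self):
        self._thread.join()   # cheap GTP commands never need to reach this point
        return self._result["net"]
```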

OmnipotentEntity commented 5 years ago

Long overdue update. I stalled while looking for a good, modern, and fully featured JSON library to bring into the project. So I took some time and am making a contribution to nlohmann/json to add CBOR binary field support to the library. You can follow that here: https://github.com/nlohmann/json/pull/1662

Once that is complete, I'll begin working on this a bit more earnestly.