Hello,
apologies for the late reply, I was busy on other projects for the past months.
This is one of the hard problems, heh. I'm not exactly happy about unconditionally expanding the fields to 32 bits, as that would significantly increase the size of the search data for projects that don't need that many symbols (for example, with Magnum I'm at over 1.3 MB with 12k symbols, and even with that the initial load times are starting to become problematic). One option could be to support both 16- and 32-bit sizes in a single format, depending on how much is needed.
How many symbols do you have in your case, what's the size of the generated `searchdata-v0.js` (or `.bin`) file, and how big is it gzip-compressed? Because I fear that even if the search supports that, the download size would be too big to be practical and you'd need to start looking for different solutions anyway.
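A quick way to get both numbers (an illustrative Python snippet, not part of m.css; the filename is whichever of the two your setup generates):

```python
# Compare the raw and gzip-compressed size of the generated search data file.
import gzip
import os

path = 'searchdata-v0.bin'  # or 'searchdata-v0.js', depending on the setup

raw_size = os.path.getsize(path)
with open(path, 'rb') as f:
    compressed_size = len(gzip.compress(f.read()))

print(f'{raw_size} bytes raw, {compressed_size} bytes gzip-compressed')
```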
Can we use variable-length integer encoding, like https://developers.google.com/protocol-buffers/docs/encoding#varints?
I will investigate the sizes; I have to fish that project back out of whatever hole I left it in. Still very interested in getting this to work, though.
Making JavaScript code robust is hard, so I'm trying to have as little JS as possible :) Variable-length encoding is possible, but the client side would need to unpack it first, which could offset the savings from faster downloads. I was personally thinking about pre-processing the data with, for example, a Burrows-Wheeler transform, but there's again the problem of having to decode it back on the client side.
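For reference, the encoding from the linked protobuf page is a base-128 varint; a minimal Python sketch (purely illustrative, nothing m.css actually ships) looks like this, and the decoder is what the client side would have to run before it can seek into the data:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, offset: int = 0):
    """Decode a varint starting at offset, returning (value, next offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

assert encode_varint(300) == b'\xac\x02'   # the example from the protobuf docs
assert decode_varint(b'\xac\x02') == (300, 2)
```

Small IDs then take one byte instead of two or four, but cheap random access into the packed arrays is lost, since the fields have to be decoded sequentially.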
I'd bet more on server-side compression, which is transparent to the client; gzip already does quite a good job (the 1.3 MB file gets compressed to about 700 kB by gzip on transfer) and more advanced compression schemes (Brotli, zstd) could do even better.
Curious: is the project public?
I want to generate documentation for radare2. It is quite large, but the current documentation "solution" is sub-par.
Additionally: if it's not a lot of up-front work, could you help correct (or rewrite) the patch to use 32 bits, so I can just give this a go and see if it is even worth the time?
Done in the `search-32` branch, with all needed changes in 73564dfead18fa516780b9c2424430af6787ca69. The 100% test coverage fortunately makes such a change rather easy :)
The resulting file size is about 10% larger; not as bad as I expected, but I'm also not entirely happy about it. I'll keep this in a branch until I get an idea of how to solve it differently.
Just got a chance to try it out, it's snappy and looks beautiful. Thank you for your work!
The search reports: 70733 symbols (9624.4 kB)
You can view it here temporarily (commit is HEAD): http://sdf.org/~keegan/radare2-89cfe05d2/index.html
Nice :) The search data download took a few more seconds than would be acceptable for me. You're just above the 65k limit, so here are a couple of ideas that could possibly trim down the search data size:

- many of the results are just the `cb` field (there's like 10 of them) instead of the name of the struct containing it, so excluding members of simple structs could be an option (tangentially, having m.css support the `INLINE_SIMPLE_STRUCTS` Doxygen option would be great, but I ran into quite severe issues with the XML output when trying that out last time -- see #83 for reference)

I will note that there is no gzip compression enabled on that page. Aside from that, this is good advice that I will investigate... does this exist in the FAQ or Quick Start Guide? I imagine others would find it useful.
The first point needs changes on my side (there's nothing that could filter those out currently), and for the other two -- I don't recommend enabling `M_SHOW_UNDOCUMENTED`, and I think that's pointed out a few times there :)
Yeah, I understand the point about `M_SHOW_UNDOCUMENTED`, but this is an effort to improve documentation in the radare2 project, as well as to provide a reference for developers wishing to integrate, so as of right now disabling it is not an option, unfortunately.
I will continue to explore the first two.
(Sorry for embarrassingly late replies, finally got a chance to get back to this project.)
This is finally fixed as of b0cf44e4ddbf42ce79a8612563e84e00e8a75808 and 0411e1812f1274bf6e49404c0dcdf649f7637076. I ended up making the type sizes variable, so small projects still keep small search data binaries while larger projects can expand the limits where needed. The default limits of 65k results and 16 MB files can now be raised up to 4G for both. It unfortunately isn't automagic, as the data packing defaults are picked to keep file sizes implicitly small and estimating the sizes beforehand would be overly error-prone. Instead, if the limits are hit, the doc generation aborts with a message suggesting to modify various configuration options, such as:
OverflowError: Trie result ID too large to store in 16 bits, set SEARCH_RESULT_ID_BYTES = 3 in your conf.py.
Add the suggested options to `conf.py` (or to the `Doxyfile-mcss` with an `M_` prefix) and restart the doc generation.
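For example, with the message above the change would be just this (the exact option name and value come from the error message itself; `SEARCH_RESULT_ID_BYTES = 3` is the one from this thread):

```python
# conf.py
SEARCH_RESULT_ID_BYTES = 3

# or, equivalently, in Doxyfile-mcss (same option, M_ prefix):
# M_SEARCH_RESULT_ID_BYTES = 3
```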
With this implemented, I'm going to drop the `search-32` branch, as it's not needed for anything anymore.
When running this tool on a large C codebase, I was experiencing errors with values out of bounds for the size of a short, so I modified `_search.py` to create a table with larger bounds. It seems that the changes required for `search.js` are somewhat less trivial, since the code is a little more all over the place (hardcoded buffer seek offsets, etc.). I am wondering if this is the correct way to proceed with this issue, and if so, what changes are required to synchronize the search with this expanded data format?
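For illustration only -- this is not the actual patch nor m.css's real serialization code -- the gist of the `_search.py` side is widening the packed integer fields, while `search.js` then has to read the wider fields at the matching offsets:

```python
import struct

# Hypothetical packing of one search result entry; the field names and layout
# are made up for the example, only the width change is the point.
def pack_result_16bit(result_id: int, file_offset: int) -> bytes:
    # original-style packing: unsigned shorts, each field capped at 65535
    return struct.pack('<HH', result_id, file_offset)

def pack_result_32bit(result_id: int, file_offset: int) -> bytes:
    # widened packing: unsigned 32-bit ints, limits raised at the cost of size
    return struct.pack('<II', result_id, file_offset)
```

Any such change doubles the per-entry size for every project and shifts all the hardcoded seek offsets in `search.js`, which is why the final fix made the field sizes configurable rather than unconditionally 32-bit.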