Hello,
apologies for the late reply, I was busy on other projects for the past months.
This is one of the hard problems, heh. I'm not exactly happy about unconditionally expanding the fields to 32 bits, as that would significantly increase the size of the search data for projects that don't need that many symbols (for example, with Magnum I'm at over 1.3 MB with 12k symbols, and even with that the initial load times are starting to become problematic). One option could be to support both 16- and 32-bit sizes in a single format, depending on how much is needed.
How many symbols do you have in your case, what's the size of the generated `searchdata-v0.js` (or `.bin`) file, and how big is it gzip-compressed? Because I fear that even if the search supports that, the download size would be too big to be practical and you'd need to start looking for different solutions anyway.
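A quick way to get both numbers (an illustrative Python snippet, not part of m.css; the filename is whichever of the two your setup generates):

```python
# Compare the raw and gzip-compressed size of the generated search data file.
import gzip
import os

path = 'searchdata-v0.bin'  # or 'searchdata-v0.js', depending on the setup

raw_size = os.path.getsize(path)
with open(path, 'rb') as f:
    compressed_size = len(gzip.compress(f.read()))

print(f'{raw_size} bytes raw, {compressed_size} bytes gzip-compressed')
```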
Can we use variable-length integer encoding, like https://developers.google.com/protocol-buffers/docs/encoding#varints?
I will investigate the sizes; I have to fish that project back out of whatever hole I left it in. Still very interested in getting this to work, though.
Making JavaScript code robust is hard, so I'm trying to have as little JS as possible :) Variable-length encoding is possible, but the client side would need to unpack it first, which could offset the savings from faster downloads. I was personally thinking about pre-processing the data with, for example, a Burrows-Wheeler transform, but there's again the problem of having to decode it back on the client side.
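For reference, the encoding from the linked protobuf page is a base-128 varint; a minimal Python sketch (purely illustrative, nothing m.css actually ships) looks like this, and the decoder is what the client side would have to run before it can seek into the data:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, offset: int = 0):
    """Decode a varint starting at offset, returning (value, next offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

assert encode_varint(300) == b'\xac\x02'   # the example from the protobuf docs
assert decode_varint(b'\xac\x02') == (300, 2)
```

Small IDs then take one byte instead of two or four, but cheap random access into the packed arrays is lost, since the fields have to be decoded sequentially.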
I'd bet more on server-side compression, which is transparent to the client; gzip already does quite a good job (the 1.3 MB file gets compressed to about 700 kB by gzip on transfer) and more advanced compression schemes (Brotli, zstd) could do even better.
Curious: is the project public?
I want to generate documentation for radare2. It is quite large, but the current documentation "solution" is sub-par.
Additionally: if it's not a lot of up-front work, could you help correct (or rewrite) the patch to use 32 bits, so I can just give this a go and see if it is even worth the time?
Done in the `search-32` branch, with all needed changes in 73564dfead18fa516780b9c2424430af6787ca69. The 100% test coverage fortunately makes such a change rather easy :)
The resulting file size is about 10% larger; not as bad as I expected, but I'm also not entirely happy about it. I'll keep this in a branch until I get an idea of how to solve it differently.
Just got a chance to try it out, it's snappy and looks beautiful. Thank you for your work!
The search reports: 70733 symbols (9624.4 kB)
You can view it here temporarily (commit is HEAD): http://sdf.org/~keegan/radare2-89cfe05d2/index.html
Nice :) The search data download took a few more seconds than would be acceptable for me. You're just above the 65k limit, so here are a couple of ideas that could possibly trim down the search data size:

- many of the results are just the `cb` field (there's like 10 of them) instead of the name of the struct containing it, so excluding members of simple structs could be an option (tangentially, having m.css support the `INLINE_SIMPLE_STRUCTS` Doxygen option would be great, but I ran into quite severe issues with the XML output when trying that out last time -- see #83 for reference)

I will note that there is no gzip compression enabled on that page. Aside from that, this is good advice that I will investigate... does this exist in the FAQ or Quick Start Guide? I imagine others would find it useful.
The first point needs changes on my side (there's nothing that could filter those out currently), and for the other two -- I don't recommend enabling `M_SHOW_UNDOCUMENTED`, and I think that's pointed out a few times there :)
Yeah, I understand the point about `M_SHOW_UNDOCUMENTED`, but this is an effort to improve documentation in the radare2 project, as well as to provide a reference for developers wishing to integrate, so as of right now disabling it is not an option, unfortunately.
I will continue to explore the first two.
(Sorry for embarrassingly late replies, finally got a chance to get back to this project.)
This is finally fixed as of b0cf44e4ddbf42ce79a8612563e84e00e8a75808 and 0411e1812f1274bf6e49404c0dcdf649f7637076. I ended up making the type sizes variable, so small projects still keep small search data binaries while larger projects can expand the limits where needed. The default limits of 65k results and 16 MB files can now be raised up to 4G for both. It unfortunately isn't automagic, as the data packing defaults are picked to keep file sizes implicitly small and estimating the sizes beforehand would be overly error-prone. Instead, if the limits are hit, the doc generation aborts with a message suggesting to modify various configuration options, such as:
OverflowError: Trie result ID too large to store in 16 bits, set SEARCH_RESULT_ID_BYTES = 3 in your conf.py.
Add the suggested options to `conf.py` (or to the `Doxyfile-mcss` with an `M_` prefix) and restart the doc generation.
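For example, with the message above the change would be just this (the exact option name and value come from the error message itself; `SEARCH_RESULT_ID_BYTES = 3` is the one from this thread):

```python
# conf.py
SEARCH_RESULT_ID_BYTES = 3

# or, equivalently, in Doxyfile-mcss (same option, M_ prefix):
# M_SEARCH_RESULT_ID_BYTES = 3
```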
With this implemented, I'm going to drop the `search-32` branch, as it's not needed for anything anymore.
When running this tool on a large C codebase, I was experiencing errors with values out of bounds for the size of a short, so I modified `_search.py` to create a table with larger bounds. It seems that the changes required for `search.js` are somewhat less trivial, since the code is a little more all over the place (hardcoded buffer seek offsets, etc.). I am wondering if this is the correct way to proceed with this issue, and if so, what changes are required to synchronize the search with this expanded data format?
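For illustration only -- this is not the actual patch nor m.css's real serialization code -- the gist of the `_search.py` side is widening the packed integer fields, while `search.js` then has to read the wider fields at the matching offsets:

```python
import struct

# Hypothetical packing of one search result entry; the field names and layout
# are made up for the example, only the width change is the point.
def pack_result_16bit(result_id: int, file_offset: int) -> bytes:
    # original-style packing: unsigned shorts, each field capped at 65535
    return struct.pack('<HH', result_id, file_offset)

def pack_result_32bit(result_id: int, file_offset: int) -> bytes:
    # widened packing: unsigned 32-bit ints, limits raised at the cost of size
    return struct.pack('<II', result_id, file_offset)
```

Any such change doubles the per-entry size for every project and shifts all the hardcoded seek offsets in `search.js`, which is why the final fix made the field sizes configurable rather than unconditionally 32-bit.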