quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License
406 stars 53 forks source link

Create cache file(s) for codec info #176

Open nemequ opened 8 years ago

nemequ commented 8 years ago

Every time the library is initialized it scans 1 or more directory looking for plugins, then parses the ini file for every one it finds. Usually this isn't really an issue; the library is only initialized once per process, and it's a fairly quick operation (I haven't benchmarked it, but my guess would be maybe 25 ms). Typically it is dwarfed by the time required for even a relatively small file to be compressed. Sometimes, though, it can be a problem. For example, it hurts throughput when trying to use the squash CLI for fuzzing with AFL.

It would be possible to create a cache file which contains all the data for all the plugins and codecs in a directory, which could be used to avoid scanning all the plugins (assuming the mtime on the cache is after the plugin dir). Three possibilities come to mind:

  1. Use an INI file. This would save us from having to scan all those files, and instead give us one (relatively) big one. We already have an INI parser, the format could be basically the same as the per-plugin squash.ini files. Definitely the easy way.
  2. Use a library for on-disk storage (like lmdb, leveldb, bdb, etc.) and rewrite SquashContext/SquashPlugin/SquashCodec to just look up whatever they need on-demand. Probably faster start-up time, but more expensive per-operation.
  3. Create a format we can mmap. SquashPlugin and SquashCodec could just be pointers into the data. This could be very fast, but a big PITA to implement. It could also be complicated once we start supporting multiple contexts and plugin filtering.

I'm leaning towards 1, at least initially. I think the performance would be acceptable, and if not we could always look at 2 or 3 down the road.

One interesting side-effect is it should make getting Squash to work with emscripten easier, since we can't really walk through a directory tree over HTTP. 2 might be doable (depending on the library) with emscripten, but 3 would probably be quite complicated. Obviously this would no longer be a cache with emscripten, but the way to specify where to find plugins.

hyc commented 8 years ago

If you use LMDB for (2) you can get (3) at the same time, painlessly.

nemequ commented 8 years ago

If you use LMDB for (2) you can get (3) at the same time, painlessly.

Kind of. It's memory mapped, but that's not really the interesting part of (3). If we went with a custom format, we could have everything indexed by offset instead of using a key. For example, something like

typedef struct {
  unsigned int offset;
  unsigned int plugin;
  unsigned int priority;
} SquashCodec;

SquashPlugin* squash_codec_get_plugin (SquashCodec* codec) {
  return (SquashPlugin*) ((((void*) codec) - codec->offset) + codec->plugin);
}

const char* squash_codec_get_name (SquashCodec* codec) {
  return (const char*) (((void*) codec) + sizeof(SquashCodec));
}

Not only is there almost no parsing involved, we wouldn't even have to allocate storage for objects. It doesn't even really matter that the file can be mmaped; they are small enough that we can just read the whole thing and keep it in-memory at all times. LMDB is awesome, but since we don't have to worry about writes (we can just regenerate the whole cache when something changes) I think a custom format could be much more efficient.

The comparison of (1) vs (2) is more interesting, because (2) wouldn't require significantly more effort… it may even be easier. However, adding a dependency can be problematic, especially on Windows where we probably will not be able to use a system copy of a shared library. Including a 3rd party library in the code would probably more than double the size, and I'm already a bit worried about the size of the library.

For LMDB, I'm also a bit worried about portability. I can't find anything about Windows or emscripten support… does LMDB work there? Also, if we include LMDB instead of linking to a system copy (Windows…), it would subject users to the requirements of the LMDB's BSD-style license (i.e., putting a copy of the license in the documentation), which may be a bit confusing.

hyc commented 8 years ago

Yes, LMDB works on Windows. And pretty much everything else too. https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database