Setting to only cache recently requested icons in memory

markus-li commented 2 years ago

Looking at running the Iconify API on an embedded platform with limited RAM. For this particular use case it would be great to be able to have caching only done for icons "recently" accessed. The goal is to provide the full set of icons, but knowing that only a few hundred will ever be used on each device, the memory footprint of loading all of them is probably not the best. The current solution has been to use the PHP version of the API, but since that version doesn't look like it's being maintained and everything else on this embedded platform is node.js based, it would be nice to be able to use the node.js based API. Is this a feature you have planned for the API server? If not, would you accept a pull-request for something like that? Before spending the time developing this I wanted to check for interest in pull-requests since I don't want to maintain a separate fork.

cyberalien commented 2 years ago

Yes, it is planned for new version.

A while ago I had to raise memory for API servers to 2 gb because of increasing number of icons (servers were running out of memory when reloading icon sets), so wanted to implement this as well.

markus-li commented 2 years ago

That's great news for me! Sounds like the best option is to sit tight and try your new version. Is it available to test? I'd be happy to help out with testing and feedback as well as code if there's a need.

cyberalien commented 2 years ago

If everything goes according to plan, I'll publish first test version in about a week. It won't include search engine yet, but it will include all functions old version has.

Though things rarely go according to plan.

markus-li commented 2 years ago

Sounds great! Looking forward to it, hopefully things go according to plan.

cyberalien commented 2 years ago

Well... as expected, things didn't go according to plan.

While implementing data throttling to not store everything in memory, I ran into a big issue: reloading data on demand. It works great for almost all icon sets, but there are a few icon sets that have several megabytes of data; the biggest one is Fluent Emoji at 97mb. Loading is not an issue, it is done very quickly and asynchronously, but then the data needs to be parsed. Running JSON.parse() on a 97mb file takes on average 230ms on my computer in Node.js (140ms in Bun!... really looking forward to Bun being stable), so on an actual server it will be even slower. Another related issue is that a few big icon sets use more memory than all other icon sets combined, so if they are all used, almost all memory is used anyway, making the whole idea pointless. Which means this needs a different approach.
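To get a feel for the parsing cost mentioned above, here is a minimal sketch that times JSON.parse() on a synthetic payload. The `makeFakeIconSet` and `timeParse` helpers and the fake icon data are assumptions made up for illustration; the numbers will differ from the 97mb Fluent Emoji case and from machine to machine.

```typescript
// Build a synthetic JSON string with `count` fake icons (not a real icon set)
function makeFakeIconSet(count: number): string {
  const icons: Record<string, { body: string }> = {};
  for (let i = 0; i < count; i++) {
    icons[`icon-${i}`] = { body: `<path d="M${i} 0h24v24H0z"/>` };
  }
  return JSON.stringify({ prefix: 'fake', icons });
}

// Time a single JSON.parse() call, in milliseconds
function timeParse(json: string): number {
  const start = Date.now();
  JSON.parse(json);
  return Date.now() - start;
}

// Usage: generate a payload and report how long one parse takes
const fakeJson = makeFakeIconSet(50000);
const elapsedMs = timeParse(fakeJson);
console.log(`${(fakeJson.length / 1e6).toFixed(1)} million characters parsed in ${elapsedMs} ms`);
```

Increasing the icon count should show parse time growing with payload size, which is the cost that on-demand reloading of a very large icon set would pay on every reload.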

My solution is to split icon set after first load into smaller chunks. For example, for Fluent Emoji it would mean bunch of 1-2mb files with different icons, so when icons are requested, only few small files are loaded. Only yesterday I've finished implementing it, it works great, but it was an unexpected delay.
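A minimal sketch of the chunking idea described above, assuming a simplified icon set shape and using string length as a rough size estimate. `IconSetData`, `splitIconSet` and the size limit are hypothetical names for illustration, not the API's actual code.

```typescript
// Hypothetical, simplified icon set shape; real icon sets carry more fields
interface IconSetData {
  prefix: string;
  icons: Record<string, { body: string }>;
}

// Split an icon set into chunks whose estimated size stays within `maxSize`.
// String length is used as a rough stand-in for size on disk.
function splitIconSet(data: IconSetData, maxSize: number): IconSetData[] {
  const chunks: IconSetData[] = [];
  let current: IconSetData = { prefix: data.prefix, icons: {} };
  let currentSize = 0;

  for (const name of Object.keys(data.icons).sort()) {
    const icon = data.icons[name];
    const size = name.length + JSON.stringify(icon).length;
    // Start a new chunk when adding this icon would exceed the limit
    if (currentSize > 0 && currentSize + size > maxSize) {
      chunks.push(current);
      current = { prefix: data.prefix, icons: {} };
      currentSize = 0;
    }
    current.icons[name] = icon;
    currentSize += size;
  }
  if (currentSize > 0) {
    chunks.push(current);
  }
  return chunks;
}

// Usage: three small icons, with a limit low enough to force more than one chunk
const demoSet: IconSetData = {
  prefix: 'demo',
  icons: {
    a: { body: '<path d="M0 0h24v24H0z"/>' },
    b: { body: '<path d="M1 1h22v22H1z"/>' },
    c: { body: '<path d="M2 2h20v20H2z"/>' },
  },
};
const demoChunks = splitIconSet(demoSet, 80);
console.log(demoChunks.map((chunk) => Object.keys(chunk.icons)));
```

Sorting icon names before splitting keeps chunk contents stable between runs, so a lookup table from icon name to chunk can be rebuilt the same way each time.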

So it will take a bit longer.

markus-li commented 2 years ago

Such is life and development. I had other development get in the way for the small pull-request I was to send your way regarding listening IP. Should get to it this weekend.

Those are some truly large icon sets, that is for sure. Glad you found a solution with chunking that works. Will it be possible to have the smaller chunks pre-processed and part of the icon-set repo with just a metadata json containing the information needed for searching and finding the right chunk and the checksum of such chunk? That way if only a small piece of an icon set changes it only needs to reload chunks with changed checksums.

I also wanted to ask you, are you implementing expiration from the cache in order to let icons not requested for x amount of time be unloaded from memory?

Looking forward to when you get the first draft pushed!

cyberalien commented 2 years ago

I was hoping to finish it by ViteConf, hoping news of proper self hosted Iconify could make some noise there, but due to various reasons it is not ready yet. I've made good progress, but the API is not usable yet.

It will take a bit longer. Sorry.

markus-li commented 2 years ago

That would have been nice, too bad it wasn't possible. I'm currently myself experiencing some of this with another project, it is what it is. It's the end-result that matters.

cyberalien commented 2 years ago

Finally it is ready: https://github.com/iconify/api.js/tree/dev3

It is not finished. Currently supports:

Sending icon data
Generating SVG

... so same things as old API

Not implemented yet:

Search engine
Automatically keep icon sets up to date. It is implemented in code, but cannot be triggered remotely yet
Documentation
Error logging (currently logs errors to console, plan is to add emailer functionality to automatically send logs)

What's different from old version:

Before app is run it needs to be compiled. Run pnpm install && pnpm build to install dependencies and build it
It does not use icon sets package as dependency. Instead, it downloads package from sources. See below.
Efficient memory management. With default configuration, in my tests memory usage is around 40mb with 100 chunks of data in memory. For comparison, old version with everything in memory uses about 500mb. Data is stored in cache on load, loaded from cache as needed, purged when not used.
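The "loaded from cache as needed, purged when not used" behaviour can be pictured with a small sketch. This is an illustration under assumptions, not the API's actual implementation: `ChunkStore`, its `load` callback (standing in for reading a chunk from the on-disk cache) and the idle timeout are all hypothetical.

```typescript
// Hypothetical in-memory store: loads chunks on demand and purges the ones
// that have not been requested for longer than `maxIdleMs`.
class ChunkStore<T> {
  private items = new Map<string, { data: T; lastUsed: number }>();

  constructor(
    private load: (key: string) => T, // e.g. read and parse a cached chunk file
    private maxIdleMs: number
  ) {}

  // Return a chunk, loading it first if it is not in memory
  get(key: string, now: number = Date.now()): T {
    let item = this.items.get(key);
    if (!item) {
      item = { data: this.load(key), lastUsed: now };
      this.items.set(key, item);
    }
    item.lastUsed = now;
    return item.data;
  }

  // Drop chunks that were not requested recently; returns how many were purged
  purge(now: number = Date.now()): number {
    let purged = 0;
    this.items.forEach((item, key) => {
      if (now - item.lastUsed > this.maxIdleMs) {
        this.items.delete(key);
        purged++;
      }
    });
    return purged;
  }

  // Number of chunks currently held in memory
  count(): number {
    return this.items.size;
  }
}

// Usage with explicit timestamps (milliseconds) to keep the example deterministic
let loadCalls = 0;
const chunkStore = new ChunkStore<string>((key) => {
  loadCalls++;
  return `data for ${key}`;
}, 1000);
chunkStore.get('chunk-1', 0);
chunkStore.get('chunk-2', 0);
chunkStore.get('chunk-1', 900); // served from memory, refreshes the timestamp
const purgedCount = chunkStore.purge(1500); // only chunk-2 has been idle too long
```

A purged chunk is simply loaded again on the next request, so the trade-off is an occasional extra read in exchange for a bounded memory footprint.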

Biggest feature is support for many icon set sources. It can download and automatically keep up to date icon sets from NPM packages, GitHub/GitLab (using their APIs), any git repository (using git client) or locally stored icon sets. Plan is to add import from Figma documents, ability to clean up and serve custom SVGs. All functionality is there, just not plugged in where it needs to be.

New version doesn't use separate config files like old version did. Instead, there are 2 ways to change configuration:

Edit config in source. See files in src/config/
Use environment variables. Can be used only for main config, like port.

cyberalien commented 2 years ago

Today I've been testing it with real life data, found and fixed few bugs.

I've added logger to one of API servers this morning, which logs all requests (only URLs, no other data) for 200 seconds and tested those queries (except for search queries that aren't supported yet) on both old and new API processes running on my dev computer. Performance difference is nice:

Loaded from http://localhost:3100 9042 urls in 5.994 seconds
Loaded from http://localhost:3000 9042 urls in 4.51 seconds
Loaded from http://localhost:3100 9042 urls in 5.324 seconds
Loaded from http://localhost:3000 9042 urls in 4.184 seconds
Loaded from http://localhost:3100 9042 urls in 5.349 seconds
Loaded from http://localhost:3000 9042 urls in 4.72 seconds

Odd lines (http://localhost:3100) are for the old API, even lines (port 3000) are for the new API. Memory usage difference was also huge, with the new API using 100mb at most.
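For a quick comparison of the timings above, the following sketch averages the three runs for each port and derives the relative difference. Only the six timings and the URL count come from the log; everything else is plain arithmetic. On these numbers the new API averages roughly 4.5 seconds against roughly 5.6 seconds, about a fifth less time.

```typescript
// Timings in seconds, copied from the log above (9042 URLs per run)
const oldApiRuns = [5.994, 5.324, 5.349]; // port 3100, old API
const newApiRuns = [4.51, 4.184, 4.72]; // port 3000, new API
const urlCount = 9042;

// Arithmetic mean of a list of numbers
const mean = (values: number[]): number =>
  values.reduce((sum, value) => sum + value, 0) / values.length;

const oldMean = mean(oldApiRuns);
const newMean = mean(newApiRuns);

// Fraction of time saved by the new API, and requests per second for each
const timeSaved = 1 - newMean / oldMean;
const oldRate = urlCount / oldMean;
const newRate = urlCount / newMean;

console.log(`old: ${oldMean.toFixed(2)}s, new: ${newMean.toFixed(2)}s`);
console.log(`time saved: ${(timeSaved * 100).toFixed(1)}%`);
console.log(`requests/s: old ${oldRate.toFixed(0)}, new ${newRate.toFixed(0)}`);
```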

So I think API is usable now.

markus-li commented 2 years ago

This all looks great, thank you for the hard work! I will build a container and test it in our embedded system (ARM) soon. Really looking forward to using this instead of the PHP version we had to go with due to memory constraints.

From what I can see the only way to change the icon-repository used is by editing the source code? We only want to use icons which don't have any of these licenses:

CC BY-NC 4.0
GPL
GPL 2.0
GPL 3.0

For this purpose we have a repo where they are filtered out, still need to add automatic updating to keep it in sync, but this is how it looks right at the moment: https://github.com/Oh-La-LABS/iconify-icon-sets

It'd be great if this was either in a source-file which only contained this setting (so as to not have to worry about auto-replacing it causing problems), or with an environment variable.

EDIT: Or is that what "src/config/importers/full-package.ts" is meant to be, is this file safe to assume it will not change what it imports and exports?

cyberalien commented 2 years ago

Yes, due to big choice and complexity of various import options, it can be changed only in source code and all src/config/* files are meant to be editable.

Filter for licenses is actually doable! I've added second parameter to filter option for full importer, which has access to icon set info, so you can filter by info. All icon sets have license.spdx, even though it is marked as optional, so you can use that to filter licenses.

Example filter callback for src/config/importers/full-package.ts (after last commit that adds second parameter):

filter: (prefix, info) => {
    const spdx = info.license.spdx;
    if (!spdx) {
        return false;
    }
    return spdx.slice(0, 3) === 'GPL' || spdx === 'CC-BY-NC-4.0';
},

Just for reference, these are all currently used licenses (in spdx property), in case you want to filter something else:

Apache-2.0
MIT
CC-BY-4.0
CC0-1.0
ISC
Unlicense
CC-BY-SA-4.0
OFL-1.1
GPL-2.0-only
GPL-3.0
CC-BY-SA-3.0
CC-BY-3.0
GPL-2.0-or-later
GPL-3.0-or-later
CC-BY-NC-4.0
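As a sketch of how the SPDX values listed above could be used to exclude the GPL and CC BY-NC icon sets, here is a variant of the filter callback. It rests on an assumption that the thread does not confirm: that returning true from the callback keeps an icon set and returning false skips it. The callback above returns true for GPL and CC-BY-NC-4.0, so if the real semantics are the other way around, invert the result. `isAllowedLicense`, `IconSetInfoLike` and `licenseFilter` are hypothetical names.

```typescript
// SPDX identifiers to exclude in addition to anything starting with "GPL"
const blockedLicenses = ['CC-BY-NC-4.0'];

// Hypothetical helper: true if an icon set with this license may be served
function isAllowedLicense(spdx: string | undefined): boolean {
  if (!spdx) {
    // No license information: treat as not allowed
    return false;
  }
  // Covers GPL-2.0-only, GPL-3.0, GPL-2.0-or-later and GPL-3.0-or-later
  if (spdx.slice(0, 3) === 'GPL') {
    return false;
  }
  return blockedLicenses.indexOf(spdx) === -1;
}

// Minimal stand-in for the icon set info passed as the second parameter
interface IconSetInfoLike {
  license: { spdx?: string };
}

// ASSUMPTION: returning true keeps the icon set, returning false skips it
const licenseFilter = (_prefix: string, info: IconSetInfoLike): boolean =>
  isAllowedLicense(info.license.spdx);

// Usage: check a few of the licenses from the list above
console.log(licenseFilter('example-a', { license: { spdx: 'MIT' } }));
console.log(licenseFilter('example-b', { license: { spdx: 'GPL-3.0-or-later' } }));
```

Keeping the license check in its own function makes it easy to test against the full SPDX list before wiring it into the importer config.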

You can see which icon sets were loaded by loading /collections?hidden=1&pretty=1

If you are adding container config file for something like Docker, it would be awesome to have it here as well and I'd appreciate pull request.

markus-li commented 2 years ago

Thank you for that very detailed answer. Once I've set up and tested a good Docker container file I'll push it. I only run them with Podman in production, but do test-builds with Docker internally. Since the destination is Podman it will be meant to run in rootless mode. Since you don't specify any engine requirements in package.json, what's the NodeJS version you use in production? 16?

cyberalien commented 2 years ago

Yes, I'm using Node 16 and good point about adding it to package.json.

I've also attempted to use Bun, but because it doesn't support child_process yet and workaround is far from trivial, it will take a while to support it.

markus-li commented 2 years ago

It would certainly be cool to see how well Bun could perform, but given the complications it looks to bring, adding more power is certainly easier. I sent you the pull request for Docker; no docker-compose is included since I don't use that in our environment.

cyberalien commented 2 years ago

Thanks! Merged.

Bun definitely performs much faster. Even those speed tests for fetching 9k urls I've posted above, when run in Bun, are about 1.6s for old API, 1s for new API. In Node it's 5 times slower.

markus-li commented 2 years ago

Happy I can help with at least something small! That is a serious amount of improvement, I do suppose that for specific workloads like this one it can be worth it.

Are you planning to provide auto-built containers in a repo?

cyberalien commented 2 years ago

Not sure, because I don't have any experience building containers.

Currently API servers are just plain vps with node process running behind nginx proxy, which takes about 10 minutes to configure, then they run for years with minor maintenance once in a while. Icon set updates are automatically pulled from GitHub without rebuilding anything. So I didn't really need to use containers and don't have any experience building them. In development I'm sometimes using docker, but without building anything myself, so no experience building containers.

markus-li commented 2 years ago

I see, I use containers internally, but not with Docker, and only auto-pushed to internal servers, not to public container repositories. The workflow would probably be a bit different than what I'm doing. Maybe someone else reading this has the right experience?

markus-li commented 2 years ago

Have been running this on our arm64v8 embedded platform and it runs perfectly. The only "issue" is that it takes quite some time before the CPU usage goes down to 0 during no load after first starting the service. Understandably it peaks really high at first, it then goes down to about 10% for a minute or so, then slowly down towards 1.6% within a few minutes. It takes 30+ minutes before it is down at basically 0. None of this is something I would call an issue, just wanted to mention it in case it is of interest. Idle memory usage is amazing and tends to go down to near 0.

cyberalien commented 2 years ago

Unfortunately I can't replicate that.

My guess is it is some weird garbage collector behaviour. When icon sets are loaded, it is done in queue, not simultaneously, so only 1 icon set's full data is being processed (loaded, split into chunks, written to cache, then data is purged by deleting object that references it). However, garbage collector works in mysterious ways, purging data when it wants to, not when variable is no longer referenced.

I've just added options to disable icon lists (implemented) and search engine (not implemented yet). So if API is used only to serve icon data to components, these options can be disabled to save more memory. Icons list doesn't use much memory, but search engine data will. In my tests it takes about 20mb only for keywords map for all 140k+ icons, with more advanced search that I want to add it would probably end up being something like 50mb, so those options will save a bit of memory.

markus-li commented 2 years ago

Could very well be, these systems run multiple services, so even though iconify may be idle, the rest is not. It makes it much harder to provide "untainted" data. But, as I said, nothing concerning about any of this. It is a very well-behaved nodejs app. Really love what you've done with this!

We will be using the search-functionality on these embedded systems when it's available, but the time for when searches are used will be sporadic and not even a daily occurrence, so it would be fantastic if search data could be unloaded from memory after x minutes of not being used.

cyberalien commented 2 years ago

I've added code to attempt to run garbage collector after memory is supposed to be freed. Seems to help with memory usage at startup.

Requires running script with --expose-gc flag: node --expose-gc lib/index.js
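For context, a minimal sketch of what "attempt to run garbage collector" can look like in code. When Node.js is started with --expose-gc, a global gc() function becomes available; without the flag it is undefined, so the call has to be guarded. The `tryCollectGarbage` helper name is an assumption for illustration and is not taken from the API source.

```typescript
// Ask V8 to collect garbage if Node.js was started with --expose-gc.
// Returns true if gc() was available and called, false otherwise.
function tryCollectGarbage(): boolean {
  const gc = (globalThis as unknown as { gc?: () => void }).gc;
  if (typeof gc === 'function') {
    gc();
    return true;
  }
  return false;
}

// Usage: call after dropping references to large objects
let bigData: number[] | null = new Array(100000).fill(0);
bigData = null; // nothing references the array any more
const collected = tryCollectGarbage();
console.log(collected ? 'gc() was called' : 'gc() is not exposed, skipped');
```

Because the function checks for gc() at run time, the same build works with or without the flag; it just skips the explicit collection when the flag is missing.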

markus-li commented 2 years ago

That sounds great! I'll try that out. One thing to note: when running things in containers, command line flags are not very user-friendly. If possible I'd recommend using environment variables instead/also.

cyberalien commented 2 years ago

This can't be toggled in code, it requires running node with flag.

markus-li commented 2 years ago

If it should always run with this flag, under all circumstances, I suppose changing init.sh to just include it would be all there is to do. Otherwise init.sh could be modified to check an environment variable for additional flags to pass to node.

cyberalien commented 2 years ago

Added flag to start command.

Also reduced memory usage a bit by not generating data needed for icons list if icons list is disabled. For all icon sets data for all icons uses 23mb with minimal info (~160 bytes per icon), 30-33mb with data needed for icons list.

markus-li commented 2 years ago

Sounds like some very good memory usage levels. I created a PR to add that flag to init.sh.

cyberalien commented 2 years ago

Merged. Thanks!

cyberalien commented 2 years ago

Added search engine! Now API has all functionality that old API has.

Search engine uses quite a bit of memory because it keeps index of all keywords in memory. Unlike icon data, it cannot be saved to cache and loaded on demand because it is rather complex to improve performance and save memory. Biggest issue is keywords point to icon objects, not icon names, which cannot be stored/loaded without breaking pointers.
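The "breaking pointers" problem can be shown with a tiny sketch. It assumes a much simplified index in which keywords map to arrays of icon objects; `IconEntry` and the sample data are hypothetical. In memory, two keywords can share one icon object, but a JSON round trip creates a separate copy for every keyword that referenced it.

```typescript
interface IconEntry {
  name: string;
}

// Two keywords point to the same icon object (shared reference)
const arrowLeft: IconEntry = { name: 'arrow-left' };
const keywordIndex: Record<string, IconEntry[]> = {
  arrow: [arrowLeft],
  left: [arrowLeft],
};

// In memory both keywords reference one and the same object
const sharedBefore = keywordIndex.arrow[0] === keywordIndex.left[0];

// Saving and loading via JSON duplicates the object for each keyword
const restoredIndex: Record<string, IconEntry[]> = JSON.parse(
  JSON.stringify(keywordIndex)
);
const sharedAfter = restoredIndex.arrow[0] === restoredIndex.left[0];

console.log({ sharedBefore, sharedAfter });
```

Storing icon names instead of objects would survive serialization, at the cost of an extra lookup per result, which is the trade-off the comment above describes.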

If you don't need search functionality, it can be disabled by:

Changing enableSearchEngine in src/config/app.ts
Env variable ENABLE_SEARCH_ENGINE=false in either command line ENABLE_SEARCH_ENGINE=false nr start or .env file
Disabling icons list option enableIconLists. If icons list is disabled, search engine is disabled too.
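The three options above interact: the search engine is only active when icon lists are enabled too. Here is a minimal sketch of that rule, with loud assumptions: `envFlag` and `resolveFeatures` are hypothetical helpers, the `ENABLE_ICON_LISTS` variable name is made up to mirror `enableIconLists`, and which spellings the real API accepts as "off" is not stated in this thread. Only `ENABLE_SEARCH_ENGINE=false` comes from the comment above.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical parser: an unset variable keeps the default,
// a few common "off" spellings disable the feature.
function envFlag(env: Env, name: string, defaultValue: boolean): boolean {
  const value = env[name];
  if (value === undefined || value === '') {
    return defaultValue;
  }
  const normalized = value.toLowerCase();
  return !(normalized === 'false' || normalized === '0' || normalized === 'no');
}

function resolveFeatures(env: Env): {
  enableIconLists: boolean;
  enableSearchEngine: boolean;
} {
  // ASSUMPTION: ENABLE_ICON_LISTS is an illustrative name, not a documented variable
  const enableIconLists = envFlag(env, 'ENABLE_ICON_LISTS', true);
  // Search engine depends on icon lists: disabling lists disables search too
  const enableSearchEngine =
    enableIconLists && envFlag(env, 'ENABLE_SEARCH_ENGINE', true);
  return { enableIconLists, enableSearchEngine };
}

// Usage: in a Node.js app this would typically be resolveFeatures(process.env)
console.log(resolveFeatures({ ENABLE_SEARCH_ENGINE: 'false' }));
```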

markus-li commented 2 years ago

This is awesome, I will definitely try it as soon as I have the opportunity! I definitely want the search functionality on our embedded deployments, but since search is only used occasionally, it'd be great to find a way to offload the search data from memory to disk. Not tested things yet, so who knows, maybe the footprint really won't matter. Thank you for the great work you've put into all this!

cyberalien commented 2 years ago

Icons list + search index add about 30mb to memory usage, using full icon sets package.

markus-li commented 2 years ago

That does sound like a footprint not worth chasing any further, there's services doing way less using way more. I'll be going on a trip soon so won't have time to test this much, but I will when I get back. Thank you again for this awesome piece of code :)

cyberalien commented 2 years ago

I think API is usable now. Archived old branch, moved new version to main branch.

New documentation is available: https://docs.iconify.design/api/

So far documentation includes most configuration options and API queries, including full documentation for search engine.

Most likely this API will be put on live servers this or next week, replacing old API servers. Will start with switching server in Frankfurt in 2-3 days. Responses are almost identical to the old API, except for search results and it supports all legacy stuff.

Closing this as completed.

cyberalien commented 2 years ago

New version is now running on all Iconify API servers.

It runs flawlessly so far. In stress test it could handle over 100k queries per minute, with actual traffic maximum was about 9.8k queries per minute on server in NJ, USA, which it handled without sweat and server load not going above 0.1.

Servers are basic cheap VPS from Linode and Vultr with 2gb memory. 9 servers all over the world for smallest possible latency, using AWS Route53 latency routing to redirect visitors to closest server, but only 2 biggest ones (in NJ, USA and in Frankfurt, DE) are handling almost all traffic.
