ml4ai / skema

SKEMA: Scientific Knowledge Extraction and Model Analysis
https://ml4ai.github.io/skema/

[TR] Mmap word embedding models #294

Closed enoriega closed 1 year ago

enoriega commented 1 year ago

The memory footprint of TR is too large: it requires at least 20 GB to run. I believe this is due to a couple of word embedding models we load into memory for grounding.

Can we mmap them such that the overall RAM usage is decreased?

kwalcock commented 1 year ago

There might be all of

- `/org/clulab/epimodel/epidemiology_embeddings_model.ser`
- `/org/clulab/spaceweather/spaceweather_model_unigram.ser`?
- Gigaword for CosmosTextReadingPipeline
- Glove for CluProcessor

- CluProcessor for MiraEmbeddingsGrounder
- FastNLPProcessor for OdinEngine

and more in there. The changes needed to memory-map the embeddings would be quite extensive. It would probably be easier to start by sharing Glove between the CluProcessor and the CosmosTextReadingPipeline, so that Gigaword is not necessary, and then to share a single CluProcessor between the OdinEngine and the MiraEmbeddingsGrounder, so that FastNLPProcessor is not necessary. Those things should probably be done anyway.
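A minimal sketch of the sharing idea in Scala, assuming the pipelines can accept a processor through their constructors (the actual OdinEngine and MiraEmbeddingsGrounder signatures may differ):

```scala
import org.clulab.processors.clu.CluProcessor

// Hypothetical sketch: one lazily initialized CluProcessor that every pipeline
// component reuses instead of constructing its own copy (and thereby loading
// its own copy of the embeddings).
object SharedProcessor {
  lazy val instance: CluProcessor = new CluProcessor()
}

// Assumed constructor injection; the real OdinEngine and MiraEmbeddingsGrounder
// may not expose a processor parameter like this.
// val odinEngine = OdinEngine(config, processor = SharedProcessor.instance)
// val grounder   = MiraEmbeddingsGrounder(config, processor = SharedProcessor.instance)
```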

enoriega commented 1 year ago

Right. I believe we should share those instances to reduce the memory footprint. Let's circle back on this after the hackathon.

myedibleenso commented 1 year ago

Our first goal is to reduce memory consumption below 16 GB (currently at 20 GB).

The next goal would be to reduce it to below 8 GB.

myedibleenso commented 1 year ago

@enoriega , can you please summarize your concerns about using a shared processor? @kwalcock , do we have code elsewhere to mem-map those two sets of embeddings?

enoriega commented 1 year ago

I believe we have to do it. I am only concerned that our extractions will change drastically if we change the processor type. We should be able to catch this by monitoring the unit tests.

Ideally, we should have a singleton processor shared among all pipeline instances.

kwalcock commented 1 year ago

I have not seen any memory-mapping code in any clulab or lum-ai project, and I would probably have noticed anything related specifically to the embeddings. I suspect it will be slow, but slow might be worth it in this case, and it's nice to have options for varying constraints. I can imagine keeping a map of strings in memory whose values point to the offsets of the vectors in the file. We will also probably need to extract the embeddings from the jar to a local file at least once, unless the files are distributed in a different way.
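A rough sketch of that layout, assuming the vectors have been exported from the `.ser` resources into a flat binary file of floats (extracted from the jar first, as noted above) and that an in-memory `Map[String, Long]` records each word's byte offset; the file format and class name are illustrative, not existing project code:

```scala
import java.io.RandomAccessFile
import java.nio.channels.FileChannel
import java.nio.{ByteOrder, MappedByteBuffer}

// Keep the vocabulary and byte offsets in memory; leave the vectors in a
// memory-mapped file so the OS pages them in on demand.
// Note: a single MappedByteBuffer is limited to 2 GB, so a larger file would
// have to be mapped in chunks.
class MmapEmbeddings(vectorFile: String, offsets: Map[String, Long], dim: Int) {
  private val channel = new RandomAccessFile(vectorFile, "r").getChannel
  private val buffer: MappedByteBuffer =
    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
  buffer.order(ByteOrder.LITTLE_ENDIAN)

  // Look up a word's vector by reading dim floats starting at its byte offset.
  def vector(word: String): Option[Array[Float]] =
    offsets.get(word).map { offset =>
      val vec = new Array[Float](dim)
      var i = 0
      while (i < dim) {
        vec(i) = buffer.getFloat((offset + i * 4L).toInt)
        i += 1
      }
      vec
    }
}
```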

enoriega commented 1 year ago

Here are some extra thoughts:

enoriega commented 1 year ago

This looks relevant and readily available. If Tables 3 and 4 transfer to our task, we're looking at a 100x reduction in model size, although we would need to add some bookkeeping to compose each embedding out of the codebooks. It doesn't sound too bad.

Code: https://github.com/zomux/neuralcompressor
Paper: https://arxiv.org/pdf/1711.01068.pdf

Perhaps we could try a quick experiment with one embedding model, and if it looks good enough, we can load it in Scala.
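A quick sketch of the bookkeeping that reconstruction would need, assuming the compressed model ships M codebooks of shared basis vectors plus an M-length integer code per word (the class and field names are illustrative, not from the neuralcompressor repo):

```scala
// Compose an embedding from compositional codes, following the scheme in the
// linked paper: each word is stored as M small integer codes, and its vector
// is the sum of the selected basis vectors, one per codebook.
//
//   codebooks: M x K x dim  (the shared basis vectors)
//   codes:     word -> Array of M indices, each in [0, K)
class CodebookEmbeddings(codebooks: Array[Array[Array[Float]]],
                         codes: Map[String, Array[Int]]) {
  private val dim = codebooks.head.head.length

  def vector(word: String): Option[Array[Float]] =
    codes.get(word).map { wordCodes =>
      val vec = new Array[Float](dim)
      for (m <- codebooks.indices; d <- 0 until dim)
        vec(d) += codebooks(m)(wordCodes(m))(d)
      vec
    }
}
```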

kwalcock commented 1 year ago

Keith is working on further reducing the memory footprint.

enoriega commented 1 year ago

This was completed by @kwalcock.