gyorilab / gilda

Grounding of biomedical named entities with contextual disambiguation
BSD 2-Clause "Simplified" License

High memory footprint #95

Closed ravwojdyla closed 2 years ago

ravwojdyla commented 2 years ago

👋 thank you for all your hard work.

It looks like gilda has a pretty high memory footprint, about 1.5 GB, most of which comes from load_terms_file. The grounding_terms.tsv file is about 200 MB (about 30 MB compressed). For context, the models loaded via load_gilda_models take up about 256 MB of memory.

The gilda.get_grounder().entries dict is close to 1.5 GB in memory, with 1.6 million keys and Term objects as values, all preloaded into memory. The Term object could be made more memory efficient, e.g. a dataclass with slots defined plus more compact field types. All this unfortunately makes gilda problematic to use in a multi-process environment (e.g. pyspark), and startup is also slow, since all term entries are created up front even though most of them may never be used.
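
To illustrate the per-object overhead, here is a minimal sketch with hypothetical Term-like classes (not gilda's actual Term), comparing a regular class against one with __slots__:

```python
import sys

# Hypothetical Term-like classes for illustration only
class TermDict:
    def __init__(self, text, db, id):
        self.text = text
        self.db = db
        self.id = id

class TermSlots:
    __slots__ = ("text", "db", "id")

    def __init__(self, text, db, id):
        self.text = text
        self.db = db
        self.id = id

a = TermDict("melanoma", "MESH", "D008545")
b = TermSlots("melanoma", "MESH", "D008545")
# The slotted instance has no per-instance __dict__, which is where
# most of the savings come from when holding millions of objects
print(sys.getsizeof(b), "<", sys.getsizeof(a) + sys.getsizeof(a.__dict__))
```

At 1.6M+ objects, dropping the per-instance dict alone can save hundreds of bytes per Term.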

Fortunately this is a solved problem. If I may suggest, maybe grounding_terms should be distributed as some kind of binary format/index. This could be any (constant) KV store built for read-heavy ops, even the builtin dbm or shelve, but also a specialised store such as leveldb, sparkey, sqlite3, etc.
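
A minimal sketch of the read-heavy pattern using the builtin dbm (the file name and serialized value are made up for illustration):

```python
import dbm
import json

# Build the on-disk index once, e.g. at packaging/release time
with dbm.open("terms_index", "c") as db:
    db["melanoma"] = json.dumps([{"db": "MESH", "id": "D008545"}])

# Consumers open it read-only and pay only for the keys they touch,
# instead of materializing 1.6M Term objects up front
with dbm.open("terms_index", "r") as db:
    terms = json.loads(db["melanoma"])
print(terms)
```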

This way:

What do you think?

bgyori commented 2 years ago

Hi @ravwojdyla, thank you for the comment and suggestions! Before going into pros and cons, I wanted to ask, what approach do you use for memory profiling so I can also compare options using the same metric?

ravwojdyla commented 2 years ago

👋 @bgyori you can use memray for example.

ravwojdyla commented 2 years ago

@bgyori is there anything else that I could assist in?

bgyori commented 2 years ago

Hi @ravwojdyla, I started experimenting with different options and weighing advantages and disadvantages. However, I would generally recommend that if memory usage and startup time is an issue, that you run Gilda as a (local) web service that each of your parallel processes can communicate with. This is how we run Gilda in some applications (e.g., parallel dialogue sessions) where we don't want each individual process to maintain its own instance. Is this not a viable option in your case?

ravwojdyla commented 2 years ago

> Is this not a viable option in your case?

@bgyori thanks for the update. In our context we run gilda inference as part of a Spark task. Setting up a web server as part of Spark is certainly feasible, but far from straightforward (handling the ops plus remote comms), especially compared to using gilda as a library, as in other model inference use cases. I would imagine that anyone using gilda from a parallel processing framework will run into this issue. Does that make sense?

bgyori commented 2 years ago

I see, let me try to press on this a bit more though. In principle, your nodes could communicate with the public web service running at http://grounding.indra.bio through HTTP requests instead of using Python library calls. What I'm suggesting is just that you could have a running web service instance of Gilda on some arbitrary local infrastructure (possibly independent of Spark), the only thing being required is that your Spark processes be able to send requests to it through a URL. This can also be done via Docker to avoid having to have a local Python environment configured (https://github.com/indralab/gilda#run-web-service-with-docker).

ravwojdyla commented 2 years ago

@bgyori sure, all of that is feasible in theory, but I hope it's clear that it's a significantly more complicated setup than using gilda as a pure Python library? Running our own Gilda server introduces state that requires babysitting. Doing HTTP requests (over the Internet) introduces latency, failure recovery concerns, and (arguably small) cost. Out of curiosity, is this issue/use case not something you intend to support out of the box?

dhimmel commented 2 years ago

Thinking about the future, it's reasonable to expect grounding terms to grow in size as more resources and entity types are supported. I can imagine tagging support for genomic features like SNPs would greatly increase the memory footprint. So this would be an argument for enabling an on-disk backend for the grounding terms as a solution here.

Some of the tools @ravwojdyla mentioned have similar APIs to a python dict, right? Such that support shouldn't be too burdensome.

bgyori commented 2 years ago

Hi @ravwojdyla, I am in fact looking into the issue, not dismissing it, but I want to make sure I highlight the option of having a single Gilda service running that multiple other processes can communicate with. I still think there might be a misunderstanding with respect to this, since I don't think "this it's significantly more complicated setup than using gilda as a pure py library". For instance, wherever your process is calling

matches = gilda.ground('melanoma')

you could instead call

matches = requests.post(gilda_url, json={'text': 'melanoma'}).json()

to get the exact same result - in the first case represented as Python objects, in the second, as JSON.

ravwojdyla commented 2 years ago

> I still think there might be a misunderstanding with respect to this since I don't think "this it's significantly more complicated setup than using gilda as a pure py library".

@bgyori thank you for double checking, I appreciate that. No misunderstanding, it all makes sense, but a "simple" requests.post introduces an array of potential issues:

> Doing HTTP requests (over the Internet) introduces latency, failure recovery and (arguable small) cost.

To be fair, this is definitely a viable solution for manual use or a handful of requests. In some cases we run grounding on millions of terms, in which case we need to worry about latency, failures, etc. Does that make sense?
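
For what it's worth, the failure-recovery side can be mitigated with standard retry machinery; a rough sketch (the endpoint URL is hypothetical, not gilda's actual deployment):

```python
import requests
from requests.adapters import HTTPAdapter, Retry

# Hypothetical endpoint; in practice this would point at a local Gilda service
gilda_url = "http://localhost:8001/ground"

session = requests.Session()
# Retry transient failures with exponential backoff
retries = Retry(total=5, backoff_factor=0.5,
                status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))

def ground_many(texts):
    """Ground a batch of strings via HTTP, one request per string."""
    results = []
    for text in texts:
        resp = session.post(gilda_url, json={"text": text}, timeout=10)
        resp.raise_for_status()
        results.append(resp.json())
    return results
```

This handles transient failures, but the latency and throughput concerns for millions of terms remain.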

ravwojdyla commented 2 years ago

> Some of the tools @ravwojdyla mentioned have similar APIs to a python dict, right?

@dhimmel correct, apart from sqlite3, all of them have a dict-like API. Also a good point about future growth and use cases.

bgyori commented 2 years ago

I started first with dbm and shelve. These two are very similar with the main difference being that dbm is limited to string values whereas shelve can represent any complex type as value. Therefore, shelve seems to be the right fit for our case given that the grounding dictionary's values are lists of Term objects.

I first made a gilda.shelve resource file as follows:

import shelve

from gilda.api import grounder

gr = grounder.get_grounder()  # This loads the default grounding terms from the TSV
with shelve.open('gilda.shelve') as db:
    for norm_text, terms in gr.entries.items():
        db[norm_text] = terms
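
The read side then opens the file lazily; a tiny self-contained sketch (with a toy value standing in for Term objects):

```python
import shelve

# Write a toy shelve, then reopen it read-only; each value is
# unpickled only when its key is accessed, not at open time
with shelve.open("demo.shelve") as db:
    db["melanoma"] = [{"db": "MESH", "id": "D008545"}]

with shelve.open("demo.shelve", flag="r") as db:
    terms = db["melanoma"]
print(terms)
```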

Resource file size: The resulting gilda.shelve file is 737 MB, which is much larger than the current default grounding_terms.tsv resource file (221 MB uncompressed, 34 MB compressed).

Startup time: Startup with shelve (i.e., doing shelve.open('gilda.shelve', 'r')) is around 7 s, compared to 14 s with the approach we currently use of loading the TSV resource file fully into memory as a dict of lists of Terms.

Memory usage: Using memray for profiling, loading a Grounder instance backed by shelve, without doing any further operations, uses 966 MB vs 2.6 GB with the original approach (not sure why I am seeing 2.6 GB vs the 1.5 GB mentioned above). Though the difference gets smaller after running the Grounder on ~86k benchmark strings, there is still a 1.2 GB gap at the end of the benchmark.

Performance: On a benchmark set of ~86k strings, the old grounder is about 38% faster than the one using shelve (~18k vs ~13k groundings per second).

Overall, what we see is that with shelve we have faster startup and lower memory usage, but a larger resource file, and slower performance. @ravwojdyla, @dhimmel looking at the quantitative comparison, what do you think about this "tradeoff profile"?

dhimmel commented 2 years ago

Thanks @bgyori for this nice profiling. I could see the 737 MB file size being problematic in transit. What does it compress to? And how long does it take to convert grounding_terms.tsv to a shelve? Perhaps you would only need to distribute grounding_terms.tsv.gz, and the shelve could be generated upon first use.

I wonder if the writeback=False argument when reading the shelve would help with the memory bloat:

shelve.open('gilda.shelve', flag='r', writeback=False)

Curious as to whether @ravwojdyla thinks shelve is the right solution given these results.

ravwojdyla commented 2 years ago

@bgyori thanks for a nice writeup and looking into dbm and shelve!

> Using memray for profiling, loading a Grounder instance without doing any further operations with shelve uses 966 MB vs 2.6 GB with the original approach

This seems a bit fishy; I wonder where the extra memory comes from 🤔 I tried running a similar experiment; here's the memray command:

python -m memray run --live test_gilda.py


test_gilda.py is:

import shelve
db = shelve.open('gilda.shelve', flag="r")

import time
time.sleep(100000)

This is on a Debian box, Python 3.9.7, and results in ~370 MB; the original approach is still around 1.5 GB.

[shelve memray screenshot] vs [original memray screenshot]

> Overall, what we see is that with shelve we have faster startup and lower memory usage, but a larger resource file, and slower performance.

The file size is definitely larger than I would expect. Tho it compresses pretty well to ~56MB.

bgyori commented 2 years ago

> python -m memray run --live test_gilda.py
>
> test_gilda.py is:
>
> import shelve
> db = shelve.open('gilda.shelve', flag="r")
>
> import time
> time.sleep(100000)

This for me produces 737 MB for Max heap size seen.

> The file size is definitely larger than I would expect. Tho it compresses pretty well to ~56MB.

I tried gzipping it and actually get 129 MB. Also, I would have to check how reading from a gzipped shelve at runtime changes the benchmarks.

ravwojdyla commented 2 years ago

@bgyori now that I think about it, maybe we could build our own terms index that best fits our use case and pass it into Grounder(terms=<OUR_DICT_LIKE_TERMS_DB>). Is there anything that would prevent us from doing this easily?

ravwojdyla commented 2 years ago

> This for me produces 737 MB for Max heap size seen.
>
> I tried gzipping it and I actually get 129 MB. Also, I would probably have to try to see how reading from a gzipped shelve during runtime changes the benchmarks.

@bgyori oh, maybe this is a different gilda version; we are still on 0.6.1, you are probably using the latest.

EDIT: nope, with a gilda 0.9.0 based shelve I see:

[memray screenshot]

EDIT 2: I wonder where these differences between our machines come from:

> gzip -k gilda.shelve.dat

> ls -lah gilda.shelve.dat*
-rw-r--r-- 1 rav rav 819M Jun  6 23:39 gilda.shelve.dat
-rw-r--r-- 1 rav rav  56M Jun  6 23:39 gilda.shelve.dat.gz

> gzip --version
gzip 1.10

Which dbm backend is being used on your end? See:

dbm.whichdb("gilda.shelve")  #  => 'dbm.dumb'

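
If reproducibility across machines matters, the backend can be pinned explicitly rather than left to dbm's auto-detection; a sketch using the pure-Python dbm.dumb (file names are illustrative):

```python
import dbm
import dbm.dumb
import shelve

# Create the shelve over an explicitly chosen dbm backend, so every
# machine produces and reads the same on-disk format
raw = dbm.dumb.open("gilda_dumb.shelve", "c")
with shelve.Shelf(raw) as db:
    db["melanoma"] = ["..."]

print(dbm.whichdb("gilda_dumb.shelve"))  # 'dbm.dumb'
```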
bgyori commented 2 years ago

Oh okay, mine says dbm.gnu, so somehow we ended up with different backends. Still, unless I'm missing something, you cannot really read directly from a gzipped shelve file, right?

Also, a remark on startup times: I measured this in the context of overall Gilda startup, which also has to load disambiguation models, and the 7 s mostly accounts for that part, independent of the main grounding terms.

dhimmel commented 2 years ago

> you cannot really read directly from a gzipped shelve file right

Correct. The shelve cannot be compressed while reading or writing it. I was suggesting compressing it at the location where you distribute it, such that the user doesn't have to download such a large file. But it would be decompressed locally before use.

> we can actually build our own terms using an index that best fits our use case

Some discussion of creating subsets of the grounding terms is at https://github.com/indralab/gilda/issues/63. There's a namespaces option for gilda.ground now, but perhaps a namespaces option when loading the term set would mean only a small portion of the terms is loaded into memory. @bgyori how easy would it be to apply namespaces to a Grounder so it only loads a subset of the grounding terms?

bgyori commented 2 years ago

Next up is sqlite3. I constructed the db as follows:

import json
import sqlite3
import tqdm

from gilda.api import grounder

gr = grounder.get_grounder()  # This loads the default grounding terms from the TSV

db = 'gilda.db'
conn = sqlite3.connect(db)
cur = conn.cursor()
q = """CREATE TABLE terms (
           norm_text text not null primary key,
           terms text
       )"""
cur.execute(q)

for norm_text, terms in tqdm.tqdm(gr.entries.items()):
    q = """INSERT INTO terms (norm_text, terms) VALUES (?, ?)"""
    cur.execute(q, (norm_text, json.dumps([t.to_json() for t in terms])))

q = """CREATE INDEX norm_index ON terms (norm_text);"""
cur.execute(q)
conn.commit()

I implemented a wrapper to make it fit in seamlessly as the Grounder class' entries attribute (at least as far as simple grounding goes):

import json
import sqlite3

from gilda.term import Term

class SqliteEntities:
    def __init__(self, db):
        self.db = db
        self.conn = sqlite3.connect(self.db)

    def get(self, key, default=None):
        res = self.conn.execute("SELECT terms FROM terms WHERE norm_text=?", (key,))
        result = res.fetchone()
        if not result:
            return default
        return [Term(**j) for j in json.loads(result[0])]
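
The lookup pattern is easy to exercise in isolation; a self-contained sketch with an in-memory database and plain dicts standing in for Term objects:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (norm_text TEXT PRIMARY KEY, terms TEXT)")
conn.execute(
    "INSERT INTO terms VALUES (?, ?)",
    ("melanoma", json.dumps([{"db": "MESH", "id": "D008545"}])),
)

def lookup(conn, key, default=None):
    # Mirrors SqliteEntities.get: one indexed point query per key
    row = conn.execute(
        "SELECT terms FROM terms WHERE norm_text=?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else default

print(lookup(conn, "melanoma"))  # [{'db': 'MESH', 'id': 'D008545'}]
print(lookup(conn, "unknown"))   # None
```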

Resource file size: The resulting gilda.db file is 646 MB, again much larger than the original tsv(.gz).

Startup time: Again, this is around 7 seconds, but since connecting to the sqlite DB is instantaneous, this is all due to other initialization steps when instantiating a Grounder (notably loading disambiguation models). The same applies to the shelve evaluation above.

Memory usage: Loading the grounder uses 966 MB (exactly the same as with shelve), and after grounding 86k strings, it still uses about 1.6 GB less RAM than the original approach and about 425 MB less than shelve.

Performance: Using sqlite is slower, though not by much, compared to shelve and the original approach. It can still ground around 11k strings/s, which makes the original 64% faster.

So overall, in this case we still have a much larger resource file, but otherwise we get better memory usage and somewhat slower performance compared to shelve. One advantage of sqlite is that it is more portable than shelve. A disadvantage is that further wrapper methods, e.g. values(), would have to be implemented to make the sqlite backend behave like a dict and support all the ways in which the Grounder uses its entries.

dhimmel commented 2 years ago

The portability of sqlite is nice; it could be used from other languages like R or Julia in the future. I could also see terms becoming its own table rather than a serialized JSON field, which would open the door to more advanced SQL queries on the dataset.

> Loading the grounder uses 966 MB (exactly the same as with shelve)

Surprised this is so high. Are there other things besides grounding terms that could take up a lot of memory?

bgyori commented 2 years ago

> Surprised this is so high. Are there other things besides grounding terms that could take up a lot of memory?

Not entirely sure; my guess is the disambiguation models that are loaded. I'm just looking at the overall memory usage of getting/using a Grounder instance, not the DB backend in isolation.

dhimmel commented 2 years ago

> my guess is the disambiguation models that are loaded

Looks like they all get loaded at once and stored in Grounder.gilda_disambiguators:

https://github.com/indralab/gilda/blob/6255ec004e3c135c5cf01afcffe292e125d4de96/gilda/grounder.py#L584-L589

This actually seems like low-hanging fruit for memory and startup optimization. One could use a shelve instead of a pickle containing all the models, and then load each disambiguator lazily.
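
A lazy-loading wrapper along these lines could look as follows (load_model is a hypothetical stand-in for unpickling one disambiguator, not gilda's actual loader):

```python
class LazyDisambiguators:
    """Load each disambiguation model on first use, not at startup."""

    def __init__(self, loader):
        self._loader = loader
        self._models = {}

    def get(self, name):
        # Cache so each model is loaded at most once
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]

# Hypothetical stand-in for an expensive per-model unpickle
def load_model(name):
    return {"name": name}

dis = LazyDisambiguators(load_model)
assert not dis._models     # nothing loaded at startup
m = dis.get("ER")          # first use triggers the load
assert dis.get("ER") is m  # second use hits the cache
```

With this pattern, grounding without context would never pay the ~250 MB model cost.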

ravwojdyla commented 2 years ago

@bgyori thanks for sqlite3 stats, cool to see them.

> Using sqlite is slower, though not by much, compared to shelve and the original approach. It can still ground around 11k strings / s which makes the original 64% faster.

And earlier:

> On a benchmark set of ~86k strings, the old grounder is about 38% faster than the one using shelve (~18k vs ~13k groundings per second).

There's a discrepancy between these two comments. Are you sure you meant 64% faster in the sqlite case?

Regarding the original list:

> dbm or shelve, but leveldb, sparkey, sqlite3 etc.

leveldb and sparkey are not really an option for gilda's internal index, since they require extra system dependencies. A user (like me) might still be able to mint an index using those and https://github.com/indralab/gilda/issues/95#issuecomment-1148057496. That said, I would add one more option to the benchmark: lmdb. I like that it has shared memory mapping built in (a sparkey index is loaded in a similar way), which works nicely in a multi-process setup because a single memory region (with the DB memory mapped) can be shared across many processes, without the extra per-process memory cost that is really the original problem for us. One caveat is that lmdb by default supports keys up to 511 bytes; there are about 3k keys (in the grounding terms, out of 1.6M) longer than that, so we would need to handle those. At least in my test, the db file ends up:

> du -sh gilda.lmdb/data.mdb*
718M    gilda.lmdb/data.mdb
79M     gilda.lmdb/data.mdb.gz

Here's some code to create the db:

import lmdb
import json
from gilda.api import grounder

gr = grounder.get_grounder()  # This loads the default grounding terms from the TSV
long_keys = 0

env = lmdb.open("gilda.lmdb", map_size=int(10.147e+9))
with env.begin(write=True) as txn:
    for norm_text, terms in gr.entries.items():
        # NOTE: lmdb's default max key size is 511 bytes; skip longer keys for now
        if len(norm_text.encode()) > 511:
            long_keys += 1
            continue
        txn.put(norm_text.encode(), json.dumps([t.to_json() for t in terms]).encode())

Btw, regarding the performance hit from switching away from the per-process in-memory representation: at least in our multi-process setup it's acceptable, because the current memory footprint (~1.5 GB/process) forces us to reduce the number of processes by ~50%. Say we have a VM with 16 CPUs and run parallel gilda grounding using Spark; in theory we should be able to use 16 processes, but we are forced to cut that to 8, which is already a 50% perf hit (with a 50% smaller memory footprint). If we can use something memory efficient/mapped with, say, a ~40% perf hit (assuming a cold cache), 16 * 0.6 = 9.6 effective processes is still better in the worst case, with a ~90% smaller memory footprint. And this should get better (perf-wise) with a warm cache.

bgyori commented 2 years ago

The rough performance numbers I gave seem consistent to me: 18k/13k=1.38 (original/shelve groundings per second) and 18k/11k=1.64 (original/sqlite groundings per second), though perhaps this is not the best way to report these percentages.

Since we started this thread I tried to more carefully decompose where memory usage comes from when using Gilda as a whole (not parts of it in isolation).

  1. The main grounding entries data structure, currently represented in memory as a dict with normalized text as keys and lists of Terms as values. This takes around 1.5 GB of RAM. Currently, I would be most comfortable with supporting sqlite as an optional back-end to effectively eliminate this component of memory usage while taking some hit in performance.
  2. The Gilda disambiguation models (ones built using and distributed with Gilda) use around 250MB of RAM when loaded into memory. The easiest solution here would be to implement lazy loading. This would mean that if context is not passed to grounding, these models would never be loaded, and even if context is used for grounding, only those models would be loaded that are actually necessary.
  3. The Adeft disambiguation models (that are built and distributed by a separate package) are currently also loaded when the Grounder is first instantiated and use around 540MB of RAM. Again we can turn this into lazy loading where if context is not used for grounding, these would never be loaded into memory.

Having said this, @ravwojdyla, it might be useful for this discussion if you described a bit what you are trying to ground. Is it a restricted set of entities like human genes? Or can the strings you are grounding be small molecules, diseases, etc.? If you are dealing with a restricted set, you can easily create your own custom grounder instance, which can potentially be orders of magnitude smaller than what is needed for the general default setting that Gilda supports.
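
As a sketch of the restricted-set idea (terms represented here as plain dicts with a db namespace field; in practice gilda's real Term objects and Grounder constructor would be used):

```python
# Toy entries: normalized text -> list of terms with a `db` namespace
entries = {
    "melanoma": [{"db": "MESH", "id": "D008545"}],
    "braf": [{"db": "HGNC", "id": "1097"}, {"db": "MESH", "id": "C498167"}],
}

def filter_namespaces(entries, namespaces):
    """Keep only terms grounded to the given namespaces."""
    out = {}
    for norm_text, terms in entries.items():
        kept = [t for t in terms if t["db"] in namespaces]
        if kept:
            out[norm_text] = kept
    return out

# A genes-only grounder would start from a subset like this
hgnc_entries = filter_namespaces(entries, {"HGNC"})
print(hgnc_entries)  # {'braf': [{'db': 'HGNC', 'id': '1097'}]}
```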

ravwojdyla commented 2 years ago

> The rough performance numbers I gave seem consistent to me: 18k/13k=1.38 (original/shelve groundings per second) and 18k/11k=1.64 (original/sqlite groundings per second), though perhaps this is not the best way to report these percentages.

Ah, I see, my bad, I misunderstood the interpretation of those numbers.

> Currently, I would be most comfortable with supporting sqlite as an optional back-end to effectively eliminate this component of memory usage while taking some hit in performance.

I'm curious if there's any chance you could give lmdb a quick try in your benchmark? I'd be very interested to see the performance hit there. I'm also asking because it would be optimal for our multi-process use case, but it could also be a reasonable solution for a lazily loaded single-process setup.

> Having said this, @ravwojdyla, it might be useful for this discussion if you described a bit what you are trying to ground. Is it a restricted set of entities like human genes? Or can the strings you are grounding be small molecules, diseases, etc.?

I think we are currently mostly interested in human genes, proteins and diseases. Is there anything I'm missing @dhimmel?

bgyori commented 2 years ago

I implemented #96 and #97 which I believe fully resolve this issue, and if used, result in virtually no startup time and minimal memory usage. Unfortunately I don't think I can commit to investigating further options at this time.

ravwojdyla commented 2 years ago

@bgyori sounds great, thank you so much!

> Unfortunately I don't think I can commit to investigating further options at this time.

Out of curiosity, if we create a PR with an lmdb or similar backend using #97 as a blueprint, would you be interested in merging that, or should we keep it in-house?

Edit: also, do you have the benchmark code used above somewhere?

bgyori commented 2 years ago

The sqlite solution does the job with standard built-in libraries, so I think this will be fine for now, but of course all the code I provided should help you make further adaptations for highly specialized settings. Various benchmark scripts can be found at https://github.com/indralab/gilda/tree/master/benchmarks; you can adapt these to your purposes as needed.

ravwojdyla commented 2 years ago

> Various benchmark scripts can be found at: https://github.com/indralab/gilda/tree/master/benchmarks that you can adapt to your purposes as needed.

@bgyori thanks, and which benchmark did you use for this issue and performance stats above? All benchmarks in that directory seem at least a month old.

bgyori commented 2 years ago

I didn't implement my tests as new benchmark scripts. I just ran memray manually to check usage, and ran the existing BioID benchmark to assess performance.

pablogsal commented 2 years ago

👋 Hi @bgyori,

I am one of the authors of memray. We are collecting success stories here. If you have a minute, do you mind leaving a short message on how memray helped with this issue? Knowing how we managed to help will let us track trends, target areas for improvement, prioritize new features and development, and identify potential bugs or areas of confusion.

Thanks a lot for your consideration and for helping us improve the profiler :)

bgyori commented 2 years ago

Thanks @pablogsal, I wrote a message!