Hi @ravwojdyla, thank you for the comment and suggestions! Before going into pros and cons, I wanted to ask, what approach do you use for memory profiling so I can also compare options using the same metric?
👋 @bgyori you can use memray for example.
@bgyori is there anything else I could assist with?
Hi @ravwojdyla, I started experimenting with different options and weighing advantages and disadvantages. However, if memory usage and startup time are an issue, I would generally recommend running Gilda as a (local) web service that each of your parallel processes can communicate with. This is how we run Gilda in some applications (e.g., parallel dialogue sessions) where we don't want each individual process to maintain its own instance. Is this not a viable option in your case?
Is this not a viable option in your case?
@bgyori thanks for the update. In our context we run gilda inference as part of a Spark task; setting up a web server as part of Spark is certainly feasible but far from straightforward (handling the ops plus remote comms), especially compared to using it as a library or to other model-inference use cases. I would imagine that anyone using gilda from a parallel processing framework will run into this issue. Does that make sense?
I see, let me try to press on this a bit more though. In principle, your nodes could communicate with the public web service running at http://grounding.indra.bio through HTTP requests instead of using Python library calls. What I'm suggesting is just that you could have a running web service instance of Gilda on some arbitrary local infrastructure (possibly independent of Spark), the only requirement being that your Spark processes can send requests to it through a URL. This can also be done via Docker to avoid having to configure a local Python environment (https://github.com/indralab/gilda#run-web-service-with-docker).
@bgyori sure, all of that is in theory feasible, but I hope it's clear that this is a significantly more complicated setup than using gilda as a pure Python library? Running our own Gilda server introduces state that requires babysitting. Doing HTTP requests (over the Internet) introduces latency, failure recovery, and (arguably small) cost. Out of curiosity, is this issue/use case not something you intend to support out of the box?
Thinking about the future, it's reasonable to expect grounding terms to grow in size as more resources and entity types are supported. I can imagine tagging support for genomic features like SNPs would greatly increase the memory footprint. So this would be an argument for enabling an on-disk backend for the grounding terms as a solution here.
Some of the tools @ravwojdyla mentioned have similar APIs to a python dict, right? Such that support shouldn't be too burdensome.
Hi @ravwojdyla, I am in fact looking into the issue, not dismissing it, but want to make sure I highlight the option of having a single Gilda service running that multiple other processes can communicate with. I still think there might be a misunderstanding with respect to this since I don't think "this is a significantly more complicated setup than using gilda as a pure Python library". For instance, wherever your process is calling
matches = gilda.ground('melanoma')
you could instead call
matches = requests.post(gilda_url, 'melanoma').json()
to get the exact same result - in the first case represented as Python objects, in the second, as JSON.
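For concreteness, a minimal sketch of what such a remote call could look like against the hosted service, assuming the /ground endpoint accepts a JSON body with a text field as described in the README (the URL and payload here are illustrative):

import requests

# Ground a string via a running Gilda web service instead of the local library.
# Point the URL at a local Docker deployment for production use.
GILDA_URL = "http://grounding.indra.bio/ground"
response = requests.post(GILDA_URL, json={"text": "melanoma"})
matches = response.json()  # list of scored matches, same content as gilda.ground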
I still think there might be a misunderstanding with respect to this since I don't think "this is a significantly more complicated setup than using gilda as a pure Python library".
@bgyori thank you for double checking, I appreciate that. No misunderstanding, it all makes sense, but a "simple" requests.post introduces an array of potential issues though:
Doing HTTP requests (over the Internet) introduces latency, failure recovery, and (arguably small) cost.
To be fair, this is definitely a viable solution for manual use or a handful of requests. In some cases we run grounding on millions of terms, in which case we need to worry about latency, failures, etc. Does that make sense?
Some of the tools @ravwojdyla mentioned have similar APIs to a python dict, right?
@dhimmel correct, apart from sqlite3, all of them will have a dict-like API. Also a good point about future growth and use cases.
I started first with dbm and shelve. These two are very similar, with the main difference being that dbm is limited to string values whereas shelve can represent any complex type as a value. Therefore, shelve seems to be the right fit for our case given that the grounding dictionary's values are lists of Term objects.
I first made a gilda.shelve resource file as follows:
import shelve

from gilda.api import grounder

gr = grounder.get_grounder()  # This loads the default grounding terms from the TSV
with shelve.open('gilda.shelve') as db:
    for norm_text, terms in gr.entries.items():
        db[norm_text] = terms  # the lists of Term objects are pickled by shelve
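For the read side, a minimal sketch of a lazy lookup against the shelve (the key is chosen just for illustration):

import shelve

# Open read-only; only the entries actually looked up are unpickled into memory.
with shelve.open('gilda.shelve', flag='r') as db:
    terms = db.get('melanoma', [])  # list of Term objects, or [] if absent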
Resource file size
The resulting gilda.shelve file is 737 MB, which is much larger than the current default grounding_terms.tsv resource file (221 MB uncompressed, 34 MB compressed).
Startup time
Startup time with shelve (i.e., doing shelve.open('gilda.shelve', 'r')) is around 7s compared to 14s with the approach we use for loading the TSV resource file fully into memory as a dict of lists of Terms.
Memory usage
Using memray for profiling, loading a Grounder instance without doing any further operations with shelve uses 966 MB vs 2.6 GB with the original approach (not sure why I am seeing 2.6 GB vs the 1.5 GB mentioned above). Though the memory usage difference gets smaller after using the Grounder on ~86k benchmark strings, there is still a difference of 1.2 GB at the end of the benchmark.
Performance
On a benchmark set of ~86k strings, the old grounder is about 38% faster than the one using shelve (~18k vs ~13k groundings per second).
Overall, what we see is that with shelve we have faster startup and lower memory usage, but a larger resource file, and slower performance. @ravwojdyla, @dhimmel, looking at the quantitative comparison, what do you think about this "tradeoff profile"?
Thanks @bgyori for this nice profiling. I could see the 737 MB file size being problematic in transit. What does this compress to? And how long does it take to convert grounding_terms.tsv to a shelve? Perhaps you would only need to distribute grounding_terms.tsv.gz and the shelve could be generated upon first use.
I wonder if the writeback=False argument when reading the shelve would help with the memory bloat:
shelve.open('gilda.shelve', flag='r', writeback=False)
Curious as to whether @ravwojdyla thinks shelve is the right solution given these results.
@bgyori thanks for a nice writeup and for looking into dbm and shelve!
Using memray for profiling, loading a Grounder instance without doing any further operations with shelve uses 966 MB vs 2.6 GB with the original approach
This seems a bit fishy, I wonder where the extra memory comes from 🤔 I tried running a similar experiment; here's the memray command:
python -m memray run --live test_gilda.py
And the test_gilda.py is:
import shelve
db = shelve.open('gilda.shelve', flag="r")

import time
time.sleep(100000)  # keep the process alive so memray's live view can inspect it
This is on a Debian box, Python 3.9.7, and results in ~370 MB; the original approach is still around 1.5 GB. That is in contrast to:
Overall, what we see is that with shelve we have faster startup and lower memory usage, but a larger resource file, and slower performance.
The file size is definitely larger than I would expect, though it compresses pretty well to ~56 MB.
Running python -m memray run --live test_gilda.py with the test_gilda.py above: this for me produces 737 MB for Max heap size seen.
The file size is definitely larger than I would expect. Tho it compresses pretty well to ~56MB.
I tried gzipping it and I actually get 129 MB. Also, I would probably have to try to see how reading from a gzipped shelve during runtime changes the benchmarks.
@bgyori now thinking about it, maybe we can actually build our own terms using an index that best fits our use case and pass that into Grounder(terms=<OUR_DICT_LIKE_TERMS_DB>). Is there something that would prevent us from easily doing this?
This for me produces 737 MB for Max heap size seen.
I tried gzipping it and I actually get 129 MB. Also, I would probably have to try to see how reading from a gzipped shelve during runtime changes the benchmarks.
@bgyori oh, maybe this is a different gilda version; we are still on 0.6.1, you are probably using the latest.
EDIT: nope, with a gilda 0.9.0-based shelve I see the same numbers.
EDIT 2: I wonder where these differences between our machines come from:
> gzip -k gilda.shelve.dat
> ls -lah gilda.shelve.dat*
-rw-r--r-- 1 rav rav 819M Jun 6 23:39 gilda.shelve.dat
-rw-r--r-- 1 rav rav 56M Jun 6 23:39 gilda.shelve.dat.gz
> gzip --version
gzip 1.10
Which database is being used on your end? See:
dbm.whichdb("gilda.shelve") # => 'dbm.dumb'
Oh okay, mine says dbm.gnu, so somehow we ended up with different backends. Still, unless I'm missing something, you cannot really read directly from a gzipped shelve file, right?
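In case it helps make the numbers comparable, a sketch of pinning the backend explicitly when (re)building the shelve, rather than letting dbm auto-select (this assumes the gdbm system library is available):

import dbm.gnu  # requires the gdbm system library
import shelve

# Wrap an explicitly chosen gdbm file in a Shelf so both machines end up with
# the same on-disk format instead of whatever dbm picks by default.
db = shelve.Shelf(dbm.gnu.open('gilda.shelve', 'c'))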
Also, a remark on startup times: I measured this in the context of overall Gilda startup which also has to load disambiguation models, and the 7s mostly accounts for that part, independent of the main grounding terms.
you cannot really read directly from a gzipped shelve file right
Correct. The shelve cannot be compressed while reading or writing it. I was suggesting compressing it at the location where you distribute it, such that the user doesn't have to download such a large file. But it would be decompressed locally before use.
we can actually build our own terms using an index that best fits our use case
Some discussion of creating subsets of the grounding terms at https://github.com/indralab/gilda/issues/63. There's a namespaces option for gilda.ground now, but perhaps a namespace option when loading the term set would make it so we're only loading a small portion of the terms into memory. @bgyori how easy would it be to apply namespaces to a Grounder to only load a subset of the grounding terms?
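A rough sketch of that filtering idea, assuming (as elsewhere in this thread) that entries maps normalized text to lists of Term objects carrying a db namespace attribute, and that gr is a Grounder loaded as in the snippets above:

# Keep only terms from a chosen set of namespaces before building the index,
# so only that subset ever has to be loaded or stored.
namespaces = {"HGNC", "MESH"}  # example namespaces
filtered = {
    norm_text: [t for t in terms if t.db in namespaces]
    for norm_text, terms in gr.entries.items()
}
filtered = {k: v for k, v in filtered.items() if v}  # drop now-empty keys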
Next up is sqlite3. I constructed the db as follows:
import json
import sqlite3

import tqdm

from gilda.api import grounder

gr = grounder.get_grounder()  # default grounding terms, as before

db = 'gilda.db'
conn = sqlite3.connect(db)
cur = conn.cursor()
q = """CREATE TABLE terms (
    norm_text text not null primary key,
    terms text
)"""
cur.execute(q)
for norm_text, terms in tqdm.tqdm(gr.entries.items()):
    q = """INSERT INTO terms (norm_text, terms) VALUES (?, ?)"""
    cur.execute(q, (norm_text, json.dumps([t.to_json() for t in terms])))
q = """CREATE INDEX norm_index ON terms (norm_text);"""
cur.execute(q)
conn.commit()
I implemented a wrapper to make it fit in seamlessly as the Grounder class' entries attribute (at least as far as simple grounding goes):
import json
import sqlite3

from gilda.term import Term


class SqliteEntities:
    def __init__(self, db):
        self.db = db
        self.conn = sqlite3.connect(self.db)

    def get(self, key, default=None):
        # Look up the normalized text and deserialize the JSON-encoded terms
        res = self.conn.execute("SELECT terms FROM terms WHERE norm_text=?", (key,))
        result = res.fetchone()
        if not result:
            return default
        return [Term(**j) for j in json.loads(result[0])]
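A hedged usage sketch, following the Grounder(terms=...) idea mentioned above (the exact keyword may need adjusting to the real Grounder signature):

from gilda.grounder import Grounder

# Sketch only: hand the sqlite-backed lookup to a Grounder in place of the
# in-memory dict of entries.
grounder = Grounder(terms=SqliteEntities('gilda.db'))
matches = grounder.ground('melanoma')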
Resource file size
The resulting gilda.db file is 646 MB, again much larger than the original tsv(.gz).
Startup time
Again, this is around 7 seconds, but since connecting to the sqlite DB is instantaneous, this is all for other initialization steps when instantiating a Grounder (notably loading disambiguation models). The same applies to the shelve evaluation above.
Memory usage
Loading the grounder uses 966 MB (exactly the same as with shelve), and after grounding 86k strings, it still uses about 1.6 GB less RAM than the original and about 425 MB less than shelve.
Performance
Using sqlite is slower, though not by much, compared to shelve and the original approach. It can still ground around 11k strings / s which makes the original 64% faster.
So overall, in this case we still have a much larger resource file, but otherwise we get better memory usage and slower performance compared to shelve. One advantage of sqlite is that it is more portable than shelve. A disadvantage is that further wrapper methods, e.g., values(), would have to be implemented to make the sqlite backend behave like a dict and support all the ways in which the Grounder uses its entries.
The portability of sqlite is nice. Could be used by other languages like R or Julia in the future. I could see terms becoming its own table rather than a serialized JSON field. Would open the door to more advanced SQL queries on the dataset.
Loading the grounder uses 966 MB (exactly the same as with shelve)
Surprised this is so high. Are there other things besides grounding terms that could take up a lot of memory?
Surprised this is so high. Are there other things besides grounding terms that could take up a lot of memory?
Not entirely sure; my guess is the disambiguation models that get loaded. I'm just looking at the overall memory usage of getting/using a Grounder instance, not just using the DB backend in isolation.
my guess is the disambiguation models that are loaded
Looks like they all get loaded at once and stored in Grounder.gilda_disambiguators.
This actually seems like a low hanging fruit for memory and startup optimization. Could use a shelve instead of pickle containing all models. And then just load the disambiguator in a lazy way.
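A rough sketch of the lazy-loading idea; load_disambiguator below is a hypothetical helper standing in for however a single pickled model would be read from disk:

# Sketch only: a dict-like cache that loads each disambiguation model on first
# access instead of loading all of them up front.
class LazyDisambiguators(dict):
    def __init__(self, load_disambiguator):
        super().__init__()
        self._load = load_disambiguator

    def __missing__(self, key):
        model = self._load(key)  # load on first use, then cache
        self[key] = model
        return model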
@bgyori thanks for sqlite3 stats, cool to see them.
Using sqlite is slower, though not by much, compared to shelve and the original approach. It can still ground around 11k strings / s which makes the original 64% faster.
And earlier:
On a benchmark set of ~86k strings, the old grounder is about 38% faster than the one using shelve (~18k vs ~13k groundings per second).
There's a discrepancy between these two statements. Are you sure you meant 64% faster in the sqlite case?
Regarding the original list: leveldb and sparkey are not really an option for gilda's internal index since they require extra system dependencies. A user (like me) might still be able to mint an index using those together with https://github.com/indralab/gilda/issues/95#issuecomment-1148057496. All that said, I would add one more option to the benchmark: lmdb. I like that it has shared memory mapping built in (a sparkey index is loaded in a similar way), which would work nicely in a multi-process setup because a single memory region (with the DB memory-mapped) can be used across many processes (without the extra per-process memory cost, which is really the original problem for us). One caveat right now though is that lmdb by default supports keys up to 511 bytes, and there are 3k keys (in the grounding terms, out of 1.6M) longer than that, so we would need to handle those. At least in my test, the db file ends up:
> du -sh gilda.lmdb/data.mdb*
718M gilda.lmdb/data.mdb
79M gilda.lmdb/data.mdb.gz
Here's some code to create the db:
import json
import lmdb

from gilda.api import grounder

gr = grounder.get_grounder()  # This loads the default grounding terms from the TSV

long_keys = 0
env = lmdb.open("gilda.lmdb", map_size=int(10.147e+9))
with env.begin(write=True) as txn:
    for norm_text, terms in gr.entries.items():
        # NOTE: not handling keys over lmdb's default limit right now
        if len(norm_text) > 500:
            long_keys = long_keys + 1
            continue
        txn.put(norm_text.encode(), json.dumps([t.to_json() for t in terms]).encode())
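And a sketch of the corresponding read path, where opening the environment read-only lets the OS share the memory map across processes (mirroring how the writer above stores JSON-encoded term lists; the key is illustrative):

import json
import lmdb

# Open read-only; the memory-mapped data lives in the OS page cache and is
# shared across processes rather than duplicated per process.
env = lmdb.open("gilda.lmdb", readonly=True, lock=False)
with env.begin() as txn:
    raw = txn.get("melanoma".encode())
    terms = json.loads(raw.decode()) if raw is not None else []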
Btw, regarding the performance hit from switching away from the per-process in-memory representation: at least in our multiprocess setup it's acceptable, because due to the current memory footprint (~1.5 GB/process) we are forced to reduce the number of processes by ~50%. Say we have a VM with 16 CPUs and run parallel gilda grounding using Spark: in theory we should be able to use 16 processes, but we are forced to reduce that to 8, which is already a 50% perf hit (with a 50% smaller memory footprint). If we can use something that is memory efficient/mapped with, say, a ~40% perf hit (I'm assuming a cold cache), 16 * 0.6 = 9.6 effective processes is still better than the current worst case, with a ~90% smaller memory footprint. And this should get better with a warm cache (perf-wise).
The rough performance numbers I gave seem consistent to me: 18k/13k=1.38 (original/shelve groundings per second) and 18k/11k=1.64 (original/sqlite groundings per second), though perhaps this is not the best way to report these percentages.
Since we started this thread I tried to more carefully decompose where memory usage comes from when using Gilda as a whole (not parts of it in isolation).
Having said this, @ravwojdyla, it might be useful for this discussion if you described a bit what you are trying to ground. Is it a restricted set of entities like human genes? Or can the strings you are grounding be small molecules, diseases, etc.? If you are dealing with a restricted set, you can easily create your own custom grounder instance that can potentially be orders of magnitude smaller than what you need for the general default setting that Gilda supports.
The rough performance numbers I gave seem consistent to me: 18k/13k=1.38 (original/shelve groundings per second) and 18k/11k=1.64 (original/sqlite groundings per second), though perhaps this is not the best way to report these percentages.
Ah, I see, my bad, I misunderstood the interpretation of those numbers.
Currently, I would be most comfortable with supporting sqlite as an optional back-end to effectively eliminate this component of memory usage while taking some hit in performance.
Is there any chance you could give lmdb a quick try in your benchmark, please? I'm very curious to see the performance hit there. Also asking because it would be optimal for our multi-process use case, but it could also be a reasonable solution for a lazily loaded single-process setup.
Having said this, @ravwojdyla, it might be useful for this discussion if you described a bit what you are trying to ground. Is it a restricted set of entities like human genes? Or can the strings you are grounding be small molecules, diseases, etc.?
I think we are currently mostly interested in human genes, proteins and diseases. Is there anything I'm missing @dhimmel?
I implemented #96 and #97 which I believe fully resolve this issue, and if used, result in virtually no startup time and minimal memory usage. Unfortunately I don't think I can commit to investigating further options at this time.
@bgyori sounds great, thank you so much!
Unfortunately I don't think I can commit to investigating further options at this time.
Out of curiosity, if we create a PR with an lmdb or similar backend using #97 as a blueprint, would you be interested in merging that, or should we keep it in-house?
Edit: also, do you have the benchmark code used above somewhere, please?
The sqlite solution does the job with standard built-in libraries so I think this will be fine for now, but of course all the code I provided should help you make further adaptations for highly specialized settings. Various benchmark scripts can be found at: https://github.com/indralab/gilda/tree/master/benchmarks that you can adapt to your purposes as needed.
Various benchmark scripts can be found at: https://github.com/indralab/gilda/tree/master/benchmarks that you can adapt to your purposes as needed.
@bgyori thanks, and which benchmark did you use for this issue and the performance stats above? All benchmarks in that directory seem to be at least a month old.
I didn't implement my tests as new benchmark scripts. I just ran memray manually to check usage, and ran the existing BioID benchmark to assess performance.
👋 Hi @bgyori,
I am one of the authors of memray. We are collecting success stories here. If you have a minute, do you mind leaving a short message on how memray helped with this issue? Knowing how we managed to help will help us track trends, target areas for improvement, prioritize new features and development, and identify potential bugs or areas of confusion.
Thanks a lot for your consideration and for helping us improve the profiler :)
Thanks @pablogsal, I wrote a message!
👋 thank you for all your hard work.
It looks like gilda has a pretty high memory footprint, about ~1.5G, most of which comes from load_terms_file. The grounding_terms.tsv file is about ~200M, compressed ~30M. For context, the models loaded via load_gilda_models take up about 256M of memory.
The gilda.get_grounder().entries dict is close to 1.5G in memory, with 1.6 million keys and Term objects as values, all preloaded into memory. The Term object could be made more memory efficient, e.g. a dataclass with slots defined plus more efficient types. All this unfortunately makes it problematic to use in a multi-process environment (e.g. pyspark), and the startup is also slow since we need to create all term entries even if most of them might never be used.
Fortunately this is a solved problem. If I may suggest, maybe the grounding_terms should be distributed as some kind of binary format/index; this could be any kind of (constant) KV store built for read-heavy ops, even the builtin dbm or shelve, but also specialised leveldb, sparkey, sqlite3 etc. This way the grounding_terms could be shared-memory loaded. What do you think?
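To illustrate the dataclass-with-slots suggestion above, a minimal sketch (the field names are illustrative rather than Gilda's actual Term signature; dataclass(slots=True) needs Python 3.10+, on older versions __slots__ can be set manually):

from dataclasses import dataclass

# Illustrative only: slots avoid a per-instance __dict__, which adds up across
# ~1.6 million entries' worth of Term objects.
@dataclass(slots=True)
class SlimTerm:
    norm_text: str
    text: str
    db: str
    id: str
    status: str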