linkml / linkml-runtime

Runtime support for linkml generated models
https://linkml.io/linkml/
Creative Commons Zero v1.0 Universal

Update Cache To Support > 128 Entries #309

Closed: timothy-trinidad-ps closed this pull request 3 months ago

timothy-trinidad-ps commented 4 months ago

Fixes https://github.com/linkml/linkml/issues/1973.

I'm starting with a really naive fix of just removing the entry limit for the lru_cache, but I'm open to pivoting to other solutions.
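For context, `functools.lru_cache` defaults to `maxsize=128`, and passing `maxsize=None` removes the cap entirely. A minimal sketch of the kind of change involved (the class and method names below are placeholders, not the actual schemaview.py code):

```python
from functools import lru_cache

class SchemaViewLike:
    # Placeholder method standing in for an expensive schemaview.py lookup.
    # Before: @lru_cache() silently caps the cache at functools' default of
    # 128 entries, so schemas with more than 128 elements start evicting.
    # After: maxsize=None lets the cache grow with the schema.
    @lru_cache(maxsize=None)
    def induced_element(self, element_name: str) -> str:
        return element_name.upper()  # stand-in for real derivation logic
```

On Python 3.9+, `functools.cache` is equivalent shorthand for `lru_cache(maxsize=None)`.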

With this change, running gen-doc on the following YAML file drops from over 3 minutes to about 7 seconds.

test.yaml.txt

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.88%. Comparing base (27b9158) to head (0be1913). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #309   +/-   ##
=======================================
  Coverage   62.88%   62.88%
=======================================
  Files          62       62
  Lines        8528     8528
  Branches     2436     2436
=======================================
  Hits         5363     5363
  Misses       2554     2554
  Partials      611      611
```

:umbrella: View full report in Codecov by Sentry.

sierra-moxon commented 4 months ago

@timothy-trinidad-ps - looking at the lru_cache docs, it seems like the default is 128 items in the cache and this PR takes it to limitless items in the cache.

Do you think it would be helpful to take this item-number up in increments of 100 or so until we see a significant speedup without going limitless?

timothy-trinidad-ps commented 4 months ago

I'm open to it, but any other number seemed just as arbitrary to me as 128. If I up it to 256 I know that we will run into this problem again in the next 6 months (we're in the process of annotating our database tables).

The obvious downside to this would be unbounded memory growth for very large schemas, but I'm not entirely sure how to measure that. I think memory usage growing in proportion to the schema size is more expected behavior than an arbitrary threshold after which the program effectively stops working.
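One possible way to get a rough number is to wrap `tracemalloc` around the SchemaView calls that gen-doc exercises; this is a sketch only, `test.yaml` stands in for the attached schema, and it captures Python-allocated memory rather than whole-process RSS:

```python
import tracemalloc
from linkml_runtime.utils.schemaview import SchemaView

# Rough sketch: measure peak Python-allocated memory while exercising the
# lru_cache-decorated SchemaView methods across every class in the schema.
# "test.yaml" is a placeholder for the attached schema file.
tracemalloc.start()
sv = SchemaView("test.yaml")
for class_name in sv.all_classes():
    sv.class_induced_slots(class_name)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / (1024 * 1024):.1f} MiB")
```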

sierra-moxon commented 4 months ago

I can agree with that :) - is there anything we can do to prevent a very large schema from running into problems with too big a cache? Said another way, what is the largest schema size you've tested this on so far?

timothy-trinidad-ps commented 4 months ago

Just this one, with 129 entries. I can automate creating a schema with 1024 tables, but the specific use case I'm testing is gen-doc and the functions called by its dependent templates. What I would need help with is figuring out how to test other schemaview.py use cases for memory usage.
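A rough sketch of how such a synthetic schema could be generated (the schema id, prefix, and slot layout are made up for illustration):

```python
import yaml

def make_schema(n_classes: int = 1024, slots_per_class: int = 5) -> dict:
    """Build a synthetic LinkML schema dict with n_classes classes."""
    schema = {
        "id": "https://example.org/synthetic",  # placeholder id
        "name": "synthetic",
        "prefixes": {"linkml": "https://w3id.org/linkml/"},
        "imports": ["linkml:types"],
        "default_range": "string",
        "classes": {},
        "slots": {},
    }
    for i in range(n_classes):
        slot_names = [f"class{i}_slot{j}" for j in range(slots_per_class)]
        schema["classes"][f"Class{i}"] = {"slots": slot_names}
        for slot_name in slot_names:
            schema["slots"][slot_name] = {"range": "string"}
    return schema

with open("synthetic.yaml", "w") as f:
    yaml.safe_dump(make_schema(), f, sort_keys=False)
```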

timothy-trinidad-ps commented 4 months ago

I just ran a benchmark for a file with 2,048 classes and 10,240 slots - it took about 4 minutes with a peak of about 450 MB of memory (at least for the gen-doc use case). test.yaml.txt
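For anyone wanting to reproduce a benchmark like this, here is a sketch of timing a gen-doc run from Python; the `-d` output-directory flag is an assumption and may differ from the installed gen-doc, and `ru_maxrss` units vary by platform (kilobytes on Linux, bytes on macOS):

```python
import resource  # Unix-only
import subprocess
import time

# Time a gen-doc run on the schema and report the peak RSS of child processes.
start = time.perf_counter()
subprocess.run(["gen-doc", "test.yaml", "-d", "docs_out"], check=True)
elapsed = time.perf_counter() - start

peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"wall time: {elapsed:.1f}s, child peak RSS (platform units): {peak_rss}")
```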
