medkit-lib / medkit

Toolkit for a learning health system
https://medkit-lib.org/
MIT License

limit in the size of the simstring db #52

Closed: coulet closed this issue 3 months ago

coulet commented 4 months ago

Dear medkit team, I noticed an issue with the size of the MRCONSO.RRF file when using the UMLS Matcher. It works when the file is small, but I get an error with large ones, and MRCONSO can be very large. In my test, it works with a 50 MB file (500,000 lines) but not with a 107 MB file (1,000,000 lines). It seems to be an issue when building the simstring database. Could we increase this size limit easily? Thanks++ Adrien

I get: HASH: Out of overflow pages. Increase page size

[...]

File ~/miniconda3/envs/medkit/lib/python3.8/site-packages/medkit/text/ner/umls_matcher.py:211, in UMLSMatcher.__init__(self, umls_dir, cache_dir, language, threshold, min_length, max_length, similarity, lowercase, normalize_unicode, spacy_tokenization, semgroups, blacklist, same_beginning, output_labels_by_semgroup, attrs_to_copy, name, uid)
    201 logger.info("Building simstring database from UMLS terms, this may take a while")
    202 rules = self._build_rules(
    203     umls_dir,
    204     language,
    (...)
    208     labels_by_semgroup,
    209 )
--> 211 build_simstring_matcher_databases(simstring_db_file, rules_db_file, rules)
    213 with cache_params_file.open(mode="w") as fp:
    214     yaml.safe_dump(dataclasses.asdict(cache_params), fp)

File ~/miniconda3/envs/medkit/lib/python3.8/site-packages/medkit/text/ner/_base_simstring_matcher.py:416, in build_simstring_matcher_databases(simstring_db_file, rules_db_file, rules)
    414 rules_db[term_to_match].append(rule)
    415 simstring_db_writer.close()
--> 416 rules_db.sync()
    417 rules_db.close()

File ~/miniconda3/envs/medkit/lib/python3.8/shelve.py:168, in Shelf.sync(self)
    166 self.writeback = False
    167 for key, entry in self.cache.items():
--> 168     self[key] = entry
    169 self.writeback = True
    170 self.cache = {}

File ~/miniconda3/envs/medkit/lib/python3.8/shelve.py:125, in Shelf.__setitem__(self, key, value)
    123 p = Pickler(f, self._protocol)
    124 p.dump(value)
--> 125 self.dict[key.encode(self.keyencoding)] = f.getvalue()

error: cannot add item to database

ghisvail commented 4 months ago

The message "HASH: Out of overflow pages. Increase page size" is often associated with out-of-memory errors.

Did you notice a significant increase of RAM usage whilst the simstring db is being built?

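Based on a quick read of the affected code, I have a feeling we are being hit by the case explained in the last paragraph of the shelve.open documentation: with writeback=True, every accessed entry is cached in memory and only written back on sync() or close(). That part needs to be refactored to avoid mutability patterns as much as possible.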
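To illustrate the shelve caveat mentioned above, here is a minimal sketch contrasting the in-place mutation pattern visible in the traceback with a variant that assigns each entry exactly once. The function names and the (term, rule) pairs are hypothetical and are not taken from medkit; this only shows the standard-library behaviour and is not a confirmed fix for the error reported here.

```python
import shelve
from collections import defaultdict


def build_with_writeback(db_path, pairs):
    """Mutate shelf entries in place; requires writeback=True to persist."""
    with shelve.open(str(db_path), flag="n", writeback=True) as db:
        for term, rule in pairs:
            if term not in db:
                db[term] = []
            # The list lives in the shelf's in-memory cache; the mutation
            # is only written to disk on sync() or close().
            db[term].append(rule)


def build_without_writeback(db_path, pairs):
    """Group rules first, then assign each key to the shelf once."""
    # Assumption: the grouped rules fit in memory, as the input rules do.
    grouped = defaultdict(list)
    for term, rule in pairs:
        grouped[term].append(rule)
    with shelve.open(str(db_path), flag="n") as db:
        for term, rules in grouped.items():
            # Plain assignment: pickled and written immediately, no cache.
            db[term] = rules


def read_all(db_path):
    """Load the whole shelf into a plain dict (for small demos only)."""
    with shelve.open(str(db_path), flag="r") as db:
        return {key: db[key] for key in db}
```

Both functions should produce the same stored content. The second one avoids the shelf-level cache and the bulk write-back at sync() time, at the cost of grouping the rules in a regular dictionary beforehand.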