Closed (coulet closed this issue 3 months ago)
"HASH: Out of overflow pages. Increase page size" is often associated with out-of-memory errors.
Did you notice a significant increase in RAM usage while the simstring db was being built?
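Based on a quick read of the impacted code, I have a feeling we are being hit by the case explained in the last paragraph of the shelve.open documentation: with writeback enabled, every accessed entry is cached in memory and only written back on sync() or close(), which matches the traceback failing inside Shelf.sync(). That part needs to be refactored to avoid mutability patterns as much as possible.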
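To make that concrete, here is a minimal sketch of a read-modify-write pattern that does not rely on writeback=True. It is not the medkit implementation: `append_rule`, `build_rules_db` and the shape of `rules` (an iterable of `(term, rule)` pairs) are hypothetical names chosen for illustration, and I have not checked whether this alone avoids the "Out of overflow pages" error, which may also depend on the dbm backend in use.

```python
import shelve


def append_rule(db, term, rule):
    """Append a rule for a term without mutating a cached shelf entry.

    Hypothetical helper: fetch a copy of the stored list, modify the copy,
    then assign it back so the shelf pickles and stores it immediately.
    """
    rules_for_term = db.get(term, [])
    rules_for_term.append(rule)
    db[term] = rules_for_term


def build_rules_db(db_path, rules):
    """Build a shelf mapping each term to its list of rules.

    Assumes `rules` is an iterable of (term, rule) pairs, which is an
    illustrative simplification.
    """
    # writeback defaults to False, so the shelf keeps no in-memory cache
    # of accessed entries and there is nothing to flush in sync().
    with shelve.open(str(db_path), flag="n") as db:
        for term, rule in rules:
            append_rule(db, term, rule)
```

The trade-off is more reads and writes per term in exchange for bounded memory use, since each entry is written at assignment time instead of accumulating in the shelf cache until sync().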
Dear medkit team, I noticed an issue with the size of the MRCONSO.RRF file when using the UMLS Matcher. It works fine when the file is small, but I get an error with large ones, and MRCONSO can be very large. In my tests, it works with a 50 MB file (500,000 lines), but not with a 107 MB file (1,000,000 lines). The issue seems to occur when building the simstring database. Could we easily increase this size limit? Thanks++ Adrien
I get: HASH: Out of overflow pages. Increase page size
[...]
File ~/miniconda3/envs/medkit/lib/python3.8/site-packages/medkit/text/ner/umls_matcher.py:211, in UMLSMatcher.__init__(self, umls_dir, cache_dir, language, threshold, min_length, max_length, similarity, lowercase, normalize_unicode, spacy_tokenization, semgroups, blacklist, same_beginning, output_labels_by_semgroup, attrs_to_copy, name, uid)
    201 logger.info("Building simstring database from UMLS terms, this may take a while")
    202 rules = self._build_rules(
    203     umls_dir,
    204     language,
   (...)
    208     labels_by_semgroup,
    209 )
--> 211 build_simstring_matcher_databases(simstring_db_file, rules_db_file, rules)
    213 with cache_params_file.open(mode="w") as fp:
    214     yaml.safe_dump(dataclasses.asdict(cache_params), fp)

File ~/miniconda3/envs/medkit/lib/python3.8/site-packages/medkit/text/ner/_base_simstring_matcher.py:416, in build_simstring_matcher_databases(simstring_db_file, rules_db_file, rules)
    414 rules_db[term_to_match].append(rule)
    415 simstring_db_writer.close()
--> 416 rules_db.sync()
    417 rules_db.close()

File ~/miniconda3/envs/medkit/lib/python3.8/shelve.py:168, in Shelf.sync(self)
    166 self.writeback = False
    167 for key, entry in self.cache.items():
--> 168     self[key] = entry
    169 self.writeback = True
    170 self.cache = {}

File ~/miniconda3/envs/medkit/lib/python3.8/shelve.py:125, in Shelf.__setitem__(self, key, value)
    123 p = Pickler(f, self._protocol)
    124 p.dump(value)
--> 125 self.dict[key.encode(self.keyencoding)] = f.getvalue()
error: cannot add item to database