Closed fmunch closed 4 years ago
Thanks for the PR. When I was originally writing this, I was generally trying to find a reasonable balance between memory consumption and time. I fear caching in this way may push things too far in the memory consumption direction (though I haven't measured things, so it may not be anywhere near as bad as I think).
Perhaps one other approach to dealing with this problem is to adjust yieldByCodepoints
to re-use the the current file as much as possible. Whilst that may feel like it won't help that much, I suspect many blocks have characters clustered quite close together, so you'd see some sort of improvement. Maybe something like this?
/**
* @param Codepoint\Collection $codepoints
* @return \Generator
*/
private function yieldByCodepoints(Codepoint\Collection $codepoints)
{
$file = null;
$characters = [];
foreach ($codepoints as $codepoint) {
try {
$required = $this->getFileByCodepoint($codepoint);
} catch (CharacterNotFoundException $e) {
// Missing characters are skipped
continue;
}
if ($required !== $file) {
$file = $required;
$characters = $required->read();
}
$value = $codepoint->getValue();
if (array_key_exists($value, $characters)) {
yield $this->serializer->unserialize($characters[$value]);
}
}
}
If you wouldn't mind experimenting a bit around that, that would be much appreciated!
On further consideration, perhaps having using an LRU cache would work well also.. that way we can limit memory consumption, but still reap the benefits to some degree.
Thank you for the quick response.
You are right, caching everything is too aggressive. But a fixed size cache should work pretty well. Even a size of 1 covers most use cases. Traversing Database::fromDisk()->onlyCharacters()
stays below 35 MB peak memory usage.
Sorry for the delay with this. Merging now - and I'll push some changes to master after that covert this into an LRU style cache, since it's not particularly difficult to do & prevents any chance of excessive memory usage :+1:
When reading all characters from a block, the method
FileRepository::yieldByCodepoints
reads one file for every character in the block. The code below takes 23s. Adding a file cache gets this down to 0.29s.