Improving FileRepository::yieldByCodepoints performances

fmunch commented 4 years ago

When reading all characters from a block, the method FileRepository::yieldByCodepoints reads one file for every character in the block. The code below takes 23s. Adding a file cache gets this down to 0.29s.

Database::fromDisk()->getByBlock(Block::fromValue('Cuneiform'))->getCharacters()
    ->traverseWith(function (Character $character) {
        echo $character->getGeneralProperties()->getNames()->getPrimary(), PHP_EOL;
    });

nick-jones commented 4 years ago

Thanks for the PR. When I was originally writing this, I was generally trying to find a reasonable balance between memory consumption and time. I fear caching in this way may push things too far in the memory consumption direction (though I haven't measured things, so it may not be anywhere near as bad as I think).

Perhaps one other approach to dealing with this problem is to adjust yieldByCodepoints to re-use the the current file as much as possible. Whilst that may feel like it won't help that much, I suspect many blocks have characters clustered quite close together, so you'd see some sort of improvement. Maybe something like this?

    /**
     * @param Codepoint\Collection $codepoints
     * @return \Generator
     */
    private function yieldByCodepoints(Codepoint\Collection $codepoints)
    {
        $file = null;
        $characters = [];

        foreach ($codepoints as $codepoint) {
            try {
                $required = $this->getFileByCodepoint($codepoint);
            } catch (CharacterNotFoundException $e) {
                // Missing characters are skipped
                continue;
            }

            if ($required !== $file) {
                $file = $required;
                $characters = $required->read();
            }

            $value = $codepoint->getValue();
            if (array_key_exists($value, $characters)) {
                yield $this->serializer->unserialize($characters[$value]);
            }
        }
    }

If you wouldn't mind experimenting a bit around that, that would be much appreciated!

nick-jones commented 4 years ago

On further consideration, perhaps having using an LRU cache would work well also.. that way we can limit memory consumption, but still reap the benefits to some degree.

fmunch commented 4 years ago

Thank you for the quick response.

You are right, caching everything is too aggressive. But a fixed size cache should work pretty well. Even a size of 1 covers most use cases. Traversing Database::fromDisk()->onlyCharacters() stays below 35 MB peak memory usage.

nick-jones commented 4 years ago

Sorry for the delay with this. Merging now - and I'll push some changes to master after that covert this into an LRU style cache, since it's not particularly difficult to do & prevents any chance of excessive memory usage :+1:

nick-jones / php-ucd

Improving FileRepository::yieldByCodepoints performances #3