loophp / collection

A (memory) friendly, easy, lazy and modular collection class.
https://loophp-collection.rtfd.io/
MIT License

Memory size exhausted for large collections #332

Closed jazithedev closed 7 months ago

jazithedev commented 9 months ago

Hello 👋.

Let's assume I have an entity:

final readonly class MyEntity
{
    public function __construct(
        public int $id,
    ) {
    }
}

When operating on collections with a high number of elements (1000 to 2000), the system crashes because the memory limit is exhausted. The case can easily be simulated with a PHPUnit test:

public function testCollection(): void
{
    $input = $this->findAllFromRepository()->shuffle();

    self::assertEquals(2000, $input->count());
}

// A separate method that simulates fetching from a repository
public function findAllFromRepository(): Collection
{
    $input = [];

    for ($i = 0; $i < 2000; ++$i) {
        $input[] = new MyEntity(id: $i);
    }

    return Collection::fromIterable($input);
}

On my side the limit is set to 128 MB:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 4096 bytes)

The crash happens in most test runs (though not all of them). It's triggered by either $collection->count() or $collection->all(). When using PHP's shuffle(...) instead, the test passes without crashing:

public function testArrayShuffle(): void
{
    $input = $this->findAllFromRepository()->all();
    shuffle($input);
    $collection = Collection::fromIterable($input);

    self::assertEquals(2000, $collection->count());
}

Everything is also fine when working with arrays only (not using Collection at all).

What could I do in such a case? Isn't there some kind of memory leak? This issue came from a real-world example of mine. Of course, that code performs a few more operations on the collection, but even something as simple as the above fails. I'd rather stay with collections to have a nice API for working with my data, but I'm not sure if I can.

Thanks in advance for any help 🙏.

drupol commented 9 months ago

Hello,

In this scenario, it appears that the problem may not be entirely due to the Collection itself.

When using operations such as shuffle, count and all, the entire collection needs to be evaluated all at once. This approach negates the advantages of lazy evaluation, which is likely the primary cause of the Fatal error you're encountering.
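To illustrate that distinction with plain PHP generators (a simplified stand-in for the Collection pipeline, not the library's actual code): lazy consumption pulls one item at a time, while a "terminal" operation such as counting has to drain the whole sequence.

```php
<?php

declare(strict_types=1);

// A lazy source: items are produced on demand, never all at once.
function numbers(int $size): Generator
{
    for ($i = 0; $i < $size; ++$i) {
        yield $i;
    }
}

// Lazy consumption: only the first three items are ever produced,
// no matter how large the source claims to be.
$firstThree = [];
foreach (numbers(1_000_000) as $n) {
    $firstThree[] = $n;
    if (count($firstThree) === 3) {
        break;
    }
}

// Eager consumption: iterator_count() has to walk every item, and
// iterator_to_array() would additionally materialize them all in memory.
$total = iterator_count(numbers(1000));
```

A shuffle is necessarily in the second category: it cannot emit its first item before it has seen the last one.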

After making some tests:

<?php

declare(strict_types=1);

namespace App;

use loophp\collection\Collection;

include __DIR__ . '/vendor/autoload.php';

class MyEntity {
    public function __construct(
        public int $id,
    ) {
    }
}

$input = [];

for ($i = 0; $i < 2000; ++$i) {
    $input[] = new MyEntity(id: $i);
}

$c = Collection::fromIterable($input)->shuffle();

foreach ($c as $v) {
    var_dump($v);
}

It looks like this also fails.

I guess that's simply because 128M is not enough?

Efficiently shuffling a collection requires keeping track of items in an array (source: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php); I guess that's why we have this behaviour. In this case, you have no other choice than raising the memory limit.

Using the PHP shuffle function requires an array, which is exactly what we try to avoid when using Collection. The implementation of shuffle in PHP is unknown to me, but the one I implemented in RandomIterableAggregate is Fisher-Yates.
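For reference, a minimal sketch of the classic Fisher-Yates shuffle (not the library's implementation): the point is that every variant of the algorithm needs random access to all items at once, so any lazy source has to be buffered into an array first. The memory cost is inherent to shuffling, not an accident of the library.

```php
<?php

declare(strict_types=1);

// Classic in-place Fisher-Yates: walk backwards, swapping each slot
// with a uniformly chosen slot at or before it.
function fisherYates(array $items): array
{
    for ($i = count($items) - 1; $i > 0; --$i) {
        $j = random_int(0, $i);                              // random index in [0, $i]
        [$items[$i], $items[$j]] = [$items[$j], $items[$i]]; // swap
    }

    return $items;
}

$shuffled = fisherYates(range(1, 10));
sort($shuffled); // sorting back shows it is a permutation of the input
```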

jazithedev commented 9 months ago

I'm not sure increasing the memory limit is the best solution. For instance, in your code example we could exchange this line:

$c = Collection::fromIterable($input)->shuffle();

for

$c = Collection::fromIterable($input);
$toShuffle = $c->all();
shuffle($toShuffle);
$c = Collection::fromIterable($toShuffle);

and it will work without reaching the memory limit (assuming the last three lines are encapsulated in a method of a class, and $c is a parameter of that method).

I've made this change in my production code, and it confirmed my guess. It seems the shuffle(...) implementation from RandomIterableAggregate somehow uses a lot of unneeded memory 🤔.
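The workaround can be captured in a small helper; `shuffleEagerly` is a hypothetical name, and this sketch is stripped of the Collection types so it runs standalone. In real code the returned array would be handed back to Collection::fromIterable():

```php
<?php

declare(strict_types=1);

// Hypothetical helper for the workaround: buffer the iterable once,
// shuffle with PHP's native function, and return the array so the caller
// can re-wrap it (e.g. with Collection::fromIterable()).
function shuffleEagerly(iterable $items): array
{
    $buffer = is_array($items) ? $items : iterator_to_array($items, false);
    shuffle($buffer);

    return $buffer;
}

$shuffled = shuffleEagerly(new ArrayIterator([1, 2, 3, 4, 5]));
sort($shuffled); // same items, order restored for the check
```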

drupol commented 9 months ago

Do you see a better way to fix RandomIterableAggregate? I'm definitely willing to improve it if you have ideas.

jazithedev commented 9 months ago

Honestly, I didn't check. I only recently found out what exactly was crashing my production code and how to fix it 😅.

drupol commented 9 months ago

It would be nice to have a bit of help on this. I did my best implementing the Iterator aggregates, but there might be room for improvement.

AlexandruGG commented 8 months ago

I had a little look at this as well 🎲 .

Indeed, it seems that in comparison with PHP's shuffle() function, the one from Collection supports far fewer elements before running into the memory issue. However, even with just the PHP shuffle on a plain array, you still run into memory trouble around 1.5 million elements, so some limit is to be expected.
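Since the exact threshold depends on the PHP version, the element type, and the memory_limit setting, it is worth measuring rather than assuming. A rough sketch (the 100_000 sample size here is arbitrary):

```php
<?php

declare(strict_types=1);

// Rough measurement of what buffering + shuffling costs at a given size.
// Swap in your real element type (objects cost far more than ints) and
// scale $size toward your data to find where your memory_limit bites.
$size = 100_000;

$before = memory_get_usage();
$input = range(0, $size - 1); // the buffer that shuffle() operates on
shuffle($input);
$after = memory_get_usage();

printf("buffering %d ints costs ~%.1f MB\n", $size, ($after - $before) / 1024 ** 2);
```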

Regarding the implementation, from what I've read the shuffle() function also uses the Fisher-Yates algorithm.

One potential cause could be that the Collection implementation uses recursion: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php#L61.

The PHP implementation is a bit more complex but doesn't seem to use recursion: https://github.com/php/php-src/blob/765382eeeaffee74e0f0f677ccf5b3504abe337a/ext/standard/array.c#L3107.

I see a couple of routes to take:

a) refactor the RandomIterableAggregate to not use recursion;
b) refactor the RandomIterableAggregate to use the PHP shuffle() function internally. I noticed this is what other PHP collection implementations do; even for "lazy" collections they transform to an array, use the function, then transform back to a Collection.
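To make route (a) concrete, here is a hedged sketch of a recursion-free variant (not the actual RandomIterableAggregate code): the iterable is still buffered once, which a fair shuffle cannot avoid, but the shuffled items are then yielded lazily with a plain loop instead of recursive calls.

```php
<?php

declare(strict_types=1);

// Non-recursive lazy shuffle: buffer once, then repeatedly pick a random
// survivor, yield it, and move the last survivor into its slot so the
// live range of the buffer shrinks by one each step (Fisher-Yates order).
function shuffledIterator(iterable $iterable): Generator
{
    $buffer = is_array($iterable)
        ? array_values($iterable)
        : iterator_to_array($iterable, false);

    for ($i = count($buffer) - 1; $i >= 0; --$i) {
        $j = random_int(0, $i);
        yield $buffer[$j];
        $buffer[$j] = $buffer[$i]; // fill the hole with the last live item
        unset($buffer[$i]);        // shrink the live range
    }
}

$out = iterator_to_array(shuffledIterator(range(1, 100)), false);
```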

@drupol what do you think?

Also @jazithedev would you be up for improving the implementation with the above if we agree on the approach?

drupol commented 8 months ago

Hi Alex!

Good to see you around and have your POV on this.

a) I'm aware that the memory issue most probably comes from the recursion. I've already tried to look for a replacement, but as of today I haven't found a better way to do it.

b) I'm trying to avoid converting things back and forth (this is nonsense in a lazy collection library), but I have to admit that for a valid shuffle there's no other option.

TBH, at this point, I don't know yet what is the best path here.

I guess using shuffle with an array would be the faster solution, since we wouldn't do the shuffling in userland; maybe this is worth a try... are you willing to help with this?

drupol commented 8 months ago

I've just published a draft PR refactoring the RandomIterableAggregate and removing recursion, while still implementing FisherYates.

Using the following test:

<?php

declare(strict_types=1);

namespace App;

use loophp\collection\Collection;

include __DIR__ . '/vendor/autoload.php';

class MyEntity
{
    public function __construct(
        public int $id,
    ) {}
}

$input = function (int $size) {
    for ($i = 0; $size > $i; ++$i) {
        yield new MyEntity(id: $i);
    }
};

$size = 100000;
$c = Collection::fromCallable($input, [$size])->shuffle();

foreach ($c as $v) {}

That's based on the one you posted in the first post. I can go up to 80000 items! (Using 128 MB and PHP 8.2.) That's already a huge improvement!

However, could you please have a look and let me know what you think of it?

jazithedev commented 8 months ago

I'll try to check all out tomorrow 🙂.

AlexandruGG commented 8 months ago

> I've just published a draft PR refactoring the RandomIterableAggregate and removing recursion, while still implementing FisherYates. [...]

Hey @drupol,

Nice work coming up with that implementation! It sounds like the performance gains would definitely be desirable. If you don't see any downsides, we should definitely go with it.

An additional approach would be to enhance the Collection::shuffle method with a parameter that allows using PHP's shuffle function instead, either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to an array and apply the function on that.

drupol commented 8 months ago

> Nice work coming up with that implementation! It sounds like the performance gains would definitely be desirable. If you don't see any downsides, we should definitely go with it.

OK ! I'll push the PR further ASAP.

> An additional approach would be to enhance the Collection::shuffle method with a parameter that allows using PHP's shuffle function instead, either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to an array and apply the function on that.

Your suggestion is precisely what I expected, and I believe you're already familiar with my stance on such modifications. While integrating this feature might improve performance, it raises a question: if we do this for shuffle, why not apply the same logic to other operations like map, filter, etc.? This is a path I'm hesitant to go down. Implementing these changes for one method might set a precedent for others, leading us into a cycle of continuous modifications, exceptions, and added complexity, which I believe is best avoided. Using loophp/collection does affect performance, but its fully lazy nature is a trade-off we've consciously chosen. Adding such features would definitely increase the complexity of our codebase, something I'm actively working to reduce at all costs. I hope you understand my position on this matter. While I'm open to further discussion, my current inclination is to maintain our existing approach without introducing these kinds of specific alterations.

drupol commented 8 months ago

It should be fixed on the master branch; please feel free to test and report feedback.

jazithedev commented 8 months ago

Hey @drupol. I've tested the newest branch, and indeed it no longer runs out of memory for bigger collections 🙂. I apologize I could not help more 🙁.

drupol commented 8 months ago

That's so great to see! Going to cut a bugfix release this morning.

drupol commented 7 months ago

Closing the issue: version 7.5.1 has been released and fixes the issue.

Thank you all for your participation!