schmittjoh / JMSSerializerBundle

Easily serialize and deserialize data of any complexity (supports XML, JSON, YAML)
http://jmsyst.com/bundles/JMSSerializerBundle
MIT License

JMS Serializer performance issues with more than 10000 entries #762

Open malcomhamelin opened 5 years ago

malcomhamelin commented 5 years ago

Currently I'm building a PHP command that can update my ElasticSearch indices.

But I've noticed that serializing entities takes far too long when my array holds more than 10000 of them. I expected the time to grow linearly, but 6k and 9k entities both take about a minute (not much difference between them), while past 10k it slows down to the point of taking up to 10 minutes.

...
                // we iterate over the entities previously fetched from the SQL database
                foreach ($entities as $index_name => $entity_array) {
                    $underscoreClassName = $this->toUnderscore($index_name); // Elasticsearch expects underscored names
                    $camelcaseClassName = $this->toCamelCase($index_name); // SQL expects camelCase names

                    // we get the serialization groups for each index from the config file
                    $groups = $indexesInfos[$underscoreClassName]['types'][$underscoreClassName]['serializer']['groups'];

                    foreach ($entity_array as $entity) {
                        // each entity is serialized as a JSON string
                        $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
                        // each serialized entity is wrapped in an Elastica document
                        $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);
                    }
                }
...

There's a whole class around that, but this is the part taking most of the time.

I can understand that serializing is a heavy operation and that it takes time, but why is there next to no difference between 6, 7, 8 or 9k, while above 10k entities it just takes a lot longer?

PS: for reference, I've asked the same thing on StackOverflow.

goetas commented 5 years ago

The code you have posted mentions "entities"; are you using Doctrine?

If you are using Doctrine, I do not see any memory cleanup in the loops. This means that memory usage goes up on each iteration, since Doctrine has to instantiate and keep track of all the visited objects, so you will inevitably see slowdowns.

goetas commented 5 years ago

Here is some info on how to clean up memory in Doctrine when dealing with big datasets: https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/batch-processing.html
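A minimal sketch of that batch-processing pattern (the helper name and stub callbacks are hypothetical, not from this thread): work is done in fixed-size chunks, and a cleanup callback runs after each chunk. With Doctrine, that callback would call `$em->flush()` and `$em->clear()` so managed entities can be garbage-collected and memory stays flat instead of growing with every iteration.

```php
<?php
// Hypothetical helper: process items in fixed-size chunks and run a cleanup
// callback after each chunk. With Doctrine, $afterChunk would be something
// like fn () => $em->clear() (after flushing), detaching managed entities.
function processInChunks(array $items, int $chunkSize, callable $process, callable $afterChunk): array
{
    $results = [];
    foreach (array_chunk($items, $chunkSize) as $chunk) {
        foreach ($chunk as $item) {
            $results[] = $process($item);
        }
        $afterChunk(); // e.g. $em->flush(); $em->clear(); with Doctrine
    }
    return $results;
}

// Usage with stub callbacks (no Doctrine involved):
$cleanups = 0;
$out = processInChunks(range(1, 10), 4, fn ($n) => $n * 2, function () use (&$cleanups) {
    $cleanups++;
});
// $out is [2, 4, ..., 20]; $cleanups is 3 (chunks of 4, 4 and 2 items)
```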

afraca commented 4 months ago

@goetas We're building a simple DTO for serialization to XML, which has an array of about 4k objects in it. So no Doctrine involved. Serialization takes 2 minutes, and it doesn't seem to scale linearly. I have also attached a flamegraph produced by Xdebug (this was for 2k or 4k products, not sure anymore).


Products | Serialize (s) | Serialize (MiB)
-- | -- | --
520 | 0.8 | 68
976 | 2.5 | 76
1964 | 13 | 94
3961 | 115 | 131

edit: The structure of the DTO also affects performance, of course: the number of properties on it, and the depth. I'm not sure how much I'm allowed to share here, but the array items have a handful of properties each, and the depth (which you can derive from the flamegraph) is about 4 or 5 levels.
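As a quick sanity check on the table above (this calculation is mine, not from the thread), one can estimate the scaling exponent k in time ≈ c·nᵏ from consecutive rows: k near 1 would be linear, near 2 quadratic. The measured numbers come out worse than quadratic at the high end.

```php
<?php
// Estimate the scaling exponent k between consecutive rows of the table
// above, via k = log(t1/t0) / log(n1/n0).
$rows = [
    [520, 0.8],
    [976, 2.5],
    [1964, 13.0],
    [3961, 115.0],
];

$exponents = [];
for ($i = 1; $i < count($rows); $i++) {
    [$n0, $t0] = $rows[$i - 1];
    [$n1, $t1] = $rows[$i];
    $exponents[] = log($t1 / $t0) / log($n1 / $n0);
}

printf("estimated exponents: %s\n", implode(', ', array_map(fn ($k) => round($k, 2), $exponents)));
// prints: estimated exponents: 1.81, 2.36, 3.11
```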

scyzoryck commented 4 months ago

@afraca - in general that seems like a pretty big set of data 😅 If you find any possible improvements, feel free to create an MR. From my side: please check whether you can disable some features - for example, remove event listeners you are not using, exclusion strategies, etc. Also, using custom handlers to serialize some of the classes might improve performance.
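A sketch of the custom-handler suggestion, with a hypothetical `Product` class (nothing here is from the thread): the flattening logic is kept as a plain callable, and the commented-out block shows how it would be registered with the standalone jms/serializer's `SerializerBuilder` (assuming the 2.x API).

```php
<?php
// Hypothetical DTO for illustration only.
final class Product
{
    public function __construct(public int $id, public string $name) {}
}

// The mapping a custom handler would perform, written as plain PHP so it can
// be exercised without the library installed.
$productToArray = fn (Product $p): array => ['id' => $p->id, 'name' => $p->name];

/*
// With jms/serializer installed, the handler would be registered like this,
// bypassing metadata-driven property traversal for Product:
use JMS\Serializer\SerializerBuilder;
use JMS\Serializer\Handler\HandlerRegistry;
use JMS\Serializer\GraphNavigatorInterface;

$serializer = SerializerBuilder::create()
    ->configureHandlers(function (HandlerRegistry $registry) use ($productToArray) {
        $registry->registerHandler(
            GraphNavigatorInterface::DIRECTION_SERIALIZATION,
            Product::class,
            'json',
            fn ($visitor, Product $p, array $type, $context) => $productToArray($p)
        );
    })
    ->build();
*/
```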

afraca commented 4 months ago

Hey @scyzoryck, thanks for replying! I only now realized this is the bundle repository, not the actual serializer GitHub repo. Sorry about that! If you want I can open a bug there, or we can continue here.

I find the quadratic behaviour the most interesting part. If it grew linearly it would be fine: if 100 products take 1 second and 1k products take 10 seconds, that's fine with me; I can schedule a job in a queue, and anything under 10 minutes is acceptable. But it grows too quickly. (For all our products it would take more than a day...) That implies it's scanning something too often somewhere.
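One generic way to confirm that suspicion (a sketch of mine, not code from the thread): time the same operation at doubling input sizes and compare consecutive timings. A ratio near 2 means linear scaling; near 4 means quadratic. Here `json_encode` merely stands in for the real serializer call to show the harness.

```php
<?php
// Hypothetical harness: time a callable at doubling input sizes.
function timeAtSizes(callable $makeInput, callable $work, array $sizes): array
{
    $timings = [];
    foreach ($sizes as $size) {
        $input = $makeInput($size);
        $start = microtime(true);
        $work($input);
        $timings[$size] = microtime(true) - $start;
    }
    return $timings;
}

// json_encode stands in for the real serializer call.
$timings = timeAtSizes(
    fn (int $n) => array_map(fn ($i) => ['id' => $i, 'name' => "product-$i"], range(1, $n)),
    fn (array $items) => json_encode($items),
    [1000, 2000, 4000, 8000]
);

foreach ($timings as $size => $seconds) {
    printf("%5d items: %.4fs\n", $size, $seconds);
}
```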

I have tried commenting out all kinds of low-hanging fruit in \JMS\Serializer\GraphNavigator\SerializationGraphNavigator::accept.

Unfortunately, no real results... I know exception handling can sometimes slow things down quite a bit, and I saw the serializer library uses exceptions to communicate "normal" control flow as well, but that did not get me anywhere.

One thing currently on my mind is the foreach in \JMS\Serializer\XmlSerializationVisitor::visitArray. Maybe the appending of child nodes slows things down quite a lot...

scyzoryck commented 4 months ago

I suspect the JMS\Serializer\XmlSerializationVisitor class. The serializer package has some performance tests; looking at them, XML is 50% slower than JSON for the same data set.

Please also make sure you use the latest serializer package. Last year I merged a few improvements to memory usage and performance.

If you are going to work with big data sets, I'm not sure the serializer is the best choice. I would check out the flow-php library, which offers an ETL pattern.