stephanegigandet opened this issue 4 years ago
I've spent some time trying to profile the product import Perl script. I tried a number of memory-profiling packages. I was able to get output from Devel::Gladiator, but it was hard to understand in some parts and simply not useful in others. I tried Devel::Leak and Test::Memory::Leak, and the process just hung without returning anything.
In general, I would recommend keeping as few variables as possible in global scope, exported and shared between packages. I noticed a lot of global scoping and exports, which could be a concern: globals are less likely to be garbage-collected. Prefer lexical variables inside subroutines, as they are garbage-collected once the subroutine finishes executing.
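A minimal sketch of why that matters (the `Tracker` class and variable names here are made up for illustration): a lexical created inside a sub is released as soon as the sub returns, while a package global stays alive until the program exits.

```perl
use strict;
use warnings;

package Tracker;
our $freed = 0;
sub new     { return bless {}, shift }
sub DESTROY { $Tracker::freed++ }       # count how many objects were released

package main;

our $global = Tracker->new;    # package global: lives until global destruction

sub work {
    my $local = Tracker->new;  # lexical: refcount drops to zero when we return
    return;
}

work();
print "freed after work(): $Tracker::freed\n";   # freed after work(): 1
# $global is still alive here; it is only released when the program exits.
```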
I found a nice module called Devel::NYTProf. It does not profile memory, so it may not be ideal for our case, but it produces a detailed and expressive analysis of code execution times, with a graphical output that is very intuitive and rich. It is easy to install and run without changing any code in your program. I would suggest you run it: https://metacpan.org/pod/Devel::NYTProf
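For reference, a typical invocation (the script name and argument here are just placeholders):

```shell
# Run the script under the profiler; writes ./nytprof.out
perl -d:NYTProf ./import_csv_file.pl --csv_file products.csv

# Turn the profile into an HTML report under ./nytprof/
nytprofhtml
```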
The output revealed a couple of things that might be worth looking into further:
Here is a screenshot of just some of the report when I ran the script on 100 products that had already been imported:
Here is a tree map of the time spent in each function. It is very informative but may not be helpful for our problems:
You may also want to make sure all of your CPAN modules are updated to the latest versions as older versions may have bugs that contribute to the memory leaks.
Make sure you don't have any circular references in your code; Perl cannot garbage-collect them: https://perlmaven.com/eliminate-circular-reference-memory-leak-using-weaken There is a module to help you find them: https://metacpan.org/pod/Devel::Cycle
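A minimal core-Perl sketch of the `weaken()` fix from that article (the `Node` class and field names are made up): two objects referencing each other form a cycle that reference counting can never free; weakening one side of the cycle lets both be released.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

package Node;
our $destroyed = 0;
sub new     { return bless {}, shift }
sub DESTROY { $Node::destroyed++ }

package main;

# Two objects referencing each other: a cycle refcounting cannot free.
my ($x, $y) = (Node->new, Node->new);
$x->{peer} = $y;
$y->{peer} = $x;

weaken($y->{peer});   # make the back-reference weak: it no longer keeps $x alive

undef $x;             # $x is freed immediately (only a weak ref remained)
undef $y;             # freeing $x also dropped its ref to $y, so $y is freed too

print "destroyed: $Node::destroyed\n";   # destroyed: 2
```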
A good page with references to Perl best practices for memory management: https://metacpan.org/pod/MemoryProfiling
I am also looking at Devel::MAT, which seems like the most robust memory-profiler suite for Perl. Here is a good tutorial if you want to get started: http://perladvent.org/2017/2017-12-12.html
Note that you don't have to use a die() call to get the dump; you can capture it on a signal interrupt (Ctrl-\) or at the end of the script, see https://metacpan.org/pod/Devel::MAT::Dumper
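For reference, sketches of both options via Devel::MAT::Dumper's import flags (the script name is a placeholder):

```shell
# Write a .pmat heap dump automatically when the script finishes:
perl -MDevel::MAT::Dumper=-dump_at_END ./import_csv_file.pl

# Or dump on demand with Ctrl-\ (SIGQUIT) while the script is running:
perl -MDevel::MAT::Dumper=-dump_at_SIGQUIT ./import_csv_file.pl
```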
You can also use Memory::Stats to measure memory at checkpoints you specify in your code, so you can see how memory changes before and after certain blocks of code: https://metacpan.org/pod/Memory::Stats
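A sketch of the checkpoint API as documented on metacpan (this assumes Memory::Stats is installed from CPAN; the checkpoint labels are made up):

```perl
use strict;
use warnings;
use Memory::Stats;

my $stats = Memory::Stats->new;
$stats->start;

# ... first suspect block, e.g. importing one product ...
$stats->checkpoint("after first product");

# ... second suspect block ...
$stats->checkpoint("after second product");

$stats->stop;
$stats->report;   # prints the memory consumed between each checkpoint
```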
Hello @zigouras ! Thank you very much for taking the time to do this analysis!
The reports from Devel::NYTProf are very interesting; knowing which function is called the most, and from where, is going to be very useful. I'll definitely investigate what we can do for ProductOpener::Display::remove_tags_and_quote.
I read about circular references in the past, but I wasn't able to identify where I might be creating any. I don't think I create them (at least not intentionally).
Generally, according to Devel::MAT, the largest objects (after importing 100 products) are still synonyms (some 20 MiB each), which should be pretty constant.
I also tried dumping Devel::MAT output after every import iteration and ran pmat-leakreport on three files (1st, 50th, and 100th iteration). I'm not sure whether any of the potential leaks actually are leaks, but it would not make much sense to me to have that many MongoDB instances still alive after 100 iterations ...
I ran the latest CSV import script on all 1000 items; it only used 0.5 GB of memory and was not increasing on my laptop. I also used Devel::MAT and the pmat program to analyze the memory dump after 900 records were inserted. I didn't see any suspicious objects; the largest SV was the strtab at about 30 MB. The rest of the objects were distributed fairly evenly in terms of size.
My Devel::MAT analysis:
```
pmat [more]> count
  Kind       Count  (blessed)      Bytes  (blessed)
  ARRAY     225525         40   22.2 MiB    3.5 KiB
  CODE       17291         10    2.1 MiB    1.2 KiB
  GLOB       31189          2    4.5 MiB  304 bytes
  HASH      321132       4435  200.6 MiB    1.3 MiB
  INVLIST      490                1.1 MiB
  IO            35         35    5.5 KiB    5.5 KiB
  PAD        14600                4.8 MiB
  REF       547867               12.5 MiB
  REGEXP      3445        201  753.6 KiB   44.0 KiB
  SCALAR   3759076         17  198.2 MiB    1.2 KiB
  STASH       1780                1.5 MiB
  -------  -------  ---------  ---------  ---------
  (total)  4922430       4740  448.3 MiB    1.4 MiB
```
Thanks @hangy for his analysis. We may want to review how we use MongoDB in Perl. Database connections are always a likely place for performance issues. Do we use connection pooling? I will keep looking.
It looks like certain Perl regular expressions can take up a lot of memory: if any regex in the program uses $&, $` or $', it can be a big problem. https://stackoverflow.com/questions/165660/why-is-my-perl-regex-using-so-much-memory
I am trying to get https://metacpan.org/pod/Devel::FindAmpersand installed, but I am getting errors. It doesn't seem to work on Perl later than 5.18.4 per the CPAN unit tests, so I am downloading that version and trying to build it. If anyone can get Devel::FindAmpersand working, it seems like the best way to find the regex culprits in our code and all included modules.
Update: I built Perl 5.18.4 and successfully installed Devel::FindAmpersand with it, but couldn't get it to work with the other project dependencies.
I used simple grep commands to find the ampersand regexes and found a lot in our ProductOpener library. See my report here. Refactoring these seems like a good option for trying to reduce memory.
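A grep sketch along those lines (assuming the code lives under `lib/ProductOpener`; `-F` treats the pattern as a fixed string, so the regex metacharacters need no escaping):

```shell
grep -rnF '$&' lib/ProductOpener
grep -rnF "\$\`" lib/ProductOpener
grep -rnF "\$'" lib/ProductOpener
```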
Update: It looks like Devel::NYTProf reports on these regexes. I will have to inspect its output again; I wasn't looking for this the first time.
> We may want to review how we use MongoDB in Perl. Database connections are always a likely place for performance issues. Do we use connection pooling?
https://github.com/openfoodfacts/openfoodfacts-server/blob/8cb6fd279409ebd09e72de5e2504b3e14e0b93dc/lib/ProductOpener/Data.pm#L90 holds a single $client reference. As per https://metacpan.org/pod/MongoDB::MongoClient, the actual socket is opened lazily and stays open until it goes out of scope. We don't really clean up the $client (which I think is so the web app can reuse the client?), but there shouldn't be any additional instances.
> My Devel::MAT analysis shows that regexes use the most memory:

Really? It looks to be constant for me. After 50 products:
```
  Kind       Count  (blessed)      Bytes  (blessed)
  ARRAY     224130         39   20.7 MiB    3.4 KiB
  CODE       18558         10    2.3 MiB    1.2 KiB
  GLOB       33159          2    4.8 MiB  304 bytes
  HASH      230322       4451  129.2 MiB    1.3 MiB
  INVLIST      450              981.2 KiB
  IO            37         37    5.8 KiB    5.8 KiB
  PAD        15367                5.0 MiB
  REF       456683               10.5 MiB
  REGEXP      3491        196  763.7 KiB   42.9 KiB
  SCALAR   2206545         17  116.7 MiB    1.2 KiB
  STASH       1827                1.6 MiB
  -------  -------  ---------  ---------  ---------
  (total)  3190569       4752  292.5 MiB    1.4 MiB
```
After 100 products:
```
pmat> count
  Kind       Count  (blessed)      Bytes  (blessed)
  ARRAY     224130         39   20.7 MiB    3.4 KiB
  CODE       18558         10    2.3 MiB    1.2 KiB
  GLOB       33159          2    4.8 MiB  304 bytes
  HASH      230322       4451  129.3 MiB    1.3 MiB
  INVLIST      450              981.2 KiB
  IO            37         37    5.8 KiB    5.8 KiB
  PAD        15367                5.0 MiB
  REF       456683               10.5 MiB
  REGEXP      3491        196  763.7 KiB   42.9 KiB
  SCALAR   2210165         17  116.9 MiB    1.2 KiB
  STASH       1827                1.6 MiB
  -------  -------  ---------  ---------  ---------
  (total)  3194189       4752  292.7 MiB    1.4 MiB
```
Maybe I did something wrong (too small a sample size might be a problem), but I only see a ~0.2 MiB increase in SCALAR.
@hangy Sorry, I read the pmat count report wrong; I thought REGEXP was listed in MiB, but it is KiB. I do not see it growing over your run.
I still think it is worth investigating the problematic regexes as described above. I am not completely sure how to interpret the output of pmat. It may be that the problematic regexes are storing pre-match data as scalars, which is why that is the biggest type.
Here is another file to import with 100 real products, with full ingredients lists etc. They trigger much more processing involving ingredients analysis, taxonomies etc.
Something I forgot to mention: if you run the import multiple times, then the import script should detect that the product did not change, and there will be less processing (and the products won't be saved again in the disk and Mongo).
For the import script, the only MongoDB use should be when adding/updating each product.
Thanks for the update @stephanegigandet . I did my memory analysis on a run that was adding 1000 new test products to the system; I deleted all the products on disk and cleared the MongoDB collection before my run. I will try with your more extensive data set. I think this is the most promising clue at this point: Regex memory issues
@zigouras Indeed, as you found, I've been using $' and $` extensively; I wasn't aware that they caused memory issues. We can certainly try replacing them to see if it affects the memory usage, using variations of this:

"In general, apply /^(.*?)(pattern)(.*)$/s and use $1 for $`, $2 for $& and $3 for $'"
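A minimal sketch of that rewrite (the string and pattern here are made up): instead of matching and then reading $`, $&, $', capture the pre-match, match, and post-match explicitly.

```perl
use strict;
use warnings;

my $text = "code: 12345 (organic)";

# Instead of  $text =~ /\d+/  followed by $`, $&, $' ...
# ... capture the three parts explicitly; /s makes . match newlines too:
if ($text =~ /^(.*?)(\d+)(.*)$/s) {
    my ($pre, $match, $post) = ($1, $2, $3);
    print "pre=[$pre] match=[$match] post=[$post]\n";
    # pre=[code: ] match=[12345] post=[ (organic)]
}
```

The non-greedy `(.*?)` ensures the first group stops at the earliest possible match, mirroring what $` would have contained.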
> I think this is the most promising clue at this point: Regex memory issues
I agree that this should be tested. However, it's probably not as bad as the 2008 Stack Overflow answer makes it sound: according to perlvar, the gravest problems with those variables were fixed in 5.18 (2013) and 5.20 (2014):

> In Perl 5.18.0 onwards, perl started noting the presence of each of the three variables separately, and only copied that part of the string required. In Perl 5.20.0 a new copy-on-write system was enabled by default, which finally fixes all performance issues with these three variables, and makes them safe to use anywhere.
Performance-wise (maybe not memory-wise), it would probably be useful to see whether we can optimize get_string_id_for_lang(), which is by far the most-called function.
E.g. in one of my import test runs:
| Calls | P | F | Exclusive Time | Inclusive Time | Subroutine |
| --- | --- | --- | --- | --- | --- |
| 5889063 | 35 | 8 | 143s | 174s | ProductOpener::Store::get_string_id_for_lang |
| 7453 | 1 | 1 | 52.6s | 260s | ProductOpener::Tags::spellcheck_taxonomy_tag |
| 13140748 | 63 | 1 | 45.7s | 45.7s | ProductOpener::Ingredients::CORE:regcomp (opcode) |
| 99 | 1 | 1 | 42.8s | 465s | ProductOpener::Ingredients::extract_ingredients_classes_from_text |
| 71301254 | 30 | 1 | 31.5s | 31.5s | ProductOpener::Store::CORE:subst (opcode) |
We could try to merge most of the =~ s/// calls, and turn as many as possible into tr/// calls instead, which seem to be less costly.
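A minimal sketch of the difference (the string and character sets are made up): for plain character-for-character replacement, tr/// does a single linear pass over the string without invoking the regexp engine at all.

```perl
use strict;
use warnings;

my $id = "some product name";

# s///g runs the regexp engine, restarting the match after each substitution:
(my $via_s = $id) =~ s/ /-/g;

# tr/// is a single pass over the string, no regexp engine involved:
(my $via_tr = $id) =~ tr/ /-/;

print "$via_s\n$via_tr\n";   # both print "some-product-name"
```

Note tr/// only handles fixed character mappings; substitutions that need patterns (alternations, classes with quantifiers) still require s///.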
Sounds good, guys. Note that Devel::NYTProf reports on the problematic regexes. @hangy makes a good point that newer versions of Perl fixed the issue with these regex variables. However, I still think that if we have memory issues, it would be worth refactoring these regexes out to see if it improves memory.
Question: is the dev and production Perl compiled with thread support or not?
Should we close this issue?
This issue should be tested to see if we still have the bug.
See bug description.
There is a big memory leak when we update a product. It can be seen in particular when we launch the import_csv_file.pl script with a large number of products (e.g. 1000): the process keeps growing until it runs out of memory.
For a real product import from a big producer (e.g. 5000 products), the process grows to 10 GB of memory.
There are many different things that happen when a product is updated: the ingredients lists are processed, allergens and additives are detected, quality is measured, etc. Each of those steps involves calling many different subroutines, in particular many functions from Tags.pm related to taxonomies.
At this point, we do not know what exact part of the pre-saving processing causes the memory leak (it is very likely that several parts contribute to it).
This needs further investigation, maybe by testing each part individually (e.g. running each function tens of thousands of times and measuring memory), and/or with tools like Devel::Gladiator etc.
Steps to see the memory leak:
[..] (with thousands of products)
In /srv/off/scripts/ , run:

```shell
./import_csv_file.pl --user_id debug --org debug --source_id debug --source_name debug --source_url debug --define lc=en --csv_file debug-code-product_name-ingredients2.csv
```
debug-code-product_name-ingredients2.csv.zip
Related discussions/issues: #2053 #2054
This issue is very important to resolve, because it also happens (more gradually) in production on the web site: whenever a product is updated, the Apache mod_perl process grows, and eventually it needs to be restarted.
Part of #8764