napoler / ganon

Automatically exported from code.google.com/p/ganon

Memory Leaks, and solution for... #20

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi!

First of all, thank you for your ganon library. It's great code, and very useful to me.

I am writing some web scrapers with ganon, and I have memory leak problems in long-running scripts.

In the end, Linux kills my script because it exhausts all the available memory (even the swap partition), leaving my server running very slowly.

I spent a few days investigating these memory leaks, and these are my complete conclusions:

First, a small test that shows the memory leak with ganon:

<?php
include('lib/ganon.php');

set_time_limit(3600); // For slow servers

function ParseTest($TheHtml) {
  // Parse several times to check that memory is released
  // without leaving the function scope:
  for ($f = 1; $f <= 20; $f++) {
    $test_html = new HTML_Parser_HTML5($TheHtml);
    $span = $test_html->root->select('span[class="IsThis"]', 0);

    // Check that the select works
    if (!$span) echo 'Select Error...';
  }
}

echo '<pre>';
echo 'Php Version:'.phpversion().'<br><br>';

// Build an HTML string for testing
$test_string = str_repeat('<div><span class="NOIsThis">Foo</span></div><div><span class="IsThis">Bar</span></div>', 40);

// Loop to measure memory consumption after each batch of parses
for ($i = 1; $i <= 20; $i++) {
  ParseTest($test_string);
  echo sprintf('>>>>>>>>>> Iteration: %4s, Memory Usage: %8s <br>',
               $i, number_format(memory_get_usage()));
}
echo '</pre>';
?>

If I run the test with the original ganon (Ganon single file PHP5, rev. #72), the test script stops because it consumes all the memory available to PHP (128 MB in my case, I think).

This is the output of the test:

Php Version:5.2.14

>>>>>>>>>> Iteration:    1, Memory Usage: 9,277,760 
>>>>>>>>>> Iteration:    2, Memory Usage: 18,021,192 
>>>>>>>>>> Iteration:    3, Memory Usage: 26,567,912 
>>>>>>>>>> Iteration:    4, Memory Usage: 35,508,408 
>>>>>>>>>> Iteration:    5, Memory Usage: 44,055,968 
>>>>>>>>>> Iteration:    6, Memory Usage: 52,602,616 
>>>>>>>>>> Iteration:    7, Memory Usage: 61,935,256 
>>>>>>>>>> Iteration:    8, Memory Usage: 70,482,696 
>>>>>>>>>> Iteration:    9, Memory Usage: 79,028,872 
>>>>>>>>>> Iteration:   10, Memory Usage: 87,575,696 
>>>>>>>>>> Iteration:   11, Memory Usage: 96,122,120 
>>>>>>>>>> Iteration:   12, Memory Usage: 104,669,872 
>>>>>>>>>> Iteration:   13, Memory Usage: 113,216,320 
>>>>>>>>>> Iteration:   14, Memory Usage: 123,336,072 
>>>>>>>>>> Iteration:   15, Memory Usage: 131,883,464 

Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 3441 bytes) in \lib\ganon.php on line 247

Well, as I said, I spent a few days investigating this problem, and I found two things that cause these memory leaks:

- You must destroy the base class of any extended class (where necessary) by calling parent::__destruct();
- The callback functions created with create_function cannot be destroyed, and Ganon creates a lot of these callback functions.
  Reference: the comments in the PHP manual at http://php.net/manual/en/function.create-function.php

It is always better to avoid autogenerated code.
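
A minimal sketch of the two points above, not taken from ganon or from the patched file. The names BaseNode, ElementNode and match_is_this are hypothetical and exist only for illustration. PHP does not call a parent destructor implicitly when a subclass defines its own __destruct, and every create_function call defines a new global function that is never released (the function was later removed in PHP 8.0), so a named function is used as the callback instead.

```php
<?php
// Hypothetical classes for illustration; these are not ganon's own.
class BaseNode {
    public static $live = 0;            // number of instances still alive
    protected $children = array();

    public function __construct() { self::$live++; }

    public function __destruct() {
        $this->children = array();      // release references held by the base
        self::$live--;
    }
}

class ElementNode extends BaseNode {
    protected $attributes = array('class' => 'IsThis');

    public function __destruct() {
        $this->attributes = array();
        parent::__destruct();           // not called implicitly by PHP
    }
}

// A named function used as a callback instead of create_function().
// It is defined once, no matter how many times it is used.
function match_is_this($class_name) {
    return $class_name === 'IsThis';
}

$node = new ElementNode();
unset($node);                           // both destructors run here

$matches = array_filter(array('NOIsThis', 'IsThis'), 'match_is_this');
```

If parent::__destruct() were left out of ElementNode, the counter in BaseNode would stay at 1 after unset, which is an easy way to spot a missing call.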

I made all these modifications in ganon.php, creating nml_ganon.php. Using it, this is the result of the previous test:

Php Version:5.2.14

>>>>>>>>>> Iteration:    1, Memory Usage:  600,712 
>>>>>>>>>> Iteration:    2, Memory Usage:  600,712 
>>>>>>>>>> Iteration:    3, Memory Usage:  600,712 
>>>>>>>>>> Iteration:    4, Memory Usage:  600,712 
>>>>>>>>>> Iteration:    5, Memory Usage:  600,712 
>>>>>>>>>> Iteration:    6, Memory Usage:  600,720 
>>>>>>>>>> Iteration:    7, Memory Usage:  600,720 
>>>>>>>>>> Iteration:    8, Memory Usage:  600,720 
>>>>>>>>>> Iteration:    9, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   10, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   11, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   12, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   13, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   14, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   15, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   16, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   17, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   18, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   19, Memory Usage:  600,720 
>>>>>>>>>> Iteration:   20, Memory Usage:  600,720 

Ok, no memory leaks...

NOTE:
I have changed only one of the callback functions for now, but the code has other create_function calls in the getChildrenByAttribute function of the HTML_Node class, so it is not complete yet (maybe I will finish it in a few days).

I don't know if I can attach a file here (I will try). In any case, I have made the file accessible on one of my servers, at:

http://trucomania.org/inaki/nml_ganon_rev72.zip

I hope you will consider this for your next revision. I will change the rest of the callbacks when I find time.

Thanks again for your great library! 

Original issue reported on code.google.com by Radika...@gmail.com on 20 Sep 2012 at 9:42

GoogleCodeExporter commented 9 years ago
Ok, done.

I replaced all the create_function calls in the code (and fixed other problems: http://code.google.com/p/ganon/issues/detail?id=21 )

This version parses data in loops of more than 100k iterations without problems.

The file can also be downloaded from:
http://trucomania.org/inaki/nml_ganon_rev72_2.rar

Regards!

Original comment by Radika...@gmail.com on 20 Sep 2012 at 12:04

GoogleCodeExporter commented 9 years ago
Hi again (I'm talking to myself :)

My version nml_ganon_2 still has memory leaks on some pages, for example www.amazon.es... :(

I think the cause is a circular reference involving the root node.

So I re-checked the class constructors and destructors, and modified the destruction of the nodes, starting with the root node.

Now everything works fine.
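
The circular reference problem can be illustrated with a small sketch. It assumes a hypothetical TreeNode class with a dismantle method; ganon's real node classes and the code in the patched file are different. When a parent holds its children and each child points back at its parent, the reference counts never reach zero on their own. PHP 5.3 added a cycle collector, but the PHP 5.2.14 used in this report has none, so the links have to be broken by hand, starting at the root.

```php
<?php
// Hypothetical node class for illustration; ganon's real classes differ.
class TreeNode {
    public static $live = 0;            // number of instances still alive
    public $parent = null;
    public $children = array();

    public function __construct() { self::$live++; }
    public function __destruct() { self::$live--; }

    public function addChild(TreeNode $child) {
        $child->parent = $this;         // child -> parent
        $this->children[] = $child;     // parent -> child: a reference cycle
        return $child;
    }

    // Break the cycles top-down, starting from this node.
    public function dismantle() {
        foreach ($this->children as $child) {
            $child->dismantle();
        }
        $this->children = array();
        $this->parent = null;
    }
}

function build_tree() {
    $root = new TreeNode();
    $root->addChild(new TreeNode())->addChild(new TreeNode());
    return $root;
}

$root = build_tree();                   // three nodes, linked both ways
$root->dismantle();                     // cut every parent/child link
unset($root);                           // the last reference goes away
$live_after_dismantle = TreeNode::$live;
```

Counting live instances in the constructor and destructor, as done here, is a cheap way to check whether a teardown routine really releases every node.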

I also added a new version of select that throws an exception if the select returns null.
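
A rough sketch of that idea, written as a standalone helper and not as the method in the patched file. StubNode and select_or_fail are hypothetical names, and the stub only imitates a select($query, $index) call that returns null when nothing matches.

```php
<?php
// Hypothetical stand-in for a parsed node; only select() matters here.
class StubNode {
    private $found;

    public function __construct(array $found) { $this->found = $found; }

    // Returns the stored match for $query, or null when nothing matches.
    public function select($query, $index) {
        return isset($this->found[$query]) ? $this->found[$query] : null;
    }
}

// Hypothetical helper: turn a silent null into an exception.
function select_or_fail($node, $query, $index = 0) {
    $match = $node->select($query, $index);
    if ($match === null) {
        throw new RuntimeException("No element matches: $query");
    }
    return $match;
}

$node = new StubNode(array('span[class="IsThis"]' => 'Bar'));
$hit = select_or_fail($node, 'span[class="IsThis"]');

try {
    select_or_fail($node, 'span[class="Missing"]');
    $error = null;
} catch (RuntimeException $e) {
    $error = $e->getMessage();
}
```

In a long scraping loop this makes a failed selection stop at the line that caused it, instead of surfacing later as a call on null.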

The file is also available at:

http://trucomania.org/inaki/nml_ganon_rev72_3.zip

Original comment by Radika...@gmail.com on 22 Sep 2012 at 4:47

GoogleCodeExporter commented 9 years ago
Thanks for the modification. I'm using it now as well. I recommend the author 
review this and commit.

Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:40

GoogleCodeExporter commented 9 years ago
Thanks for your report and hard work! I apologize for responding this late.
This is indeed caused by the circular reference with the root/child nodes. Your 
fix works in your example, but will not work in combination with 
str_get_dom/file_get_dom. I haven't found a solution yet, so the best bet would 
probably be to manually call $test_html->root->clear() after you're done. 
Perhaps I should write a wiki page about that.

I did get rid of the create_function structures in the new version :)

Original comment by niels....@gmail.com on 19 Oct 2012 at 5:38

GoogleCodeExporter commented 9 years ago
"so the best bet would probably be to manually call $test_html->root->clear() 
after you're done"

Hmm, that is the first thing I tried, and it doesn't work. The memory is not freed even if you call clear(), unset, and the many other things I tried :) (I was desperate!)
All the nodes remain in memory even after you call clear().

One of my scripts runs about 900K iterations each day, and each day the script brings down my server for lack of memory.

I don't understand your method for freeing all the nodes, so I wrote a new one. Take a look (it works).

It took me days of debugging, but I learned a lot of PHP :)

In the end, I don't use ganon anymore because of its lack of speed, but it is a great product.

Original comment by Radika...@gmail.com on 19 Oct 2012 at 6:06

GoogleCodeExporter commented 9 years ago
I tried the nml_ganon.php, but it fails with this code while the original 
ganon.php works fine:

            $mail->setOuterText(
                 '<script type="text/javascript">'
                .'document.write(deobfuscate(\''.str_rot13(base64_encode($mail->html())).'\'));'
                .'</script>'
                .'<noscript>Please enable JavaScript to see the eMail address</noscript>'
            );

It works when only the <script> part is added, but it fails when the <noscript> part is added as well.

Original comment by google.2...@spamgourmet.com on 22 Apr 2013 at 5:38

GoogleCodeExporter commented 9 years ago
The problem with the memory leaks stems from the fact that HTML_Parser creates 
a node tree that does not get destroyed in the destructor and so both the tree 
and the parser remain.

Attached is a patch that adds a destructor to HTML_Parser and uses a wrapper 
class that will call this destructor.
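
The wrapper approach can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: DemoParser, DemoParserHandle and the release method are hypothetical names, and the real HTML_Parser is not used. The idea is that the wrapper itself is not part of any reference cycle, so its destructor runs as soon as the last variable holding it goes away, and it can then tell the parser to drop its tree.

```php
<?php
// Hypothetical stand-ins; the attached patch and ganon's classes may differ.
class DemoParser {
    public static $live = 0;            // number of parsers still alive
    public $root;

    public function __construct() {
        self::$live++;
        $this->root = new stdClass();
        $this->root->owner = $this;     // the tree points back at the parser
    }

    // Explicit teardown that breaks the parser <-> tree cycle.
    public function release() {
        if ($this->root !== null) {
            $this->root->owner = null;
            $this->root = null;
        }
    }

    public function __destruct() { self::$live--; }
}

// The wrapper is outside the cycle, so plain reference counting destroys it.
class DemoParserHandle {
    private $parser;

    public function __construct(DemoParser $parser) { $this->parser = $parser; }

    public function parser() { return $this->parser; }

    public function __destruct() {
        $this->parser->release();       // break the cycle inside the parser
        $this->parser = null;           // now the parser itself can be freed
    }
}

function parse_once() {
    $handle = new DemoParserHandle(new DemoParser());
    return get_class($handle->parser()->root);
}   // $handle leaves scope here and its destructor runs

for ($i = 0; $i < 100; $i++) {
    parse_once();
}
$live_parsers = DemoParser::$live;
```

Callers only ever hold the handle, so no explicit cleanup call is needed at the call site.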

Second attachment is an example script. Restoring old behaviour is easy. You 
will clearly see the memory usage explode.

Original comment by leonard...@gmail.com on 11 Sep 2014 at 3:06

GoogleCodeExporter commented 9 years ago
Hi Everyone, 

I want to use this library. Can anyone please tell me whether this memory problem has been fixed and the code committed?

Also, does it work better than others like PhpSimpleDom?

Thanks

Original comment by sirs...@gmail.com on 23 Feb 2015 at 8:08