opensangja / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Limit memory usage for the process running Abot #55

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Limit memory usage for the process running Abot. A user asked...

"What happens when you run out of RAM to store the cached Uris you have 
crawled?" 

Instead of limiting this list's size, it would be better to limit the entire
process's memory consumption in case other lists/references are added.

Original issue reported on code.google.com by sjdir...@gmail.com on 25 Dec 2012 at 11:58

GoogleCodeExporter commented 9 years ago
After quite a bit of back and forth I decided not to implement this for the 
reasons below.

1: Most of the methods available (listed below) for detecting memory usage only
provide estimates and are not a reliable enough metric to count on. We could end
up stopping the crawl prematurely based on bad estimates (a small sketch of the
readings they return follows the list).
- System.GC.GetTotalMemory()
- Process.GetCurrentProcess().VirtualMemorySize64
- Environment.WorkingSet
- There are several others, but all have the same problem
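
For reference, a minimal sketch (not Abot code) of how those readings are typically taken; each value is a point-in-time estimate rather than a guarantee of what can still be allocated:

```csharp
// Illustration only: the readings below are estimates, which is why they are
// not a safe basis for deciding when to stop a crawl.
using System;
using System.Diagnostics;

class MemoryEstimates
{
    static void Main()
    {
        long managedHeap = GC.GetTotalMemory(false);                          // managed heap only, approximate
        long virtualBytes = Process.GetCurrentProcess().VirtualMemorySize64;  // reserved virtual address space
        long workingSet = Environment.WorkingSet;                             // physical memory currently mapped

        Console.WriteLine("GC.GetTotalMemory:      {0} MB", managedHeap / (1024 * 1024));
        Console.WriteLine("VirtualMemorySize64:    {0} MB", virtualBytes / (1024 * 1024));
        Console.WriteLine("Environment.WorkingSet: {0} MB", workingSet / (1024 * 1024));
    }
}
```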

2: Just because you have 2GB of memory allocated to a process and the sum of
all your object instances is below 2GB doesn't mean you won't have memory
exceptions/problems. If you are using a List<string> (which is backed by an
array) and it grows to roughly 600MB, you can get an OutOfMemoryException if
there is not 600MB of contiguous memory space (which the array requires) in
which to store it. The point is that keeping the process's memory usage below
the available memory doesn't mean you're ok. If this feature were implemented
it would give a false sense of memory safety.
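
As a rough illustration of that contiguous-allocation point (a sketch, not Abot code): growing a List<string> forces its backing array to be reallocated as one contiguous block, so the loop below can fail with OutOfMemoryException well before total process memory is exhausted.

```csharp
// Illustration only: each time the List outgrows its backing array, a new
// array roughly twice the size must be allocated as one contiguous block and
// the existing references copied into it.
using System;
using System.Collections.Generic;

class ContiguousAllocationDemo
{
    static void Main()
    {
        var crawledUris = new List<string>();
        try
        {
            for (long i = 0; ; i++)
                crawledUris.Add("http://example.com/page/" + i);
        }
        catch (OutOfMemoryException)
        {
            // Can be thrown while total memory used is still below what the
            // process could, in principle, allocate in smaller pieces.
            Console.WriteLine("OutOfMemoryException after {0} entries", crawledUris.Count);
        }
    }
}
```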

I believe the key to properly handling memory issues from within a crawler is 
the "MaxPagesToCrawl" config value. If you want to crawl millions of pages in a 
single continuous crawl, you need to estimate how many page crawls your server 
can handle. See this forum post on how to determine the amount of hardware 
you'll need for your intended purpose. Also be sure to take page file size into 
account (at least 1.5 times the RAM)...

https://groups.google.com/forum/?fromgroups=#!topic/abot-web-crawler/rsICtZgzpRQ

The bottom line is that dynamically handling memory consumption/limits doesn't 
help much from a crawler's perspective. There are too many variables to 
consider, and it can actually cause more problems than it solves. My advice is 
to set the "MaxPagesToCrawl" value to what you estimate (from the post above) 
your machine can handle.
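
For completeness, a hedged sketch of that advice, assuming Abot's CrawlConfiguration/PoliteWebCrawler types; the constructor overload and the 50,000 figure are illustrative assumptions that vary by Abot version and hardware:

```csharp
// Sketch only: cap the crawl by page count rather than trying to police memory.
// The constructor overload and the MaxPagesToCrawl figure are assumptions for
// illustration; derive the real number from the hardware-sizing post above.
using System;
using Abot.Crawler;
using Abot.Poco;

class CappedCrawl
{
    static void Main()
    {
        var config = new CrawlConfiguration();
        config.MaxPagesToCrawl = 50000; // illustrative value, not a recommendation

        var crawler = new PoliteWebCrawler(config); // overloads differ between Abot versions
        CrawlResult result = crawler.Crawl(new Uri("http://example.com/"));

        Console.WriteLine("Crawl finished. Error occurred: {0}", result.ErrorOccurred);
    }
}
```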

Original comment by sjdir...@gmail.com on 28 Dec 2012 at 10:57

GoogleCodeExporter commented 9 years ago
Found a dependable way to handle this using....

http://msdn.microsoft.com/en-us/library/system.runtime.memoryfailpoint(v=vs.100).aspx

Reopening this ticket to implement
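
A minimal sketch of how a MemoryFailPoint gate could work (illustrative only; the 10 MB estimate is an arbitrary placeholder, not the exact change made in the revisions below):

```csharp
// Sketch only: ask the CLR up front whether an estimated amount of memory is
// likely to be available before processing the next page, instead of waiting
// for an OutOfMemoryException mid-crawl.
using System;
using System.Runtime;

class MemoryGateDemo
{
    static bool TryProcessNextPage(int estimatedMegabytes)
    {
        try
        {
            // Throws InsufficientMemoryException if the CLR predicts the
            // allocation could not succeed; this is a prediction, not a hard
            // reservation of physical memory.
            using (new MemoryFailPoint(estimatedMegabytes))
            {
                // ...crawl/parse the page inside the gate...
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            return false; // stop or pause the crawl gracefully
        }
    }

    static void Main()
    {
        Console.WriteLine("Enough memory for next page: {0}", TryProcessNextPage(10));
    }
}
```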

Original comment by sjdir...@gmail.com on 27 Feb 2013 at 7:25

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 1 Mar 2013 at 1:06

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r293.

Original comment by sjdir...@gmail.com on 13 Mar 2013 at 9:18

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r294.

Original comment by sjdir...@gmail.com on 13 Mar 2013 at 9:51