helgeho / Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
MIT License

OutOfMemoryError #4

Closed dportabella closed 5 years ago

dportabella commented 7 years ago

Running this program:

  val maxLevel = 2
  val outPath = "/tmp/out"
  val domain = "epfl.ch"
  val seedUrl = "http://www.epfl.ch"

  Web2Warc.crawl.name = domain
  Web2Warc.writer.path = outPath
  Web2Warc.spec.maxLevel = maxLevel
  // restrict the crawl to the domain and its subdomains; Pattern.quote keeps the "." in the domain literal
  Web2Warc.spec.urlRegex = "[^:]+://([^/]*.|)" + java.util.regex.Pattern.quote(domain) + "(/.*|$)"
  Web2Warc.spec.preserveUrlRegex = ".*" // preserve all crawled URLs (default)
  Web2Warc.spec.followRedirects = true
  Web2Warc.spec.increaseLevelOnRedirect = true
  Web2Warc.seeds += seedUrl

  Web2Warc.run()

Fails with OutOfMemoryError:

::Queue state:: current level: 1, URLs in queue: 114
::Queue state:: current level: 1, URLs in queue: 137
...
::Queue state:: current level: 2, URLs in queue: 1020
::Queue state:: current level: 2, URLs in queue: 1019
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Any idea? Is the program scalable, or is the URL queue kept in memory? Still, 2000 URLs do not seem like enough to cause an OutOfMemoryError.
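As a side note on the urlRegex in the program above, here is a minimal sketch for checking which URLs such a pattern accepts, with and without Pattern.quote around the domain. It assumes the crawler tests whole URLs against the pattern the way String.matches does, which the thread does not confirm. The object name UrlRegexCheck, the quoteDomain switch, and the sample URLs are made up for illustration.

```scala
import java.util.regex.Pattern

object UrlRegexCheck {
  // Builds the same pattern shape as the program above.
  // `quoteDomain` is a hypothetical switch added only for this comparison.
  def urlRegex(domain: String, quoteDomain: Boolean): String = {
    val d = if (quoteDomain) Pattern.quote(domain) else domain
    "[^:]+://([^/]*.|)" + d + "(/.*|$)"
  }

  def main(args: Array[String]): Unit = {
    val quoted = urlRegex("epfl.ch", quoteDomain = true)
    val unquoted = urlRegex("epfl.ch", quoteDomain = false)
    // Sample URLs are illustrative, not taken from the crawl log.
    val urls = Seq(
      "http://www.epfl.ch",
      "http://www.epfl.ch/about",
      "http://epflXch/",
      "http://example.com/"
    )
    for (u <- urls) {
      // String.matches requires the whole URL to match the pattern.
      println(s"$u quoted=${u.matches(quoted)} unquoted=${u.matches(unquoted)}")
    }
  }
}
```

Without quoting, the "." in the domain is a regex wildcard, so a host such as epflXch would also pass the filter. That affects which URLs are queued, not how much memory each one needs.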

helgeho commented 7 years ago

Hi David, I've run crawls with a lot more than 2000 URLs in the queue, so that's usually not a problem. Can you increase your heap space, i.e., with the Java parameter -Xmx?
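To see whether an -Xmx setting actually reaches the JVM that runs the crawl, a small check like the following can be run in the same environment before starting Web2Warc. This is only a sketch using the standard java.lang.Runtime API. The object name HeapCheck and the helper toMiB are made up for illustration and are not part of Web2Warc.

```scala
object HeapCheck {
  // Converts a byte count to whole mebibytes.
  def toMiB(bytes: Long): Long = bytes / (1024L * 1024L)

  def main(args: Array[String]): Unit = {
    val rt = Runtime.getRuntime
    // maxMemory is the upper bound the JVM will try to use (governed by -Xmx).
    println(s"max heap:  ${toMiB(rt.maxMemory())} MiB")
    // totalMemory is what the JVM has currently reserved for the heap.
    println(s"allocated: ${toMiB(rt.totalMemory())} MiB")
    // freeMemory is the unused part of the currently reserved heap.
    println(s"free:      ${toMiB(rt.freeMemory())} MiB")
  }
}
```

If the reported maximum is far below the value passed with -Xmx (for example -Xmx4g for a 4 GiB limit), the option is probably not being applied to the process that runs the crawler.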

helgeho commented 7 years ago

Is this problem solved? As I said, it usually works for me...

And I just commented on the other issue that Web2Warc is finally available on Maven Central:

libraryDependencies += "com.github.helgeho" % "web2warc" % "1.1"

After a long busy period, I finally have some time to maintain my projects and push some new features. I hope this helps, and I would be happy if you tried it again!

dportabella commented 7 years ago

I tried the latest version, but I get the same OutOfMemoryError. I'll try again as soon as I understand how to increase the heap space in Ammonite.