Hi @bizrockman,
please use a meaningful limit for --max-visited-urls (instead of 1000000000), because this value is used to pre-allocate memory for communication between crawler threads. Set e.g. --max-visited-urls=20000 (the default is 10000), and if that is not enough, increase it to --max-visited-urls=50000, and so on.
I also recommend setting a higher memory limit, e.g. --memory-limit=4G (the default is 2G).
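Putting those two values together, a sketch of such a call could look like this (www.example.com is just a placeholder for your real domain, and 50000/4G are the suggested starting points, not exact values for your site):
crawler.bat --url=https://www.example.com/ --max-visited-urls=50000 --memory-limit=4G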
On Windows, if you would like to use parallel processing with more --workers, I recommend running the crawler in a WSL Linux distro, where you can run e.g. Ubuntu or Debian.
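A rough sketch of that setup, assuming Ubuntu is already installed in WSL and that the Linux launcher script is named crawler (both the launcher name and the --workers value of 4 are assumptions for illustration only):
wsl -d Ubuntu
./crawler --url=https://www.example.com/ --workers=4 --memory-limit=4G   # launcher name and worker count are assumptions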
For a list of all settings, I recommend reading https://crawler.siteone.io/configuration/command-line-options/
And just to be sure: if you don't want to explore the possibilities of the CLI tool yourself, you can use the SiteOne Crawler desktop application. It uses this CLI tool internally and also shows you the whole CLI command that was used in the generated report. That is quite an effective help when you are just getting acquainted with the tool.
Let me know if I helped you :)
I would suggest taking another approach for --max-visited-urls. It should not be directly tied to the amount of memory needed. I would use a disk-based approach (SQLite/DuckDB maybe), so that a whole website can be crawled without any thought about memory.
The same applies to --max-queue-length: if a URL is too long, the process ends, which I find very annoying.
I would like to crawl a whole website, and the tool should crawl all pages regardless of the length of a URL; otherwise it ends up as a trial-and-error session to find the correct values.
But overall the tool makes an excellent impression. :-D
Thank you for your feedback @bizrockman. I will consider other options in the future. In a few weeks, Swoole 6 should be released, which already provides mechanisms for communication between threads that do not require pre-allocation of memory. It certainly makes sense to eventually use an in-process database such as DuckDB or SQLite on the file system.
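As a rough illustration of that idea (this is not how the crawler works today; the file, table, and column names are made up), a disk-backed queue can be as small as one SQLite table keyed by URL, so memory use stays flat no matter how many URLs are discovered:
sqlite3 crawl-queue.db "CREATE TABLE IF NOT EXISTS url_queue (url TEXT PRIMARY KEY, visited INTEGER DEFAULT 0);"  # hypothetical schema
sqlite3 crawl-queue.db "INSERT OR IGNORE INTO url_queue (url) VALUES ('https://www.example.com/');"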
I already had the idea of writing a part of the application in Rust that would communicate with the main application via sockets. The optimization possibilities are endless :)
I am trying to "download" a website to be served as a local copy. But it did not work. I am getting:
WARNING SharedMemory::alloc(): mmap(5810231990328) failed, Error: Resource temporarily unavailable[11]
Fatal error: Swoole\Table::create(): unable to allocate memory in /cygdrive/C/Development/tools/siteone-crawler/src/Crawler/Crawler.php on line 138
Should I use another option on Windows? Right now I am using Cmder.
Since I do not know how many URLs there are to visit, I would like to set that value very high. But this is not working.
I would use an SQLite DB instead of an in-memory approach. Would that solve the issue?
crawler.bat --url=https://www.example.com/ --offline-export-dir=tmp/example.com --allowed-domain-for-external-files='*' --allowed-domain-for-crawling='www.example.com' --max-visited-urls=1000000000
I am still not sure if that is the right way of getting a local copy of a webpage.
I would prefer a simpler API with more default options.
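Applying the limits recommended earlier in the thread to that same command (the domain is still a placeholder, and 50000/4G are the suggested starting values, not requirements) would give something like:
crawler.bat --url=https://www.example.com/ --offline-export-dir=tmp/example.com --allowed-domain-for-external-files='*' --allowed-domain-for-crawling='www.example.com' --max-visited-urls=50000 --memory-limit=4G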