Closed: Falumpaset closed this 2 months ago
Use jemalloc for the memory backend; we need the visited links to prevent re-crawling the same URL within a run.
Is there any documentation on how to use spider with jemalloc? I see that there's a jemalloc feature flag. Could you please provide an example? Very much appreciated!
No, it just swaps the memory backend. You usually want to set this up manually at the top of your entry point; spider can also handle it with the feature flag you mentioned.
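If you set it up yourself, a minimal sketch looks like the following. It assumes the `tikv-jemallocator` crate as the allocator shim; the exact crate that spider's jemalloc feature pulls in may differ.

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Swap the global allocator for the whole binary; declare this once at the
// top of main.rs so every allocation in the process goes through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ... start the crawler as usual; it now allocates via jemalloc.
}
```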
You can also use the "fs" feature flag to stream responses to disk and retrieve them asynchronously after the crawl finishes. This reduces how much page content has to be held in memory at once.
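Assuming the feature names match the flags mentioned above, enabling them would look roughly like this in Cargo.toml (the version is a placeholder):

```toml
[dependencies]
# Version is a placeholder; "jemalloc" and "fs" are the feature flags referenced above.
spider = { version = "2", features = ["jemalloc", "fs"] }
```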
@Falumpaset we now use string interning for links visited. This should help out too!
Hey,
I'm crawling some sites in parallel. However, these sites are very big, and the crawler's memory consumption increases over time. I'm countering that by increasing the swap size, but that should only be a temporary solution.
Is there a way to keep it from storing the visited pages in memory? I don't need them because I'm subscribing to the crawler and processing the pages on the fly. Any ideas?
See my implementation below.
Help is very much appreciated!
Kind regards.
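For context, a subscribe-and-process setup along those lines (not the poster's actual code) looks roughly like this, assuming spider's re-exported tokio and its broadcast-style `subscribe` API from the "sync" feature:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");

    // Subscribe before crawling; pages arrive over a broadcast channel and can
    // be processed on the fly instead of being collected after the crawl.
    let mut rx = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Process each page as it arrives; printing the URL is a stand-in.
            println!("visited: {:?}", page.get_url());
        }
    });

    website.crawl().await;

    // Drop the crawler so the channel closes and the reader task exits.
    drop(website);
    let _ = handle.await;
}
```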