scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Use JOBDIR of scrapy #452

Closed: Schulzjo closed this issue 2 years ago

Schulzjo commented 2 years ago

Hey, I am a little bit confused about the JOBDIR setting of Scrapy. I am actually struggling with some kind of memory leak (or not, I am not sure...). I found this Stack Overflow thread: https://stackoverflow.com/questions/30441191/memory-leak-in-scrapy

and also this sentence in scrapy docs:

"By default Scrapy keeps the request queue in memory; it includes Request objects and all objects referenced in Request attributes (e.g. in cb_kwargs and meta). While not necessarily a leak, this can take a lot of memory. Enabling persistent job queue could help keeping memory usage in control."

Both point to using the JOBDIR setting.

But triggering it with

curl http://localhost:6800/schedule.json -d project=default -d spider=myspider -d some_arg="something" -d settings=JOBDIR=crawls/my_spider does not create a crawls folder and seems to have no effect.

Is the jobstorage = scrapyd.jobstorage.SqliteJobStorage setting in scrapyd.conf the same as JOBDIR?

Or does jobstorage just store queued jobs?

Where are the requests located during scraping?

I have to reduce my memory consumption in big/long scraping jobs.

Schulzjo commented 2 years ago

OK, it seems the syntax of my curl request was wrong. Using the python-scrapyd-api package, the crawls folder gets created:

# pip install python-scrapyd-api
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')
# Pass JOBDIR as a per-job Scrapy setting
s = {'JOBDIR': 'crawls/my_spider'}
scrapyd.schedule("default", "myspider", settings=s)

So, how do I give these settings to scrapyd with curl?

And what is the result, compared to jobstorage?

jpmckinney commented 2 years ago

You have a typo in the curl command. -d setting=... not -d settings=. https://scrapyd.readthedocs.io/en/stable/api.html#schedule-json
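
For reference, your original command with only that parameter name corrected (same project, spider, and spider argument) would be:

curl http://localhost:6800/schedule.json -d project=default -d spider=myspider -d some_arg="something" -d setting=JOBDIR=crawls/my_spider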

You can read about job storage here: https://scrapyd.readthedocs.io/en/stable/config.html#jobstorage. However, storing jobs requires extremely little memory (unless, maybe, you are running millions of jobs).
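
To answer the distinction you asked about: jobstorage is set in scrapyd.conf and only controls how Scrapyd keeps its record of finished jobs; it is separate from Scrapy's JOBDIR and does not store the request queue. A minimal scrapyd.conf snippet using the class you mentioned:

[scrapyd]
jobstorage = scrapyd.jobstorage.SqliteJobStorage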

The memory leak is likely in your spider's code. Scrapyd and Scrapy can't help with that. Look for anything that you are keeping in memory between requests. The sentence from Scrapy's docs is only relevant if you have millions of requests in the queue. In that case, don't queue up so many requests in your spider (e.g. handle some requests before queuing more). Keep in mind that this is the queue (i.e. unsubmitted requests), not all the requests ever sent.