Erhanjinn closed this issue 4 years ago.
Try configuring one Splash instance per core with Aquarium; after that, write to us again. (The number of instances determines how many executions can run concurrently.)
https://github.com/TeamHG-Memex/aquarium
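For anyone following along, the Aquarium setup steps from its README boil down to the following (not run here; the generated folder name depends on your cookiecutter answers):

```shell
# Install cookiecutter and generate an Aquarium project from its template
pip install cookiecutter
cookiecutter gh:TeamHG-Memex/aquarium

# Answer the prompts (number of Splash instances, memory limits, auth),
# then start the Splash cluster from the generated folder:
cd aquarium   # folder name comes from your cookiecutter answers
docker-compose up
```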
On Fri, Mar 29, 2019, 16:23, Erhanjinn notifications@github.com wrote:
Hello,
I would like to use scrapy-splash to scrape multiple pages from one domain in parallel. The site uses JavaScript to render some of the content I am interested in.
However, I get mixed responses. When setting
CONCURRENT_REQUESTS = 2
or even
CONCURRENT_REQUESTS = 1
the responses get mixed up and are not 100% correct.
I am creating the requests as follows:
yield scrapy.Request(url, self.parse, headers={'User-Agent': self.custom_user_agent}, meta={'splash': {'args': {'wait': 15}, 'endpoint': 'render.html', 'slot_policy': SlotPolicy.SINGLE_SLOT}})
I tried setting slot_policy to both SINGLE_SLOT and PER_DOMAIN, and neither helped.
What else should I set when scraping only one domain?
Thanks,
Jan
Thank you for your reply.
I followed the guide you provided, but I have a question: is there anything special I have to do to connect to this Aquarium Splash instance? Using SPLASH_URL = 'http://0.0.0.0:8050' in settings.py doesn't work.
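In case it helps, here is a settings.py sketch for scrapy-splash (the middleware and dupefilter entries are the ones listed in the scrapy-splash README; the host and port are assumptions, and note that 0.0.0.0 is a bind-all address for the server side, so from the same machine you would connect via localhost):

```python
# settings.py (sketch; adjust host/port to match your docker-compose setup)
SPLASH_URL = 'http://localhost:8050'  # connect via localhost, not 0.0.0.0

# Middleware configuration from the scrapy-splash README:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```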
Now, instead of
docker run -p 8050:8050 scrapinghub/splash
I ran
docker-compose up
in the Aquarium folder that was created, and I can see it is live:
splash1_1 | 2019-03-29 17:20:37.065053 [-] "172.20.0.6" - - [29/Mar/2019:17:20:36 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1 | 2019-03-29 17:20:37.370812 [-] "172.20.0.6" - - [29/Mar/2019:17:20:36 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash0_1 | 2019-03-29 17:20:38.733114 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash1_1 | 2019-03-29 17:20:39.072299 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1 | 2019-03-29 17:20:39.372951 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash0_1 | 2019-03-29 17:20:40.735392 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash1_1 | 2019-03-29 17:20:41.075256 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1 | 2019-03-29 17:20:41.379273 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
My crawler cannot connect, however (running it after docker run -p 8050:8050 scrapinghub/splash works fine). It gives:
AttributeError: 'XYZSpider' object has no attribute 'crawler'
Also, I am running my scrapy-splash project inside a VirtualBox VM, which has 2 CPU cores assigned from the host PC. I created the Aquarium setup with 3 instances of Splash. Is this what you meant? Basically I left all the attributes at their defaults, as I did not fully understand what you meant by "Try to configure a splash instance per Core with aquarium".
Seems related to https://github.com/scrapinghub/splash/issues/892
I would just like to make clear that I am not using any proxies or other middlewares.
@Erhanjinn Is the code above your actual code? I noticed it uses Request instead of SplashRequest.
Yes, it is.
In the Splash documentation you say adding a splash key to meta is enough. I tested this and it seemed to work.
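For reference, the bare-Request approach amounts to building this meta dict by hand (a sketch; 'single_slot' is assumed to be the string value behind scrapy_splash.SlotPolicy.SINGLE_SLOT):

```python
# Hand-built splash meta for a plain scrapy.Request (sketch).
# 'single_slot' is assumed to be the value of SlotPolicy.SINGLE_SLOT.
meta = {
    'splash': {
        'args': {'wait': 15},          # give the page up to 15 s to render JS
        'endpoint': 'render.html',     # ask Splash for the rendered HTML
        'slot_policy': 'single_slot',  # funnel all Splash requests into one slot
    }
}
print(meta['splash']['endpoint'])  # → render.html
```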