scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

Concurrency is not handled properly #213

Closed Erhanjinn closed 4 years ago

Erhanjinn commented 5 years ago

Hello,

I would like to use scrapy-splash to scrape multiple pages from one domain in parallel. The site uses JavaScript to render some of the content I am interested in.

However, the responses I get are mixed up. When setting

CONCURRENT_REQUESTS = 2

or even

CONCURRENT_REQUESTS = 1

the responses get mixed and are not 100% correct.

I am creating the requests as follows:

yield scrapy.Request(url,
                     self.parse,
                     headers={'User-Agent': self.custom_user_agent},
                     meta={'splash': {'args': {'wait': 15},
                                      'endpoint': 'render.html',
                                      'slot_policy': SlotPolicy.SINGLE_SLOT}})

I tried setting slot_policy to both SINGLE_SLOT and PER_DOMAIN, and neither helped.

What more should I set when scraping only one domain?

Thanks,

Jan

JavierRuano commented 5 years ago

Try to configure a splash instance per Core with aquarium, and then write to us again. (The number of instances determines how many requests can be rendered concurrently.)

https://github.com/TeamHG-Memex/aquarium

Erhanjinn commented 5 years ago

Thank you for your reply.

I followed the guide you have provided, but I have a question. Is there anything special I have to do to connect to this aquarium splash instance? Using SPLASH_URL = 'http://0.0.0.0:8050' in settings.py doesn't work.

Now, instead of

docker run -p 8050:8050 scrapinghub/splash

I ran

docker-compose up

in the aquarium folder that was created, and I can see it is live:

splash1_1  | 2019-03-29 17:20:37.065053 [-] "172.20.0.6" - - [29/Mar/2019:17:20:36 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1  | 2019-03-29 17:20:37.370812 [-] "172.20.0.6" - - [29/Mar/2019:17:20:36 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash0_1  | 2019-03-29 17:20:38.733114 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash1_1  | 2019-03-29 17:20:39.072299 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1  | 2019-03-29 17:20:39.372951 [-] "172.20.0.6" - - [29/Mar/2019:17:20:38 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash0_1  | 2019-03-29 17:20:40.735392 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash1_1  | 2019-03-29 17:20:41.075256 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"
splash2_1  | 2019-03-29 17:20:41.379273 [-] "172.20.0.6" - - [29/Mar/2019:17:20:40 +0000] "GET / HTTP/1.0" 200 7677 "-" "-"

However, my crawler cannot connect. It fails with the error below (running it after docker run -p 8050:8050 scrapinghub/splash works fine):

AttributeError: 'XYZSpider' object has no attribute 'crawler'

Also, I am running my scrapy-splash project inside VirtualBox. The VM has 2 CPU cores assigned from the host PC. I created the aquarium setup with 3 instances of Splash. Is that what you meant? I basically left all the options at their defaults, as I did not fully understand what you meant by "Try to configure a splash instance per Core with aquarium".
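
One way to separate a connectivity problem from a spider problem is to ping Splash directly, outside Scrapy. The sketch below uses only the standard library and Splash's /_ping HTTP endpoint. The function name splash_is_up is hypothetical, and the default address http://localhost:8050 is an assumption about where the Aquarium front end listens; adjust it to your setup.

```python
import json
import urllib.error
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=5.0):
    """Return True if GET <base_url>/_ping answers with status "ok".

    base_url is an assumed address, not something taken from the thread.
    """
    ping_url = base_url.rstrip("/") + "/_ping"
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

If this returns True for the same address that is set in SPLASH_URL, the Splash side is probably reachable and the AttributeError is more likely to come from the spider or project configuration.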

Gallaecio commented 5 years ago

Seems related to https://github.com/scrapinghub/splash/issues/892

Erhanjinn commented 5 years ago

I would just like to clarify that I am not using any proxies or other middlewares.

Gallaecio commented 5 years ago

@Erhanjinn Is the code above your actual code? I noticed it uses Request instead of SplashRequest.

Erhanjinn commented 5 years ago

> @Erhanjinn Is the code above your actual code? I noticed it uses Request instead of SplashRequest.

Yes, it is.

In the Splash documentation you say that adding the splash key to meta is enough. I tested this and it seemed to work.
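
For comparison, here is a sketch of the same request written with SplashRequest instead of a meta dict. The helper names splash_kwargs and make_splash_request are hypothetical. The scrapy_splash import is done lazily inside the function, so the module still loads where the package is not installed. This only shows the equivalent form and is not claimed to fix the mixed responses.

```python
def splash_kwargs(wait=15, endpoint="render.html"):
    """Keyword arguments for SplashRequest, mirroring the meta dict above."""
    return {"endpoint": endpoint, "args": {"wait": wait}}


def make_splash_request(url, callback, user_agent, wait=15):
    """Build a SplashRequest equivalent to the scrapy.Request in the question."""
    # Lazy import: requires scrapy-splash to be installed when called.
    from scrapy_splash import SplashRequest, SlotPolicy

    return SplashRequest(
        url,
        callback,
        headers={"User-Agent": user_agent},  # passed through to scrapy.Request
        slot_policy=SlotPolicy.SINGLE_SLOT,
        **splash_kwargs(wait),
    )
```

Inside a spider this would be used as yield make_splash_request(url, self.parse, self.custom_user_agent).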

Gallaecio commented 5 years ago

Have you configured your project as described in the README? Not configuring the Splash-specific duplicate filter may explain your issues.
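
For reference, a settings.py sketch along the lines of the scrapy-splash README. The SPLASH_URL value is a placeholder assumption; the middleware and duplicate-filter entries are the ones the README describes, but check them against the README for the version you have installed.

```python
# settings.py (sketch)

# Address of the Splash instance or load balancer; placeholder value.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter: takes the Splash arguments into account
# when computing request fingerprints.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Only relevant if Scrapy's HTTP cache is enabled.
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```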