my8100 / scrapydweb

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO :point_right: https://github.com/my8100/files

Twisted critical errors using scrapy-selenium with scrapydweb #164

Closed Tobeyforce closed 3 years ago

Tobeyforce commented 3 years ago

Describe the bug
I am trying out the scrapy-selenium package from https://github.com/clemfromspace/scrapy-selenium and have deployed a spider on scrapydweb that uses chromedriver. The spider works perfectly when run without scrapydweb, but for some reason when I run it through scrapydweb I get two critical errors at the end (although the items are successfully scraped). Scrapydweb seems to be closing the spider too fast; I have more than enough memory/CPU available. I don't encounter the same problem with another spider that also uses scrapy-selenium. Maybe some kind of timeout error? Perhaps both scrapydweb and scrapy-selenium try to close the connection? I'm very confused because sometimes I get the errors, and sometimes not.

Logs

2020-09-29 22:20:16 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:49951/session/83d8e2b2e340df315cf410314d904510/window {}
2020-09-29 22:20:16 [urllib3.connectionpool] DEBUG: http://127.0.0.1:49951 "DELETE /session/83d8e2b2e340df315cf410314d904510/window HTTP/1.1" 200 12
2020-09-29 22:20:16 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-09-29 22:20:16 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-29 22:20:16 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (2 items) in: file:///home/ubuntu/my-scraper/items/my_scraper/stockholm_spider/2020-09-29T22_20_05.jl
2020-09-29 22:20:16 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:49951/session/83d8e2b2e340df315cf410314d904510 {}
2020-09-29 22:20:16 [urllib3.connectionpool] DEBUG: http://127.0.0.1:49951 "DELETE /session/83d8e2b2e340df315cf410314d904510 HTTP/1.1" 200 14
2020-09-29 22:20:16 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-09-29 22:20:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 376192,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 6.508231,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 29, 22, 20, 16, 188822),
 'item_scraped_count': 2,
 'log_count/DEBUG': 49,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 62394368,
 'memusage/startup': 62394368,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 9, 29, 22, 20, 9, 680591)}
2020-09-29 22:20:16 [scrapy.core.engine] INFO: Spider closed (finished)
2020-09-29 22:20:16 [twisted] CRITICAL: Unhandled error in Deferred:
2020-09-29 22:20:16 [twisted] CRITICAL:
twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.


my8100 commented 3 years ago

It may be an issue of Scrapyd or scrapy-selenium.
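
If it is a shutdown race, one thing to try is making sure the webdriver is only quit once. The sketch below assumes the spider (or another handler) is also tearing down the driver that scrapy_selenium.SeleniumMiddleware already quits in its spider_closed hook; the SafeSeleniumMiddleware name and the guard are hypothetical, not part of either project.

```python
# Sketch of a workaround (assumption: both the spider and the
# scrapy-selenium middleware try to tear down the same webdriver,
# producing the double DELETE seen in the log above).
from scrapy_selenium import SeleniumMiddleware


class SafeSeleniumMiddleware(SeleniumMiddleware):
    """Quit the webdriver at most once during shutdown."""

    _driver_quit = False

    def spider_closed(self):
        if not self._driver_quit:
            self._driver_quit = True
            self.driver.quit()
```

To try it, point DOWNLOADER_MIDDLEWARES at this subclass instead of 'scrapy_selenium.SeleniumMiddleware'.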

Tobeyforce commented 3 years ago

> It may be an issue of Scrapyd or scrapy-selenium.

Alright, I will try to see if I can solve it from their end. Another question: is it possible to schedule a spider to run at an interval of 45 seconds? I tried using */45 in the seconds field of the extended timer settings. However, it runs like this:

2020-09-30 01:27:45
2020-09-30 01:28:00
2020-09-30 01:28:45
2020-09-30 01:29:00

Why does it run with a 15-second gap, then a 45-second one?

If I use */30 it runs at a 30-second interval as expected, but */45 does not work.

my8100 commented 3 years ago

Because both 0 and 45 are divisible by 45: in a cron seconds field, */45 matches every second in 0-59 that is divisible by 45, so it fires at second 0 and second 45 of every minute. Does it make sense to schedule a spider to run every 45 seconds? Why not do it in a loop in the spider instead?
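
As a quick illustration, here is a sketch using APScheduler's CronTrigger directly (the trigger type the cron docs linked below describe; the UTC timezone is just for the demo). It reproduces the exact times you posted:

```python
# Sketch: second='*/45' matches every second in 0-59 divisible by 45,
# i.e. 0 and 45, so runs alternate between 45 s and 15 s apart.
from datetime import datetime

import pytz
from apscheduler.triggers.cron import CronTrigger

tz = pytz.utc
trigger = CronTrigger(second='*/45', timezone=tz)

fire = tz.localize(datetime(2020, 9, 30, 1, 27, 0))
for _ in range(4):
    fire = trigger.get_next_fire_time(fire, fire)
    print(fire.strftime('%Y-%m-%d %H:%M:%S'))
# 2020-09-30 01:27:45
# 2020-09-30 01:28:00
# 2020-09-30 01:28:45
# 2020-09-30 01:29:00
```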

Tobeyforce commented 3 years ago

Thanks for the answer, I guess I don't really 100% understand how apscheduler works. In my case it makes sense (I think): I need to run spiders indefinitely every 45 seconds, except perhaps on weekends. I could go with every minute or every 30 seconds, but I'm trying to hit a sweet spot.

Is it possible to schedule every 45 seconds? If so, how?

my8100 commented 3 years ago

Maybe you can add 3 tasks?

The target fire times, counted in seconds: 0, 45, 90, 135, 180, …

In min:sec form: 0:00, 0:45, 1:30, 2:15, 3:00, 3:45, 4:30, 5:15, 6:00, 6:45, 7:30, 8:15, … The pattern repeats every 3 minutes, so three cron tasks cover it:

min: */3  sec: 0,45

min: 1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52,55,58  sec: 30

min: 2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59  sec: 15
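
As a sanity check, a sketch with APScheduler's CronTrigger showing the three tasks interleave into a strict 45-second cadence (the range expressions 1-58/3 and 2-59/3 are just shorthand for the explicit minute lists above):

```python
# Sketch: verify the three cron tasks combine into one run every 45 s.
from datetime import datetime

import pytz
from apscheduler.triggers.cron import CronTrigger

tz = pytz.utc
triggers = [
    CronTrigger(minute='*/3', second='0,45', timezone=tz),
    CronTrigger(minute='1-58/3', second='30', timezone=tz),
    CronTrigger(minute='2-59/3', second='15', timezone=tz),
]

prev = tz.localize(datetime(2020, 9, 30, 0, 0, 0))
for _ in range(8):
    # The next overall run is the earliest next fire time of any task.
    nxt = min(t.get_next_fire_time(prev, prev) for t in triggers)
    print(nxt.time(), '+%ds' % (nxt - prev).total_seconds())
    prev = nxt
# 00:00:45 +45s, 00:01:30 +45s, 00:02:15 +45s, 00:03:00 +45s, ...
```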

my8100 commented 3 years ago

FYI https://apscheduler.readthedocs.io/en/stable/modules/triggers/cron.html

Tobeyforce commented 3 years ago

Huge thanks, I really appreciate you taking the time. This worked; I'll just have to make multiple timers then. The only problem is I will have 300 timer tasks for 100 spiders, haha! Thanks again spiderman, you are the boss.