hesussavas / corners_stats

Statistical project on soccer corners

No results from "sudo make start_scraping" #3

Closed: u015216 closed this issue 7 years ago

u015216 commented 7 years ago

Okay, the additional sleep time did the trick for the connection issue. The application now seems to run without errors, but I don't believe any results are being returned; it may not be scraping the pages at all. Below is the log from the command line (running: sudo make start_scraping). Would it be possible for you to review it and let me know whether this is just a user error on my side?

docker build --file=Dockerfile -t corners/bash:dev .
Sending build context to Docker daemon 64.43MB
Step 1/8 : FROM python:3.6
 ---> c5700ee6fe7b
Step 2/8 : RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends python-pip
 ---> Using cache
 ---> 6cbb5353ee2d
Step 3/8 : COPY requirements.txt /tmp/
 ---> Using cache
 ---> 866ef44e2a1d
Step 4/8 : ENV MPLBACKEND "agg"
 ---> Using cache
 ---> f42f5740388f
Step 5/8 : RUN pip install -r /tmp/requirements.txt
 ---> Using cache
 ---> 6b4e3881ce3a
Step 6/8 : COPY . /opt/corners
 ---> Using cache
 ---> e8ab01dbaa96
Step 7/8 : WORKDIR /opt/corners
 ---> Using cache
 ---> 5ec54d6c9858
Step 8/8 : CMD bash
 ---> Using cache
 ---> 6e4afe087133
Successfully built 6e4afe087133
Successfully tagged corners/bash:dev
docker run -d -e POSTGRES_PASSWORD=corners -e POSTGRES_USER=corners -e POSTGRES_DB=corners -p 8432:5432 --name corners-postgres postgres:9.5
0f2a214ff19d561f69725586d510323e323be704745eb8162199492990f18b1a
sleep 20
docker run --rm -i --link corners-postgres -e DEV_PSQL_URI=postgresql://corners:corners@corners-postgres:5432/corners corners/bash:dev ./start.sh
2017-07-19 18:15:06 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: corners442)
2017-07-19 18:15:06 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'corners442', 'CONCURRENT_REQUESTS': 4, 'CONCURRENT_REQUESTS_PER_DOMAIN': 2, 'CONCURRENT_REQUESTS_PER_IP': 2, 'DOWNLOAD_DELAY': 5, 'NEWSPIDER_MODULE': 'corners442.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['corners442.spiders'], 'USER_AGENT': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
2017-07-19 18:15:07 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2017-07-19 18:15:07 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-19 18:15:07 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-19 18:15:07 [scrapy.middleware] INFO: Enabled item pipelines: ['corners442.pipelines.LeaguePipeline']
2017-07-19 18:15:07 [scrapy.core.engine] INFO: Spider opened
2017-07-19 18:15:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 18:15:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 18:15:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fourfourtwo.com/robots.txt> (referer: None)
2017-07-19 18:15:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fourfourtwo.com/> (referer: None)
2017-07-19 18:15:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 18:15:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 17119, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 7, 19, 18, 15, 8, 263953), 'log_count/DEBUG': 3, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2017, 7, 19, 18, 15, 7, 329204)}
2017-07-19 18:15:08 [scrapy.core.engine] INFO: Spider closed (finished)
2017-07-19 18:15:19 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: corners442)
2017-07-19 18:15:19 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'corners442', 'CONCURRENT_REQUESTS': 4, 'CONCURRENT_REQUESTS_PER_DOMAIN': 2, 'CONCURRENT_REQUESTS_PER_IP': 2, 'DOWNLOAD_DELAY': 5, 'NEWSPIDER_MODULE': 'corners442.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['corners442.spiders'], 'USER_AGENT': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}
2017-07-19 18:15:19 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2017-07-19 18:15:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-19 18:15:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-19 18:15:19 [scrapy.middleware] INFO: Enabled item pipelines: ['corners442.pipelines.Corners442Pipeline']
2017-07-19 18:15:19 [scrapy.core.engine] INFO: Spider opened
2017-07-19 18:15:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 18:15:19 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 18:15:19 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 18:15:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 7, 19, 18, 15, 19, 517179), 'log_count/DEBUG': 1, 'log_count/INFO': 7, 'start_time': datetime.datetime(2017, 7, 19, 18, 15, 19, 500297)}
2017-07-19 18:15:19 [scrapy.core.engine] INFO: Spider closed (finished)
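
The stats dump above is the telltale part: only one request was ever enqueued ('scheduler/enqueued': 1), the front page came back 200, and the spider closed as "finished" with zero items, so the crawl ended cleanly rather than erroring. A minimal sketch of that failure mode (a hypothetical spider and selector, not the project's actual code, written against the Scrapy 1.3 API shown in the log):

import scrapy


class LeaguesSketchSpider(scrapy.Spider):
    # Sketch only: illustrates why a wrong start URL makes a crawl
    # finish cleanly with zero items scraped.
    name = "leagues_sketch"
    # Pointing start_urls at the generic front page instead of the
    # leagues/seasons listing reproduces the log above.
    start_urls = ["https://www.fourfourtwo.com/"]

    def parse(self, response):
        # Hypothetical selector: the front page contains no such links,
        # so this loop yields no requests, the scheduler drains, and
        # Scrapy logs "Closing spider (finished)" with 0 items.
        for href in response.css("a.league::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_league)

    def parse_league(self, response):
        yield {"league": response.css("h1::text").extract_first()}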

hesussavas commented 7 years ago

Sorry, my bad. Somehow I had replaced the URL that fetches the leagues and seasons with the fourfourtwo start page. Fixed it with this commit: https://github.com/hesussavas/corners_stats/commit/8001758bad021aec56b05439f6ca7fee8df4ffe1. Please try again.
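
For anyone hitting the same symptom, the shape of the fix is simply pointing the spider back at the listing page. A hypothetical before/after sketch (the URLs here are illustrative and not copied from the repo; the real change is in the commit linked above):

import scrapy


class LeaguesSpider(scrapy.Spider):
    name = "leagues"

    # Before (broken): the generic front page, which has none of the
    # leagues/seasons markup that parse() walks:
    # start_urls = ["https://www.fourfourtwo.com/"]

    # After (fixed): start from the page that actually lists leagues
    # and seasons, so follow-up requests get enqueued.
    start_urls = ["https://www.fourfourtwo.com/statszone"]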

u015216 commented 7 years ago

The scrape took a while (almost 2 days), but both the scrape and the analysis worked flawlessly after the updates you made.
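
That duration is consistent with the throttling visible in the log (DOWNLOAD_DELAY of 5 seconds, at most 2 concurrent requests per domain). A back-of-envelope estimate, with an assumed page count purely for illustration:

DOWNLOAD_DELAY = 5    # seconds between requests (from the log above)
pages = 30_000        # assumed number of pages, for illustration only

# Scrapy randomizes the delay (0.5x-1.5x by default), so treat this as
# the mean pace for a single domain with requests effectively serialized.
days = pages * DOWNLOAD_DELAY / 86_400
print("~%.1f days" % days)   # ~1.7 days, in line with "almost 2 days"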

Greatly appreciated that you shared this!

Out of curiosity, would you be interested in performing other scrapes if given the opportunity?

hesussavas commented 7 years ago

Glad you've succeeded with it! And thanks for testing it.

About other scrapes: it depends, but I'm definitely not eager. My original intention was just to get acquainted with the Scrapy library and nothing more. So the initial mission is complete :)