oss-gate / workshop

A repository for workshops that encourage people who have never participated in OSS development, or who have participated but are not yet confident

OSS Gate Workshop: Tokyo: 2018-06-09: treby: Scrapy: Work log #881

Closed: treby closed this issue 6 years ago

treby commented 6 years ago

This is a work log of an "OSS Gate workshop". The "OSS Gate workshop" is an activity to increase the number of OSS developers. The discussion here is in Japanese. Thanks.

Instructions for creating a work log

Fill in the template below and set it as the issue title. Scroll down for an example of how to fill it in.

OSS Gate Workshop: ${LOCATION}: ${YEAR}-${MONTH}-${DAY}: ${ACCOUNT_NAME}: ${OSS_NAME}: Work log

Example title:

OSS Gate Workshop: Tokyo: 2017-01-16: kou: Rabbit: Work log

Related OSS Gate workshop information

treby commented 6 years ago

I'll try to fetch the list of GitHub Issues. First, set up the environment with virtualenv: http://o-tomox.hatenablog.com/entry/2013/07/18/231204
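For reference, a minimal sketch (not from the original session) of where this is headed: listing a repository's issues via the GitHub REST API. The repository path below is only an example.

```
# Minimal sketch (assumption, not from the original log): fetch a
# repository's open issues from the GitHub REST API with requests.
import requests

def fetch_issues(owner, repo, state='open'):
    url = 'https://api.github.com/repos/{}/{}/issues'.format(owner, repo)
    response = requests.get(url, params={'state': state})
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()       # a list of issue dicts

if __name__ == '__main__':
    # Example repository; the API returns the first page (30 issues) by default.
    for issue in fetch_issues('oss-gate', 'workshop'):
        print(issue['number'], issue['title'])
```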

treby commented 6 years ago

virtualenv already seems to be installed on my machine.

treby commented 6 years ago

I want to run `mkvirtualenv --python=/usr/local/bin/python3 workon-scrapy`.
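For context, the intended virtualenvwrapper flow would look roughly like this (a sketch, assuming virtualenvwrapper is installed and sourced in the shell):

```
# Sketch of the virtualenvwrapper flow (assumes virtualenvwrapper is set up)
mkvirtualenv --python=/usr/local/bin/python3 workon-scrapy  # create the env
workon workon-scrapy                                        # re-activate it later
deactivate                                                  # leave the env
```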

```
❯ which python3
python3 not found

~/myprj/scrapy
❯ brew install python3
Updating Homebrew...
```
treby commented 6 years ago

brew fell over...

```
==> Auto-updated Homebrew!
Updated 2 taps (codekitchen/dinghy, homebrew/core).
==> New Formulae
cointop                                                                                  mysql-client
==> Updated Formulae
harfbuzz ✔                   conan                        glibmm                       libspectrum                  percona-xtrabackup           snakemake
imagemagick ✔                conjure-up                   gnupg                        libvirt                      pgroonga                     sqldiff
node ✔                       dbhash                       gnuplot                      libxmlsec1                   php-code-sniffer             sqlite-analyzer
numpy ✔                      django-completion            go                           libzdb                       php-cs-fixer                 ssh-vault
sqlite ✔                     dmd                          go@1.9                       llnode                       phpunit                      storm
agda                         dnscrypt-proxy               godep                        macvim                       picard-tools                 streamlink
amazon-ecs-cli               docker-swarm                 goenv                        maxwell                      pkcs11-helper                syncthing
angular-cli                  draco                        gopass                       mercurial                    pngquant                     sysbench
annie                        druid                        goreleaser                   midnight-commander           pony-stable                  tarsnap-gui
ansible                      dscanner                     gradle                       mill                         ponyc                        telegraf
arangodb                     dynare                       hadolint                     minetest                     ppsspp                       teleport
arpack                       emscripten                   homeshick                    mint                         proj                         tkdiff
artifactory                  erlang                       hub                          monero                       prometheus                   tokei
aws-sdk-cpp                  etcd                         hydra                        mongo-c-driver               pstoedit                     traefik
azure-cli                    exiftool                     imagemagick@6                mongo-cxx-driver             puzzles                      typescript
basex                        faas-cli                     immortal                     mydumper                     qcachegrind                  uhd
bash                         fio                          innotop                      mysql++                      qrencode                     vault
bazel                        firebase-cli                 inspectrum                   mysql-connector-c++          quicktype                    vert.x
bento4                       flow                         iozone                       mytop                        rocksdb                      webpack
bibutils                     fn                           jenkins                      nano                         rubberband                   winetricks
bro                          folly                        jenkins-job-builder          nanomsg                      rust                         wolfssl
bzt                          fribidi                      jenkins-lts                  nginx                        sbt@0.13                     wslay
cayley                       frugal                       jfrog-cli-go                 node-build                   securefs                     xonsh
certbot                      fuse-emulator                kitchen-sync                 ntl                          shellharden                  xtensor
ceylon                       futhark                      kompose                      ntopng                       singular                     youtube-dl
chronograf                   gist                         kubernetes-cli               odpi                         siril                        zabbix
clblast                      git-town                     libplctag                    ohcount                      sjk
codekitchen/dinghy/dinghy    gitlab-gem                   librealsense                 percona-toolkit              skaffold
==> Deleted Formulae
luciddb

Error: python 2.7.13 is already installed
To upgrade to 3.6.5, run `brew upgrade python`
```
treby commented 6 years ago
```
~/myprj/scrapy
❯ brew upgrade python
==> Upgrading 1 outdated package, with result:
python 2.7.13 -> 3.6.5
==> Upgrading python
==> Installing dependencies for python: sqlite, xz
==> Installing python dependency: sqlite
==> Downloading https://homebrew.bintray.com/bottles/sqlite-3.24.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring sqlite-3.24.0.high_sierra.bottle.tar.gz
==> Caveats
Homebrew has detected an existing SQLite history file that was created
with the editline library. The current version of this formula is
built with Readline. To back up and convert your history file so that
it can be used with Readline, run:

  sed -i~ 's/\\040/ /g' ~/.sqlite_history

before using the `sqlite` command-line tool again. Otherwise, your
history will be lost.

This formula is keg-only, which means it was not symlinked into /usr/local,
because macOS provides an older sqlite3.

If you need to have this software first in your PATH run:
  echo 'export PATH="/usr/local/opt/sqlite/bin:$PATH"' >> ~/.zshrc

For compilers to find this software you may need to set:
    LDFLAGS:  -L/usr/local/opt/sqlite/lib
    CPPFLAGS: -I/usr/local/opt/sqlite/include
For pkg-config to find this software you may need to set:
    PKG_CONFIG_PATH: /usr/local/opt/sqlite/lib/pkgconfig

==> Summary
🍺  /usr/local/Cellar/sqlite/3.24.0: 11 files, 3.5MB
==> Installing python dependency: xz
==> Downloading https://homebrew.bintray.com/bottles/xz-5.2.4.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring xz-5.2.4.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/xz/5.2.4: 92 files, 1MB
```
treby commented 6 years ago

Looks like it went in.
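A quick way to verify (not part of the original log) would be:

```
❯ which python3        # should now resolve to the Homebrew install
❯ python3 --version    # expect Python 3.6.5 per the upgrade log above
❯ pip3 --version       # shows which interpreter pip3 is tied to
```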

treby commented 6 years ago

From https://scrapy.org/:

pip install scrapy

Let's try it.

treby commented 6 years ago
Execution log:

```
❯ pip install scrapy
Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/db/9c/cb15b2dc6003a805afd21b9b396e0e965800765b51da72fe17cf340b9be2/Scrapy-1.5.0-py2.py3-none-any.whl (251kB)
    100% |████████████████████████████████| 256kB 188kB/s
Requirement already satisfied: six>=1.5.2 in /usr/local/lib/python2.7/site-packages (from scrapy) (1.11.0)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting queuelib (from scrapy)
  Downloading https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/12/2a/e9e4fb2e6b2f7a75577e0614926819a472934b0b85f205ba5d5d2add54d0/Twisted-18.4.0.tar.bz2 (3.0MB)
    100% |████████████████████████████████| 3.0MB 178kB/s
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz
Collecting cssselect>=0.9 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl
Collecting pyOpenSSL (from scrapy)
  Downloading https://files.pythonhosted.org/packages/96/af/9d29e6bd40823061aea2e0574ccb2fcf72bfd6130ce53d32773ec375458c/pyOpenSSL-18.0.0-py2.py3-none-any.whl (53kB)
    100% |████████████████████████████████| 61kB 208kB/s
Collecting parsel>=1.1 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/bc/b4/2fd37d6f6a7e35cbc4c2613a789221ef1109708d5d4fb9fd5f6f721a43c9/parsel-1.4.0-py2.py3-none-any.whl
Collecting service-identity (from scrapy)
  Downloading https://files.pythonhosted.org/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl
Collecting lxml (from scrapy)
  Downloading https://files.pythonhosted.org/packages/18/95/abf8204fbbc9a01e0e156029cd1ee974237b5798b9e84477df6c4fabfbd2/lxml-4.2.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.8MB)
    100% |████████████████████████████████| 8.8MB 223kB/s
Collecting zope.interface>=4.4.2 (from Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/ac/8a/657532df378c2cd2a1fe6b12be3b4097521570769d4852ec02c24bd3594e/zope.interface-4.5.0.tar.gz (151kB)
    100% |████████████████████████████████| 153kB 194kB/s
Collecting constantly>=15.1 (from Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/f5/1d/c98a587dc06e107115cf4a58b49de20b19222c83d75335a192052af4c4b7/incremental-17.5.0-py2.py3-none-any.whl
Collecting Automat>=0.3.0 (from Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/17/6a/1baf488c2015ecafda48c03ca984cf0c48c254622668eb1732dbe2eae118/Automat-0.6.0-py2.py3-none-any.whl
Collecting hyperlink>=17.1.1 (from Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/a7/b6/84d0c863ff81e8e7de87cff3bd8fd8f1054c227ce09af1b679a8b17a9274/hyperlink-18.0.0-py2.py3-none-any.whl
Collecting cryptography>=2.2.1 (from pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/58/c1/23bea66007d4be75ce02056fac665f9a207535e89fb3c7931420fa4a5f57/cryptography-2.2.2-cp27-cp27m-macosx_10_6_intel.whl (1.5MB)
    100% |████████████████████████████████| 1.5MB 152kB/s
Collecting attrs (from service-identity->scrapy)
  Downloading https://files.pythonhosted.org/packages/41/59/cedf87e91ed541be7957c501a92102f9cc6363c623a7666d69d51c78ac5b/attrs-18.1.0-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->scrapy)
  Downloading https://files.pythonhosted.org/packages/e9/51/bcd96bf6231d4b2cc5e023c511bee86637ba375c44a6f9d1b4b7ad1ce4b9/pyasn1_modules-0.2.1-py2.py3-none-any.whl (60kB)
    100% |████████████████████████████████| 61kB 143kB/s
Requirement already satisfied: pyasn1 in /usr/local/lib/python2.7/site-packages (from service-identity->scrapy) (0.4.2)
Requirement already satisfied: setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (39.1.0)
Collecting idna>=2.5 (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/27/cc/6dd9a3869f15c2edfab863b992838277279ce92663d334df9ecf5106f5c6/idna-2.6-py2.py3-none-any.whl (56kB)
    100% |████████████████████████████████| 61kB 142kB/s
Collecting cffi>=1.7; platform_python_implementation != "PyPy" (from cryptography>=2.2.1->pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/7e/4a/b647e46faaa2dcfb16069b6aad2d8509982fd63710a325b8ad7db80f18be/cffi-1.11.5-cp27-cp27m-macosx_10_6_intel.whl (238kB)
    100% |████████████████████████████████| 245kB 175kB/s
Collecting enum34; python_version < "3" (from cryptography>=2.2.1->pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl
Collecting asn1crypto>=0.21.0 (from cryptography>=2.2.1->pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/ea/cd/35485615f45f30a510576f1a56d1e0a7ad7bd8ab5ed7cdc600ef7cd06222/asn1crypto-0.24.0-py2.py3-none-any.whl (101kB)
    100% |████████████████████████████████| 102kB 180kB/s
Collecting ipaddress; python_version < "3" (from cryptography>=2.2.1->pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/fc/d0/7fc3a811e011d4b388be48a0e381db8d990042df54aa4ef4599a31d39853/ipaddress-1.0.22-py2.py3-none-any.whl
Collecting pycparser (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.2.1->pyOpenSSL->scrapy)
  Downloading https://files.pythonhosted.org/packages/8c/2d/aad7f16146f4197a11f8e91fb81df177adcc2073d36a17b1491fd09df6ed/pycparser-2.18.tar.gz (245kB)
    100% |████████████████████████████████| 256kB 201kB/s
Building wheels for collected packages: Twisted, PyDispatcher, zope.interface, pycparser
  Running setup.py bdist_wheel for Twisted ... done
  Stored in directory: /Users/treby/Library/Caches/pip/wheels/b3/76/f7/85353c829c0881e74b5366ce0ed59042b098bb4903e2da8828
  Running setup.py bdist_wheel for PyDispatcher ... done
  Stored in directory: /Users/treby/Library/Caches/pip/wheels/88/99/96/cfef6665f9cb1522ee6757ae5955feedf2fe25f1737f91fa7f
  Running setup.py bdist_wheel for zope.interface ... done
  Stored in directory: /Users/treby/Library/Caches/pip/wheels/c6/b2/d2/be6785a207eaa58d76debc10c9d5c66196b40a88abb61d6af7
  Running setup.py bdist_wheel for pycparser ... done
  Stored in directory: /Users/treby/Library/Caches/pip/wheels/c0/a1/27/5ba234bd77ea5a290cbf6d675259ec52293193467a12ef1f46
Successfully built Twisted PyDispatcher zope.interface pycparser
Installing collected packages: w3lib, queuelib, zope.interface, constantly, incremental, attrs, Automat, idna, hyperlink, Twisted, PyDispatcher, cssselect, pycparser, cffi, enum34, asn1crypto, ipaddress, cryptography, pyOpenSSL, lxml, parsel, pyasn1-modules, service-identity, scrapy
Successfully installed Automat-0.6.0 PyDispatcher-2.0.5 Twisted-18.4.0 asn1crypto-0.24.0 attrs-18.1.0 cffi-1.11.5 constantly-15.1.0 cryptography-2.2.2 cssselect-1.0.3 enum34-1.1.6 hyperlink-18.0.0 idna-2.6 incremental-17.5.0 ipaddress-1.0.22 lxml-4.2.1 parsel-1.4.0 pyOpenSSL-18.0.0 pyasn1-modules-0.2.1 pycparser-2.18 queuelib-1.5.0 scrapy-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
```
treby commented 6 years ago

Continuing with https://scrapy.org/.

treby commented 6 years ago
```
❯ cat helloworld.py
import scrapy

class BlogSpider(scrapy.Spider):
  name = 'blogspider'                              # spider identifier
  start_urls = ['https://blog.scrapinghub.com']    # where the crawl starts

  def parse(self, response):
    # Yield the title of each blog post on the current page
    for title in response.css('h2.entry-title'):
      yield {'title': title.css('a ::text').extract_first()}

    # Follow the link to the next (older) page and parse it the same way
    for next_page in response.css('div.prev-post > a'):
      yield response.follow(next_page, self.parse)
```
Run:

```
❯ scrapy runspider helloworld.py
2018-06-09 14:35:24 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-09 14:35:24 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.15 (default, May 29 2018, 13:17:30) - [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-06-09 14:35:24 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-06-09 14:35:24 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-06-09 14:35:24 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-09 14:35:24 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-09 14:35:24 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-06-09 14:35:24 [scrapy.core.engine] INFO: Spider opened
2018-06-09 14:35:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-09 14:35:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-09 14:35:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Want to Predict Fitbit\u2019s Quarterly Revenue? Eagle Alpha Did It Using Web Scraped Product Data'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'How Data Compliance Companies Are Turning To Web Crawlers To Take Advantage of the GDPR Business Opportunity'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Looking Back at 2017'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'A Faster, Updated Scrapinghub'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Scraping the Steam Game Store with Scrapy'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Do Androids Dream of Electric Sheep?'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Deploy your Scrapy Spiders from GitHub'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'Looking Back at 2016'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'How to Increase Sales with Online Reputation Management'}
2018-06-09 14:35:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com> {'title': u'How to Build your own Price Monitoring Tool'}
2018-06-09 14:35:27 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com)
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'How You Can Use Web Data to Accelerate Your Startup'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'An Introduction to XPath: How to Get Started'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'Why Promoting Open Data Increases Economic Opportunities'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'Interview: How Up Hail uses Scrapy to Increase Transparency'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'How to Run Python Scripts in Scrapy Cloud'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'Embracing the Future of Work: How To Communicate Remotely'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'How to Deploy Custom Docker Images for Your Web Crawlers'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'Improved Frontera: Web Crawling at Scale with Python 3 Support'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'How to Crawl the Web Politely with Scrapy'}
2018-06-09 14:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/2/> {'title': u'Introducing Scrapy Cloud with Python 3 Support'}
2018-06-09 14:35:28 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/2/)
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'What the Suicide Squad Tells Us About Web Data'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'This Month in Open Source at Scrapinghub August 2016'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Meet Parsel: the Selector Library behind Scrapy'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Incremental Crawls with Scrapy and DeltaFetch'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Improving Access to Peruvian Congress Bills with Scrapy'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Scrapely: The Brains Behind Portia Spiders'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Introducing Portia2Code: Portia Projects into Scrapy Spiders'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Scraping Infinite Scrolling Pages'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'This Month in Open Source at Scrapinghub June 2016'}
2018-06-09 14:35:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/3/> {'title': u'Introducing the Datasets Catalog'}
2018-06-09 14:35:30 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/3/)
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Introducing the Crawlera Dashboard'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Data Extraction with Scrapy and Python 3'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'How to Debug your Scrapy Spiders'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Scrapy + MonkeyLearn: Textual Analysis of Web Data'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Introducing Scrapy Cloud 2.0'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'A (not so) Short Story on Getting Decent Internet Access'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Scraping Websites Based on ViewStates with Scrapy'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Machine Learning with Web Scraping: New MonkeyLearn Addon'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Mapping Corruption in the Panama Papers with Open Data'}
2018-06-09 14:35:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/4/> {'title': u'Web Scraping to Create Open Data'}
2018-06-09 14:35:31 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/4/)
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Scrapy Tips from the Pros: March 2016 Edition'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'This Month in Open Source at Scrapinghub March 2016'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Join Scrapinghub for Google Summer of Code 2016'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'How Web Scraping is Revealing Lobbying and Corruption in Peru'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Splash 2.0 Is Here with Qt 5 and Python 3'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Migrate your Kimono Projects to Portia'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Scrapy Tips from the Pros: February 2016 Edition'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Portia: The Open Source Alternative to Kimono Labs'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Web Scraping Finds Stores Guilty of Price Inflation'}
2018-06-09 14:35:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/5/> {'title': u'Python 3 is Coming to Scrapy'}
2018-06-09 14:35:33 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/5/)
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Happy Anniversary: Scrapinghub Turns 5'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Scrapy Tips from the Pros: Part 1'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Vizlegal: Rise of Machine-Readable Laws and Court Judgments'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Christmas Eve vs New Year\u2019s Eve: Last Minute Price Inflation?'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Looking Back at 2015'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Winter Sales Showdown: Black Friday vs Cyber Monday vs Green Monday'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Chats With RINAR Solutions'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Black Friday, Cyber Monday: Are They Worth It?'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Tips for Creating a Cohesive Company Culture Remotely'}
2018-06-09 14:35:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/6/> {'title': u'Parse Natural Language Dates with Dateparser'}
2018-06-09 14:35:34 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/6/)
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Aduana: Link Analysis to Crawl the Web at Scale'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Scrapy on the Road to Python 3 Support'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Introducing Javascript support for Portia'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Distributed Frontera: Web Crawling at Scale'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'The Road to Loading JavaScript in Portia'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'EuroPython 2015'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'StartupChats Remote Working Q&A'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'PyCon Philippines 2015'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Google Summer of Code 2015'}
2018-06-09 14:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/7/> {'title': u'Link Analysis Algorithms Explained'}
2018-06-09 14:35:35 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/7/)
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'EuroPython, here we go!'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Using git to manage vacations in a large distributed team'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Gender Inequality Across Programming Languages'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Traveling Tips for Remote Workers'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'A Career in Remote Working'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Frontera: The Brain Behind the Crawls'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Scrape Data Visually with Portia and Scrapy Cloud'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Scrapinghub: A Remote Working Success Story'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'Why we moved to Slack'}
2018-06-09 14:35:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/8/> {'title': u'The History of Scrapinghub'}
2018-06-09 14:35:36 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/8/)
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Skinfer: A Tool for Inferring JSON Schemas'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Handling JavaScript in Scrapy with Splash'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Scrapinghub Crawls the Deep Web'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'New Changes to Our Scrapy Cloud Platform'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Introducing ScrapyRT: An API for Scrapy spiders'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Looking Back at 2014'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'XPath Tips from the Web Scraping Trenches'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Introducing Data Reviews'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Extracting schema.org Microdata Using Scrapy Selectors and XPath'}
2018-06-09 14:35:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/9/> {'title': u'Announcing Portia, the Open Source Visual Web Scraper!'}
2018-06-09 14:35:38 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/9/)
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Optimizing Memory Usage of Scikit-Learn Models Using Succinct Tries'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Open Source at Scrapinghub'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Looking Back at 2013'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Marcos Campal Is a ScrapingHubber!'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Introducing Dash'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Why MongoDB Is a Bad Choice for Storing Our Scraped Data'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Introducing Crawlera, a Smart Page Downloader'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Git Workflow for Scrapy Projects'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'How to Fill Login Forms Automatically'}
2018-06-09 14:35:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/10/> {'title': u'Spiders activity graphs'}
2018-06-09 14:35:39 [scrapy.core.engine] DEBUG: Crawled (200) (referer: https://blog.scrapinghub.com/page/10/)
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Finding Similar Items'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Scrapy 0.15 dropping support for Python 2.5'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Autoscraping casts a wider net'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Scrapy 0.14 released'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Dirbot \u2013 a new example Scrapy project'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Introducing w3lib and scrapely'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Scrapy 0.12 released'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Spoofing your Scrapy bot IP using tsocks'}
2018-06-09 14:35:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com/page/11/> {'title': u'Hello, world'}
2018-06-09 14:35:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-09 14:35:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2944,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 127678,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 11,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 9, 5, 35, 39, 382615),
 'item_scraped_count': 109,
 'log_count/DEBUG': 121,
 'log_count/INFO': 7,
 'memusage/max': 50331648,
 'memusage/startup': 50327552,
 'request_depth_max': 10,
 'response_received_count': 11,
 'scheduler/dequeued': 11,
 'scheduler/dequeued/memory': 11,
 'scheduler/enqueued': 11,
 'scheduler/enqueued/memory': 11,
 'start_time': datetime.datetime(2018, 6, 9, 5, 35, 24, 377907)}
2018-06-09 14:35:39 [scrapy.core.engine] INFO: Spider closed (finished)
```
treby commented 6 years ago

Looking good. So Scrapy is a framework, then.

treby commented 6 years ago

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

For more information including a list of features check the Scrapy homepage at: https://scrapy.org
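For a quick feel of the extraction side described above, Scrapy's interactive shell can be used. A sketch (not from the original log; the CSS selector is the one used in helloworld.py):

```
❯ scrapy shell 'https://blog.scrapinghub.com'
>>> response.css('h2.entry-title a ::text').extract_first()
```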

treby commented 6 years ago
```
❯ python helloworld.py
```

So it can't just be run as-is. Well, the file only defines a class, so that's to be expected.
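For reference, a sketch of what would make `python helloworld.py` work directly: Scrapy provides CrawlerProcess for running spiders from a plain script. The snippet below is an assumption about how one could append it to helloworld.py, not something done in this session.

```
# Sketch: appended to helloworld.py so that `python helloworld.py` runs the
# spider directly instead of requiring `scrapy runspider`.
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()    # default settings; no Scrapy project needed
    process.crawl(BlogSpider)     # schedule the spider class defined above
    process.start()               # start the reactor and block until done
```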