jackieli123723 opened 6 years ago
```
[root@lilidong /home/worker/python/pdf_crawler]# pip install Pillow
bash: pip: command not found
```

pip wasn't installed, so download and install it from source:

```
wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
tar -axf pip-1.5.4.tar.gz
cd pip-1.5.4/
python setup.py install
```

After installation, `pip -V` still errors:

```
bash: pip: command not found...
```

What now? Create a symlink. First find where pip was installed:

```
find / -name pip
```

then link it:

```
ln -sv /usr/local/python/bin/pip /usr/bin/pip
```

After that it works. Adjust the paths to match your own installation.

`/usr/local/bin` before the fix:

```
[root@lilidong /usr/local/bin]# ll
total 35340
lrwxrwxrwx 1 root root       39 Jan 31 07:19 forever -> ../lib/node_modules/forever/bin/forever
lrwxrwxrwx 1 root root       40 Jan 31 02:21 n -> /home/worker/node-v8.0.0-linux-x64/bin/n
-rwxr-xr-x 1 root root 36186806 Jan 31 03:00 node
lrwxrwxrwx 1 root root       38 Jan 31 03:00 npm -> ../lib/node_modules/npm/bin/npm-cli.js
lrwxrwxrwx 1 root root       38 Jan 31 02:35 npx -> ../lib/node_modules/npm/bin/npx-cli.js
lrwxrwxrwx 1 root root       42 Jan 31 02:16 pm2 -> /home/worker/node-v8.0.0-linux-x64/bin/pm2
lrwxrwxrwx 1 root root       45 Jan 31 02:20 rimraf -> /home/worker/node-v8.0.0-linux-x64/bin/rimraf
lrwxrwxrwx 1 root root       42 Jan 31 02:20 ssr -> /home/worker/node-v8.0.0-linux-x64/bin/ssr
lrwxrwxrwx 1 root root       42 Jan 31 02:27 webpack -> ../lib/node_modules/webpack/bin/webpack.js
```

The install and symlink steps, as they actually ran:

```
Processing dependencies for pip==1.5.4
Finished processing dependencies for pip==1.5.4
[root@lilidong /home/worker/python/pip-1.5.4]# find / -name pip
/home/worker/python/pip-1.5.4/pip
/home/worker/python/pip-1.5.4/build/lib/pip
/usr/bin/pip
/usr/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip
/.cache/pip
[root@lilidong /home/worker/python/pip-1.5.4]# ln -sv /usr/local/python/bin/pip /usr/bin/pip
ln: failed to create symbolic link '/usr/bin/pip': File exists
[root@lilidong /home/worker/python/pip-1.5.4]# ln -s /home/worker/python/pip-1.5.4/pip /usr/local/bin/pip
[root@lilidong /home/worker/python/pip-1.5.4]# cd /usr/local/bin/
[root@lilidong /usr/local/bin]# ll
total 35340
lrwxrwxrwx 1 root root       39 Jan 31 07:19 forever -> ../lib/node_modules/forever/bin/forever
lrwxrwxrwx 1 root root       40 Jan 31 02:21 n -> /home/worker/node-v8.0.0-linux-x64/bin/n
-rwxr-xr-x 1 root root 36186806 Jan 31 03:00 node
lrwxrwxrwx 1 root root       38 Jan 31 03:00 npm -> ../lib/node_modules/npm/bin/npm-cli.js
lrwxrwxrwx 1 root root       38 Jan 31 02:35 npx -> ../lib/node_modules/npm/bin/npx-cli.js
lrwxrwxrwx 1 root root       33 Feb 10 10:25 pip -> /home/worker/python/pip-1.5.4/pip
lrwxrwxrwx 1 root root       42 Jan 31 02:16 pm2 -> /home/worker/node-v8.0.0-linux-x64/bin/pm2
lrwxrwxrwx 1 root root       45 Jan 31 02:20 rimraf -> /home/worker/node-v8.0.0-linux-x64/bin/rimraf
lrwxrwxrwx 1 root root       42 Jan 31 02:20 ssr -> /home/worker/node-v8.0.0-linux-x64/bin/ssr
lrwxrwxrwx 1 root root       42 Jan 31 02:27 webpack -> ../lib/node_modules/webpack/bin/webpack.js
```

The symlink that actually stuck (since `/usr/bin/pip` already existed) was the one into `/usr/local/bin`, which puts pip on the PATH:

```
ln -s /home/worker/python/pip-1.5.4/pip /usr/local/bin/pip
```

Note: `pip -V` prints the version; `pip -v` without a command just prints the usage screen, as below. Either way, the command now resolves:

```
[root@lilidong /usr/local/bin]# pip -v
Usage: pip <command> [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  zip                         DEPRECATED. Zip individual packages.
  unzip                       DEPRECATED. Unzip individual packages.
  bundle                      DEPRECATED. Create pybundles.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  -v, --verbose               Give more output. Option is additive, and can be
                              used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log-file <path>           Path to a verbose non-appending log, that only
                              logs failures. This log is active by default at
                              /.pip/pip.log.
  --log <path>                Path to a verbose appending log. This log is
                              inactive by default.
  --proxy <proxy>             Specify a proxy in the form
                              [user:passwd@]proxy.server:port.
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists:
                              (s)witch, (i)gnore, (w)ipe, (b)ackup.
  --cert <path>               Path to alternate CA bundle.
```

With pip working, install the crawler's dependencies:

```
[root@lilidong /home/worker/python]# cd pdf_crawler/
[root@lilidong /home/worker/python/pdf_crawler]# ll
total 1564
-rw-r--r-- 1 root root    1369 Feb 10 09:27 crawler.py
-rw-r--r-- 1 root root 1595408 Nov  6  2016 get-pip.py
[root@lilidong /home/worker/python/pdf_crawler]# python crawler.py
Traceback (most recent call last):
  File "crawler.py", line 8, in <module>
    import requests
ImportError: No module named requests
[root@lilidong /home/worker/python/pdf_crawler]# pip install requests
Downloading/unpacking requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded
Downloading/unpacking certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded
Downloading/unpacking idna>=2.5,<2.7 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded
Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded
Installing collected packages: requests, certifi, idna, chardet, urllib3
  Found existing installation: chardet 2.2.1
    Uninstalling chardet:
      Successfully uninstalled chardet
Successfully installed requests certifi idna chardet urllib3
Cleaning up...
```
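The `ln: failed to create symbolic link '/usr/bin/pip': File exists` error above is what happens when linking onto a path that already exists; `ln -sf`, or removing the old entry first, avoids it. A minimal Python sketch of that replace-then-link step (the paths here are throwaway stand-ins, not `/usr/bin/pip`):

```python
import os
import tempfile

def ensure_symlink(target, link_path):
    """Create link_path -> target, replacing any existing file or link
    (roughly the equivalent of `ln -sf target link_path`)."""
    if os.path.lexists(link_path):   # lexists() also catches dangling symlinks
        os.remove(link_path)
    os.symlink(target, link_path)

# Demo with temporary paths instead of /usr/bin/pip:
d = tempfile.mkdtemp()
target = os.path.join(d, "pip-1.5.4-pip")
open(target, "w").close()
link = os.path.join(d, "pip")
open(link, "w").close()              # simulate the pre-existing /usr/bin/pip
ensure_symlink(target, link)         # plain os.symlink() here would raise FileExistsError
print(os.readlink(link) == target)   # True
```

On modern systems, `python -m pip` sidesteps this whole class of PATH/symlink problems, since it runs pip through the interpreter directly.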
```
[root@lilidong /home/worker/python/pdf_crawler]# python crawler.py
Traceback (most recent call last):
  File "crawler.py", line 9, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4
[root@lilidong /home/worker/python/pdf_crawler]# pip install bs4
Downloading/unpacking bs4
  Downloading bs4-0.0.1.tar.gz
  Running setup.py (path:/tmp/pip_build_root/bs4/setup.py) egg_info for package bs4
Downloading/unpacking beautifulsoup4 (from bs4)
  Downloading beautifulsoup4-4.6.0-py2-none-any.whl (86kB): 86kB downloaded
Installing collected packages: bs4, beautifulsoup4
  Running setup.py install for bs4
Successfully installed bs4 beautifulsoup4
```

Result — the crawler downloaded the lecture PDFs:

```
[root@lilidong /home/worker/python/pdf_crawler]# ll
total 24016
-rw-r--r-- 1 root root 2878464 Feb 10 10:33 01StableMatching.pdf
-rw-r--r-- 1 root root 2657280 Feb 10 10:33 02AlgorithmAnalysis.pdf
-rw-r--r-- 1 root root  293888 Feb 10 10:33 03Graphs.pdf
-rw-r--r-- 1 root root 3614720 Feb 10 10:33 04GreedyAlgorithmsI.pdf
-rw-r--r-- 1 root root 2896896 Feb 10 10:33 05DivideAndConquerI.pdf
-rw-r--r-- 1 root root 4186112 Feb 10 10:33 05DivideAndConquerII.pdf
-rw-r--r-- 1 root root  508928 Feb 10 10:33 06DynamicProgrammingI.pdf
-rw-r--r-- 1 root root 1150976 Feb 10 10:33 06DynamicProgrammingII.pdf
-rw-r--r-- 1 root root  296960 Feb 10 10:33 07NetworkFlowI.pdf
-rw-r--r-- 1 root root  348160 Feb 10 10:33 07NetworkFlowII.pdf
-rw-r--r-- 1 root root  362496 Feb 10 10:33 08IntractabilityI.pdf
-rw-r--r-- 1 root root  366592 Feb 10 10:33 08IntractabilityII.pdf
-rw-r--r-- 1 root root  266240 Feb 10 10:33 10ExtendingTractability.pdf
-rw-r--r-- 1 root root  277504 Feb 10 10:33 11ApproximationAlgorithms.pdf
-rw-r--r-- 1 root root  321536 Feb 10 10:33 12LocalSearch.pdf
-rw-r--r-- 1 root root    1369 Feb 10 09:27 crawler.py
```

---
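`crawler.py` itself isn't shown in the issue, but the core of such a PDF crawler — collecting `<a href="...pdf">` links from a lectures page — can be sketched with only the Python 3 standard library (the HTML and URL below are made-up stand-ins):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose href ends in .pdf."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL
            self.pdf_links.append(urljoin(self.base_url, href))

# Stand-in for a fetched lectures page:
html = '<a href="01StableMatching.pdf">1</a> <a href="notes.html">n</a>'
p = PdfLinkParser("http://example.com/lectures/")
p.feed(html)
print(p.pdf_links)  # ['http://example.com/lectures/01StableMatching.pdf']
```

In the real script one would fetch the page with `requests` and then save each linked PDF to disk; the parsing step stays the same.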
There are more and more web crawlers out there, and many are written by beginners. Unlike a search engine's crawlers, they don't know how to throttle themselves, so they often burn large amounts of server resources and waste bandwidth for nothing. Nginx makes it easy to filter requests by User-Agent: a simple regular expression at the URL entry point rejects crawler requests that don't meet your requirements:

```
...
location / {
    if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
        return 503;
    }
    # normal handling
    ...
}
...
```

`$http_user_agent` is an Nginx variable that can be referenced directly inside a `location` block. `~*` means a case-insensitive regex match; the `python` token alone is enough to filter out 80% of Python crawlers (the default `requests` User-Agent contains "python").

```js
// compatibility shims for old browsers
require('core-js/fn/array/from')
require('core-js/fn/array/find-index')
require('core-js/fn/array/find')
require('core-js/fn/array/keys')
require('core-js/fn/array/fill')
require('core-js/fn/array/some')
require('core-js/fn/object/assign')
require('core-js/fn/object/values')
```

https://m.dianping.com/auth/app?ft=5&ssp=true&redir=
https://catdot.dianping.com/broker-service/api/js

Front-end error reporting via an image beacon to the endpoint above:

```js
var _err = window.onerror;
var url = location.protocol + '//catdot.dianping.com/broker-service/api/js';
window.onerror = function (err, file, line, col, error) {
  var e = encodeURIComponent;
  var time = Date.now();
  (new window.Image()).src = url +
    '?error=' + e(err) +
    '&v=1' +
    '&data=' + e(error && error.stack ? error.stack : '') +
    '&url=' + e(location.href) +
    '&file=' + e(file) +
    '&line=' + e(line) +
    '&col=' + e(col) +
    '&timestamp=' + time;
  _err && _err(err, file, line, col, error);
};
```

Two ways to start the service (docker-compose style) — way 1, run celery beat as the container entrypoint:

```yaml
entrypoint:
  - celery
  - -A
  - cmdb_api
  - beat
  - -S
  - django_celery_beat.schedulers:DatabaseScheduler
  - -l
  - info
links:
  - redis:redis
```

Way 2, run the Django dev server:

```yaml
command:
  - python
  - manage.py
  - runserver
  - 0.0.0.0:8000
```

Windows PATH:

```
C:\Users\Administrator\AppData\Roaming\npm;C:\Users\Administrator\AppData\Roaming\nvm;d:\Program Files\nodejs;C:\Python27;C:\ProgramData\Administrator\atom\bin
```

Notes: HTTPS must use a domain name, not an IP plus port. What looked like Python 2 was actually Python 3 — the py3 installation had overwritten the `C:\Python27` directory of the py2 version.

```
C:\Users\Administrator>pip install requests
Collecting requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 90kB 130kB/s
Collecting urllib3<1.23,>=1.21.1 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB)
    100% |████████████████████████████████| 135kB 255kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 135kB 77kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB)
    100% |████████████████████████████████| 155kB 99kB/s
Collecting idna<2.7,>=2.5 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB)
    100% |████████████████████████████████| 57kB 25kB/s
Installing collected packages: urllib3, chardet, certifi, idna, requests
Successfully installed certifi-2018.1.18 chardet-3.0.4 idna-2.6 requests-2.18.4 urllib3-1.22
You are using pip version 7.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\Users\Administrator>pip -h

Usage:
  pip <command> [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring environment
                              variables and user configuration.
  -v, --verbose               Give more output. Option is additive, and can be
                              used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form
                              [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should
                              attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists:
                              (s)witch, (i)gnore, (w)ipe, (b)ackup.
  --trusted-host <hostname>   Mark this host as trusted, even though it does
                              not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file
                              containing the private key and the certificate
                              in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine
                              whether a new version of pip is available for
                              download. Implied with --no-index.
```
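The "pip version 7.1.2, however version 9.0.1 is available" notice is just a dotted-version comparison. A naive sketch of how such a check works (real tools use proper version parsing, e.g. `packaging.version`, which also handles suffixes like `1.0b1`):

```python
def parse_version(v):
    """Naive parse: '7.1.2' -> (7, 1, 2). Illustration only --
    breaks on non-numeric segments like '1.0b1'."""
    return tuple(int(part) for part in v.split("."))

installed, latest = "7.1.2", "9.0.1"
# Tuples compare element-wise, so (7, 1, 2) < (9, 0, 1)
if parse_version(installed) < parse_version(latest):
    print("You should consider upgrading")
```

This also shows why plain string comparison would be wrong: `"10.0.0" < "9.0.1"` as strings, but not as versions.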
The same missing-module dance on Windows with the `baike_spider` project. Because `spider_main.py` is launched from inside the `baike_spider` directory itself, the package is not importable as `baike_spider`; importing the sibling modules directly works:

```
E:\jackieli\python\python爬虫\python3-crawler\baike_spider>python spider_main.py
Traceback (most recent call last):
  File "spider_main.py", line 2, in <module>
    from baike_spider import url_manager, html_downloader, html_parser, html_outputer
ImportError: No module named 'baike_spider'

E:\jackieli\python\python爬虫\python3-crawler\baike_spider>python spider_main.py
Traceback (most recent call last):
  File "spider_main.py", line 2, in <module>
    import url_manager, html_downloader, html_parser, html_outputer
  File "E:\jackieli\python\python爬虫\python3-crawler\baike_spider\html_parser.py", line 1, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named 'bs4'

E:\jackieli\python\python爬虫\python3-crawler\baike_spider>pip install bs4
Collecting bs4
  Downloading bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
    100% |████████████████████████████████| 90kB 193kB/s
Installing collected packages: beautifulsoup4, bs4
  Running setup.py install for bs4
Successfully installed beautifulsoup4-4.6.0 bs4-0.0.1
You are using pip version 7.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

E:\jackieli\python\python爬虫\python3-crawler\baike_spider>python spider_main.py
craw 1 : http://baike.baidu.com/item/Python
craw 2 : http://baike.baidu.com/view/10812319.htm
```
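The first `ImportError: No module named 'baike_spider'` above comes down to `sys.path`: the script was run from inside the `baike_spider` directory, so the package's *parent* directory was never on the import path. A small self-contained demonstration (the `demo_spider`/`url_manager` names are made up for illustration):

```python
import os
import sys
import tempfile

# Build a throwaway package mirroring the baike_spider layout.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_spider")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "url_manager.py"), "w") as f:
    f.write("def new_urls():\n    return []\n")

# With the PARENT directory on sys.path, the package import succeeds --
# without this line, `from demo_spider import ...` raises ImportError.
sys.path.insert(0, root)
from demo_spider import url_manager

print(url_manager.new_urls())  # []
```

Equivalently, running the script from `python3-crawler` (the parent directory) instead of from inside `baike_spider` makes the original `from baike_spider import ...` line work, since Python puts the current script's directory on `sys.path`.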
Common pitfalls
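One pitfall ties back to the Nginx User-Agent rule earlier in this issue: `requests` sends a default User-Agent like `python-requests/2.18.4`, so such a filter blocks it, while a browser-like UA string passes. A quick sketch of the same match in Python:

```python
import re

# Same token list as the Nginx `if ($http_user_agent ~* ...)` rule;
# re.I mirrors the case-insensitive `~*` match.
BLOCKED = re.compile(r"python|curl|java|wget|httpclient|okhttp", re.I)

def is_blocked(user_agent):
    """True if this User-Agent would get a 503 from the rule above."""
    return bool(BLOCKED.search(user_agent))

print(is_blocked("python-requests/2.18.4"))                  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Which is also why simple UA filtering is easy to evade: a crawler that sets a browser-like `User-Agent` header slips through, so it only stops naive clients.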