Web crawlers - Githubissues

wanghaisheng / wanghaisheng.github.io

My blog

https://wanghaisheng-github-io.vercel.app


Web crawlers #19

Closed wanghaisheng closed 9 years ago

wanghaisheng commented 9 years ago

1. Could we build a heat map showing water quality across the whole country? How would we get an API that provides this kind of data?

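The water problem is more serious than PM2.5; with PM2.5 you can at least wait for the wind //@TurtleIzzy: furiously screening for MRSA
@人民日报 (People's Daily)
[Hello, Tomorrow] Antibiotics have been detected in the Yangtze and other rivers, and amoxicillin has turned up in Nanjing's tap water. This is shocking and, even more, worrying: when the water of life turns into medicated water, who takes responsibility for the public's health? A further question must be asked: when pharmaceutical plants discharge on this scale for this long, is it because standards are missing or because oversight is absent? Investigate the causes thoroughly and publish the findings truthfully; do not let dodging and evasion keep covering up the truth and overdrawing public trust!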

For reference: https://github.com/bsspirit/chinaWeatherDemo/tree/app

2. Keywords: water quality daily report, environmental protection bureau

wanghaisheng commented 9 years ago

https://github.com/commoncrawl https://github.com/norvigaward Examples of clustering analysis built on the data that Common Crawl provides

wanghaisheng commented 9 years ago

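1. Could we crawl a gaming corpus from http://www.twitch.tv/ ? 2. It looks like we would only need to crawl the body text of a site's articles; TextGrocery could then label them, for example by which games each article is about and which people and companies it involves.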
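The second idea (pull out an article's body text, then label it) can be sketched roughly as below. This is only an illustration under assumptions: it uses the standard library's `html.parser` for text extraction, and it swaps in a plain keyword lookup where a trained TextGrocery classifier would go. The `KEYWORDS` table, the labels, and the function names are all hypothetical.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page's visible text as one space-joined string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical label -> keyword table; a trained classifier would replace this.
KEYWORDS = {
    "dota2": ["dota", "roshan"],
    "starcraft": ["starcraft", "zerg", "protoss"],
}


def tag_article(text, keywords=KEYWORDS):
    """Return every label whose keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(
        label
        for label, words in keywords.items()
        if any(word in lowered for word in words)
    )
```

Calling `tag_article(extract_text(page_html))` on each crawled page gives a list of labels per article. A real setup would presumably train a classifier on labeled samples instead of matching keywords, and real pages usually need a smarter body-text heuristic than "all visible text".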

wanghaisheng commented 9 years ago

https://github.com/code4craft/webmagic (Java version)

wanghaisheng commented 9 years ago

Crawler for information on companies in China: https://github.com/iPitaya/CrawlerNew_crawler
Crawler for Tmall information: https://github.com/iPitaya/crawlerTmall_crawler4j

wanghaisheng commented 9 years ago

Scraping drug data from the food and drug administration site: http://app1.sfda.gov.cn/datasearch/face3/base.jsp?tableId=25&tableName=TABLE25&title=%B9%FA%B2%FA%D2%A9%C6%B7&bcId=124356560303886909015737447882

Found a ready-made one: https://github.com/waitingmyself/drugs (will play with it first)
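As a starting point for a scraper against the data search URL above, it helps to see what its query string carries. The sketch below only parses the URL and makes no request. It assumes the percent-encoded `title` value is GBK, which is a guess based on the bytes decoding cleanly that way; `decode_query` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

URL = (
    "http://app1.sfda.gov.cn/datasearch/face3/base.jsp"
    "?tableId=25&tableName=TABLE25"
    "&title=%B9%FA%B2%FA%D2%A9%C6%B7"
    "&bcId=124356560303886909015737447882"
)


def decode_query(url, encoding="gbk"):
    """Return the URL's query parameters as a dict of single values.

    The encoding default is an assumption about this site, not a documented fact.
    """
    query = urlsplit(url).query
    parsed = parse_qs(query, encoding=encoding, errors="strict")
    return {key: values[0] for key, values in parsed.items()}


params = decode_query(URL)
print(params["tableId"], params["tableName"], params["title"])
```

Decoded as GBK, `title` reads 国产药品 (domestic drugs), so `tableId=25` looks like the table to iterate over when paging through records.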

wanghaisheng commented 9 years ago

Construction of the drug database is currently in progress.

wanghaisheng commented 7 years ago

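I have been using Selenium + Chrome Driver in crawlers for quite a while now. Simulating real human actions to a reasonable degree and combining that with OCR works extremely well. That is how the Baidu Index data on juhezhishu.com was obtained [laugh-cry emoji]. There are still anti-crawling solutions, though; Distil, for example, does not allow JS injection.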
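A minimal sketch of that Selenium + OCR combination, under stated assumptions: the `selenium`, `pytesseract`, and `Pillow` packages plus a local Chrome and Tesseract install are assumed, the delay range is arbitrary, and `human_delays`, `capture_pages`, and `ocr_image` are hypothetical helper names. It is not how juhezhishu.com actually does it.

```python
import os
import random
import time


def human_delays(count, low=0.8, high=2.5, seed=None):
    """Return `count` randomized pauses in seconds, so page loads are not evenly spaced."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def capture_pages(urls, screenshot_dir="."):
    """Open each URL in Chrome, pause like a person would, and save a screenshot."""
    # Imported lazily so the delay helper works even without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    paths = []
    try:
        pauses = human_delays(len(urls))
        for index, (url, pause) in enumerate(zip(urls, pauses)):
            driver.get(url)
            time.sleep(pause)
            path = os.path.join(screenshot_dir, f"page_{index}.png")
            driver.save_screenshot(path)
            paths.append(path)
    finally:
        driver.quit()
    return paths


def ocr_image(path):
    """Run Tesseract OCR on a saved screenshot and return the recognized text."""
    # Assumes pytesseract and Pillow are installed along with the tesseract binary.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))
```

Screenshots plus OCR sidestep values that a site renders as images or obfuscated markup, at the cost of speed and OCR errors. In practice one would probably crop the screenshot to the element of interest before passing it to `ocr_image`.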

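At the moment, ips's virtual printing technology can capture the business data that has to be printed. What about the rest of the data? One kind lives in C/S client applications and the other in B/S web interfaces. How do we get at these two kinds? Of course, reading the database directly, the way an integration platform does, is the lowest-cost approach.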

https://github.com/spikesoffshore/Isla_Automation
https://github.com/Felix-P-Code/scrapyweixi 
WeChat collection built with scrapy + selenium + phantomjs; when it hits a CAPTCHA, the image is sent to a CAPTCHA-solving platform
https://github.com/spikesoffshore/Spikes_Automation/tree/master
Baixing.com nine-grid CAPTCHA test
https://github.com/zhr0319/Office/tree/master/%E7%99%BE%E5%A7%93%E7%BD%91%E4%B9%9D%E5%AE%AB%E6%A0%BC%E6%B5%8B%E8%AF%95
https://github.com/rdmpage/ocr-correction
https://github.com/congsang/Other/tree/master