superryeti / Hands-on-WebScraping

This repo is part of a blog series on several web scraping projects, where we explore scraping techniques for crawling data from simple websites through to websites using advanced protection.
MIT License
82 stars, 74 forks

No tweets are being scraped. #10

Closed iprelic closed 1 year ago

iprelic commented 3 years ago

Hashtags are found, but it doesn't find any tweets. I have lowered the settings (delay and concurrency) and set ROBOTSTXT_OBEY to false. Any tips?
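For readers following along, the settings mentioned above would normally go in the project's settings.py. This is a minimal sketch using standard Scrapy setting names; the values are illustrative assumptions and are not taken from this repo:

```python
# settings.py (sketch): values are illustrative assumptions, not this repo's defaults.

# Do not fetch or honor robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Maximum number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 1
```

Scrapy lists any non-default values under "Overridden settings" at startup, which is a quick way to check that a change was actually picked up.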

gfhswter commented 3 years ago

I don't know how it is supposed to work, but I was searching the web for a long time and I didn't find anything useful :))

JuanDavidG1997 commented 3 years ago

In my case it doesn't find any tweets

michael-pagan commented 3 years ago

@amitupreti any insight on what's going on here? I've run into this issue, resolved by pip installing ipdb, and this issue, resolved by updating the USER_AGENT to 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0', on top of the one posted here.

This seems like a great tool, but I'm having a lot of trouble getting things to work. My current output is below - you'll see "0 tweets are found," but visiting the queried URL clearly brings back results.

2021-09-28 22:00:30 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: TwitterHashTagCrawler)
2021-09-28 22:00:30 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.19042-SP0
2021-09-28 22:00:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-09-28 22:00:30 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TwitterHashTagCrawler', 'NEWSPIDER_MODULE': 'TwitterHashTagCrawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['TwitterHashTagCrawler.spiders']}
2021-09-28 22:00:30 [scrapy.extensions.telnet] INFO: Telnet Password: 8bfdeceaee79e82e
2021-09-28 22:00:30 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2021-09-28 22:00:31 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-09-28 22:00:31 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-09-28 22:00:31 [scrapy.middleware] INFO: Enabled item pipelines: []
2021-09-28 22:00:31 [scrapy.core.engine] INFO: Spider opened
2021-09-28 22:00:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-28 22:00:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-28 22:00:31 [root] INFO: 1 hashtags found
2021-09-28 22:00:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mobile.twitter.com/robots.txt> (referer: None)
2021-09-28 22:00:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mobile.twitter.com/hashtag/dogsoftwitter> (referer: None)
2021-09-28 22:00:31 [root] INFO: 0 tweets found
2021-09-28 22:00:31 [root] INFO: Next page found:
2021-09-28 22:00:31 [scrapy.core.engine] INFO: Closing spider (finished)
2021-09-28 22:00:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 559, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 23053, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'elapsed_time_seconds': 0.43433, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2021, 9, 29, 2, 0, 31, 493310), 'httpcompression/response_bytes': 83405, 'httpcompression/response_count': 2, 'log_count/DEBUG': 2, 'log_count/INFO': 13, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2021, 9, 29, 2, 0, 31, 58980)}
2021-09-28 22:00:31 [scrapy.core.engine] INFO: Spider closed (finished)
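One possible reading of this log: both requests came back with status 200, yet the spider reported 0 tweets. That would be consistent with the page downloading fine while the spider's extraction step matches nothing. This interpretation is an assumption; the thread never confirms a root cause. The sketch below pulls those two signals out of a log excerpt using only the standard library (the helper name `summarize` is hypothetical):

```python
import re

# Three lines copied from the log above.
LOG = """\
2021-09-28 22:00:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mobile.twitter.com/robots.txt> (referer: None)
2021-09-28 22:00:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mobile.twitter.com/hashtag/dogsoftwitter> (referer: None)
2021-09-28 22:00:31 [root] INFO: 0 tweets found
"""

# Matches Scrapy's "Crawled (<status>) <GET <url>>" debug lines.
CRAWLED = re.compile(r"Crawled \((\d{3})\) <GET (\S+)>")
# Matches the spider's own "<n> tweets found" info lines.
TWEETS = re.compile(r"(\d+) tweets found")


def summarize(log_text):
    """Return ([(status, url), ...], total number of tweets reported)."""
    pages = [(int(status), url) for status, url in CRAWLED.findall(log_text)]
    tweets = sum(int(n) for n in TWEETS.findall(log_text))
    return pages, tweets


pages, tweets = summarize(LOG)
all_ok = all(status == 200 for status, _ in pages)
if all_ok and tweets == 0:
    print("Pages downloaded fine, but no tweets were extracted.")
```

If that reading holds, changing delay or concurrency settings would not be expected to help, since the requests themselves are succeeding.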

adam321123 commented 2 years ago

I still have the same issue. How can I fix this?

michael-pagan commented 2 years ago

Couldn’t get it working. Wound up creating my own wrapper around tweepy.

zinDante commented 1 year ago

Any fixes for this issue? No tweets are showing, but the hashtags are found if we visit the URL.

superryeti commented 1 year ago

Hi everyone, the repo is no longer maintained. I am sorry about that. I will not be working on this anytime soon.