taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License
2.39k stars, 579 forks

WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302

Open lapp0 opened 4 years ago

lapp0 commented 4 years ago

A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:

To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.

Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages, and via query strings.

How to run

Please test this change so I can fix any bugs!

1) Clone the repo and pull this branch.
2) Install the Selenium dependencies (geckodriver and Firefox): https://selenium-python.readthedocs.io/installation.html
3) Enter the twitterscraper directory and run python3 setup.py install.
4) Run your query.

If you have any bugs, please paste your command and full output in this thread!
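Before running a query, it may help to confirm that the two external dependencies from step 2 are actually reachable. This is only a minimal sketch using the standard library; the function name missing_prerequisites is made up for illustration and is not part of twitterscraper.

```python
import shutil


def missing_prerequisites(executables=("firefox", "geckodriver")):
    """Return the names of executables that cannot be found on PATH.

    Hypothetical helper, not part of twitterscraper.
    """
    return [name for name in executables if shutil.which(name) is None]
```

Calling missing_prerequisites() returns an empty list when both programs are found; any names it returns are worth mentioning when you paste your output in this thread.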

Improvements

Notes

Problems

lapp0 commented 3 years ago

@christiangfv please try what I just pushed

tahmidrashid commented 3 years ago

I can confirm that this method works quite well. I am wondering whether using helium would solve the geckodriver installation issue, since helium ships with common webdrivers.

See helium https://github.com/mherrmann/selenium-python-helium

I tried this suggestion from a friend and was able to make the geckodriver work (in Ubuntu 18.04 with Firefox 80.0.1):

wget https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-linux64.tar.gz
tar -xvf geckodriver-v0.27.0-linux64.tar.gz
sudo mv geckodriver /usr/bin

christiangfv commented 3 years ago

@lapp0 Nothing changed; I get the same 502 error. I tried it with a proxy, without a proxy, and with both the Firefox and Chrome drivers, and the result is the same.

lapp0 commented 3 years ago

@christiangfv that's unfortunate. Perhaps someone more talented than me could update the docker image to install gnu/linux, selenium, geckodriver, and firefox. I'm not sure what your specific issue is. Perhaps you could try running it in a debian virtual machine?

(also, I assume you pulled and ran python3 setup.py install?)

christiangfv commented 3 years ago

Yes @lapp0, I have done that. Now I will test in a virtual machine with Ubuntu.

christiangfv commented 3 years ago

@lapp0 It works on Ubuntu Linux!! Ubuntu is always a faithful ally. But I have a new problem; do you know anything about this? My code:

tweets = twitterscraper.get_query_data('realDonaldTrump', poolsize=1, begindate=dt.date(2020, 9, 20), limit=10)
print(tweets)

This function returns an empty list, but when I debug I can see that it does pick up a tweet correctly. image

lapp0 commented 3 years ago

Thanks for testing @christiangfv !

I made a mistake in the date filtering function I pushed today so it doesn't allow single-day ranges. Pushing a fix.
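To illustrate the kind of filter involved, here is a minimal sketch of a date filter that accepts single-day ranges. It is not the code in this branch: the function name filter_by_date and the "timestamp" key are assumptions, and it treats both bounds as inclusive, which may differ from what the branch does.

```python
import datetime as dt


def filter_by_date(tweets, begindate, enddate):
    """Keep tweets whose day lies in [begindate, enddate], both inclusive.

    Assumes each tweet is a dict with a datetime under the hypothetical
    key "timestamp". Because both comparisons are inclusive, a range with
    begindate == enddate still matches tweets posted on that day.
    """
    return [t for t in tweets if begindate <= t["timestamp"].date() <= enddate]
```

With begindate and enddate both set to dt.date(2020, 9, 20), only tweets from that day are kept.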

christiangfv commented 3 years ago

You can solve it with this: image. But it is already resolved with your code, hahaha. Thanks @lapp0! Hey friend, do you know if this implementation works on a server running Ubuntu Server or another OS?

lapp0 commented 3 years ago

@christiangfv Good to hear!

I don't see any reason why this wouldn't work on ubuntu server

Buaasinong commented 3 years ago

image I got this problem and I don't know why. I am using the "From within Python" approach.

image I ran this, but it seemed to have no effect.

christiangfv commented 3 years ago

@Buaasinong Are you in the correct directory? What is your operating system? On Linux you need the sudo command.

Buaasinong commented 3 years ago

Are you in the correct directory? What is your operating system? in linux you need the sudo command.

Oh, I'm using Windows. And I think my directory is correct.

I tried to run setup.py directly and got this: image

christiangfv commented 3 years ago

@Buaasinong The correct command for the installation is python3 setup.py install. You need to add the install argument.

Buaasinong commented 3 years ago

@Buaasinong The correct command for the installation is python3 setup.py install. You need to add the install argument.

Yes! I used python setup.py install (because python3 setup.py install still gives no result), but it still doesn't work. Could you share a screenshot of it running successfully in a terminal?

image Oh, I think python setup.py install did not help. Sad.

lapp0 commented 3 years ago

@Buaasinong are you running python2? What is python --version?

Would be really helpful if someone got the docker working with selenium so we could avoid these version and platform specific issues.

christiangfv commented 3 years ago

Hi @lapp0, I have noticed that when I retrieve the tweets, they do not come back in order. I only need the last 20 tweets from an account, in order. For this I made the changes shown in image, and the new link that I used is:

INIT_URL = 'https://twitter.com/search?q={q}&src=typed_query&f=live'

Finally, to get sorted tweets, I first sorted the dict of tweets in: image
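Since the screenshots are not reproduced here, the following is only a rough sketch of the sorting idea, not the actual change. It assumes the tweets are held in a dict whose values are dicts carrying a datetime under a hypothetical "timestamp" key; latest_tweets is an illustrative name.

```python
import datetime as dt
from urllib.parse import quote_plus

# URL template quoted in the comment above.
INIT_URL = 'https://twitter.com/search?q={q}&src=typed_query&f=live'


def latest_tweets(tweets_by_id, n=20):
    """Return the n most recent tweets, newest first.

    tweets_by_id maps an id to a tweet dict; the "timestamp" key is an
    assumption made for this sketch.
    """
    ordered = sorted(tweets_by_id.values(),
                     key=lambda t: t["timestamp"],
                     reverse=True)
    return ordered[:n]


# Filling in the template: quote_plus encodes spaces as '+'.
example_url = INIT_URL.format(q=quote_plus("foo bar"))
```

Sorting once after collection avoids depending on the order in which the page happens to render tweets.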

Buaasinong commented 3 years ago

@Buaasinong are you running python2? What is python --version?

Would be really helpful if someone got the docker working with selenium so we could avoid these version and platform specific issues.

Python 3.7, but it's 32-bit.

Buaasinong commented 3 years ago

Hi bro @christiangfv, I got the same problem in the browser that you had. I wanted to disable the proxy using --disableproxy, but there is an error in my terminal: image

christiangfv commented 3 years ago

Hi @Buaasinong, the problem is that the --disableproxy or -dp argument is not implemented. I had the problem on macOS 10.15.6; I think it could be due to system restrictions. I solved it by installing on Ubuntu Linux.

lapp0 commented 3 years ago

--disableproxy hasn't been merged into this branch yet, but is in master. However, I don't think that's the problem. I think it's a problem specific to selenium that doesn't involve the proxies at all.

I have pushed untested docker changes to install geckodriver, however they're pretty straightforward. @Buaasinong could you try to use the Dockerfile to build and run?

lapp0 commented 3 years ago

Bug found:

The following query has variable results: https://twitter.com/search?q=foo+bar+since%3A2017-11-11+until%3A2017-11-13

Sometimes it includes the tweet "foobar is being so I'm out the first round" and sometimes it doesn't.

I can't figure out what conditions result in this being the case, sometimes there are 20 results, sometimes there are 40, sometimes 57. Perhaps the only option is to require multiple passes, unfortunately.

You can try this manually by going to that url and retrying multiple times until the tweet "foobar is being so I'm out the first round" is missing (or present) to see its variable presence.

first hypothesis: different proxies have different results

second hypothesis: different twitter servers have different results cached, and I may be querying different servers each time?

This is tough to test.
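For anyone who wants to script the reload test, the search URL above can be rebuilt with the standard library. This is just a sketch; search_url is an illustrative name and not a function in twitterscraper.

```python
import datetime as dt
from urllib.parse import quote_plus


def search_url(query, since, until):
    """Build a Twitter search URL with since:/until: date operators."""
    q = "{} since:{} until:{}".format(query, since.isoformat(), until.isoformat())
    # quote_plus turns spaces into '+' and ':' into '%3A'.
    return "https://twitter.com/search?q=" + quote_plus(q)
```

search_url("foo bar", dt.date(2017, 11, 11), dt.date(2017, 11, 13)) reproduces the URL quoted above.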

Buaasinong commented 3 years ago

--disableproxy hasn't been merged into this branch yet, but is in master. However, I don't think that's the problem. I think it's a problem specific to selenium that don't involve the proxies at all.

I have pushed untested docker changes to install geckodriver, however they're pretty straightforward. @Buaasinong could you try to use the Dockerfile to build and run?

Thanks Bro, I will try

Buaasinong commented 3 years ago

Hi @Buaasinong, the problem is that the --disableproxy or -dp argument is not implemented. I had the problem on macOS 10.15.6, I think it could be system restrictions. I solved it by installing on Ubuntu linux

Wow, that's bad news. But I still appreciate it.

Bakdaulet1 commented 3 years ago

Upon running a query, I get the following error: "ModuleNotFoundError: No module named 'twitterscraper.ts_logger'"

lapp0 commented 3 years ago

@Bakdaulet1 I merged master and that broke some stuff. Most recent push should resolve.

lapp0 commented 3 years ago

I ran this 100 times:

query_js.get_query_data(begindate=dt.date(2019, 11, 11), enddate=dt.date(2019, 11, 13), poolsize=3, lang='en', query='foo bar')

I then counted the number of tweets received each time. Here is the mapping of "number of tweets" -> "number of times this number of tweets was returned":

41: 9
55: 16
70: 20
73: 1
74: 1
83: 3
84: 50

So exactly half the time we got 84 tweets.

I'm running this experiment again with multiple passes for each day (unfortunate but probably workable solution), will update this comment with results.
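As a rough illustration of the multiple-pass idea (not the code in this branch), several passes over the same query could be merged by keeping one tweet per id. The "tweet_id" key and the name merge_passes are assumptions for this sketch; OBSERVED is the distribution reported above.

```python
def merge_passes(passes):
    """Union several passes over the same query, one tweet per id.

    Each pass is a list of tweet dicts; "tweet_id" is a hypothetical key.
    The first copy seen of each tweet is the one that is kept.
    """
    merged = {}
    for tweets in passes:
        for tweet in tweets:
            merged.setdefault(tweet["tweet_id"], tweet)
    return list(merged.values())


# Distribution from the 100 runs above: tweets returned -> number of runs.
OBSERVED = {41: 9, 55: 16, 70: 20, 73: 1, 74: 1, 83: 3, 84: 50}
total_runs = sum(OBSERVED.values())
share_complete = OBSERVED[84] / total_runs  # fraction of runs returning 84
```

If different passes miss different tweets, the union grows toward the full set as passes are added, at the cost of extra requests.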

webcoderz commented 3 years ago

https://github.com/lapp0/twitterscraper/blob/faf75afb7eff15a04df471a0b651011b67db85fa/Dockerfile#L4 So the Firefox install here in the Dockerfile doesn't work as is. Perhaps it should use the officially maintained selenium/node-firefox Dockerfile as a base? https://hub.docker.com/r/selenium/node-firefox/dockerfile

lapp0 commented 3 years ago

@webcoderz you probably know docker better than me, as I use a different packaging/containerizing solution (nix). Would you be willing to make the change?

webcoderz commented 3 years ago

Would you recommend two Docker services communicating over a Docker network, or a single container?

lapp0 commented 3 years ago

@webcoderz IMHO it'd be better to have a single docker container.

Bakdaulet1 commented 3 years ago

I am getting the following error:

INFO:twitterscraper:Scraping tweets from https://twitter.com/Bakdaul52729738
INFO:twitterscraper:Using proxy 191.96.42.80:8080
INFO:twitterscraper:Got 0 tweets from username Bakdaul52729738

lapp0 commented 3 years ago

Number of passes -> % of time it collects every single tweet

get_query_data(begindate=dt.date(2019, 11, 11), enddate=dt.date(2019, 11, 13), poolsize=3, lang='en', query='foo bar') (expect 84 total)

get_query_data(begindate=dt.date(2018, 5, 5), enddate=dt.date(2018, 5, 7), poolsize=3, lang='en', query='trump') (expect ?? total) (running, will update)

lapp0 commented 3 years ago

@Bakdaulet1 please share your command/code

Bakdaulet1 commented 3 years ago

@Bakdaulet1 please share your command/code

twitterscraper Bakdaul52729738 --user -o tweets_username.json

lapp0 commented 3 years ago

@Bakdaulet1 the legacy endpoint is broken. To use the changes in this merge request you need the --javascript (or -j) argument. Also please merge the following changes I'm about to push before running.

webcoderz commented 3 years ago

would also be nice to have a function that scrapes by twitter id....

You can use this:

twitter_id = "1308322643054034944"
"https://twitter.com/anyuser/status/{}".format(twitter_id)

https://twitter.com/anyuser/status/1308322643054034944 will take you to the correct tweet; all you need is the twitter id.
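Wrapped up as a small helper, the idea might look like the sketch below. status_url is an illustrative name only; no such function exists in twitterscraper as far as this thread shows.

```python
def status_url(tweet_id, user="anyuser"):
    """Build the URL of a single tweet from its id.

    The user segment defaults to the "anyuser" placeholder from the
    comment above, which says the id alone is enough to reach the tweet.
    """
    return "https://twitter.com/{}/status/{}".format(user, tweet_id)
```

status_url("1308322643054034944") gives the same link as in the comment.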

lapp0 commented 3 years ago

@webcoderz good idea! Although I think that's out of scope for this merge request. Feel free to make your own.

For a programmatic example, see test_simple.py https://github.com/taspinar/twitterscraper/pull/302/files#diff-9ccfdcc9b53fb500cf131ea181f733e8

webcoderz commented 3 years ago

OK, I found an error in my Dockerfile. I will try again and let you know; I just pushed the change and the PR (I had a symlink to just geckodriver but not to selenium wire).

webcoderz commented 3 years ago

@lapp0 I submitted a PR; the Dockerfile is tested and works, and it puts the geckodriver executable in /bin. I was wondering whether I should also add the other browser drivers, in case you want to do rotating browsers for improved proxying capability or something like that.

lapp0 commented 3 years ago

IMO, adding new browsers will just increase complexity with little to no benefit. I don't think twitter is going to throttle/ban Firefox.

The biggest problem right now is memory and CPU, so if chrome is more efficient browsing twitter, we might want to go that way. Although, I think installing ublock origin in firefox would be a better decision for that.

lapp0 commented 3 years ago

Ran into this for a big query:

image

Going to experiment with a throttling mechanism to avoid this.
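One simple form of throttling is to enforce a minimum interval between requests. The class below is only a sketch of that general technique, not the mechanism used in this branch; Throttle and its parameters are illustrative names.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each page load would space requests out; a suitable interval would have to be found by experiment.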

webcoderz commented 3 years ago

I think that can all be done via Docker too; I will look into it.

lapp0 commented 3 years ago

That'd be great, we might even be able to set custom ublock rules to block irrelevant twitter endpoints and speed up scraping.

smuotoe commented 3 years ago

image

@lapp0 I noticed this in the latest commit... And I am curious, with the return keywords in the if...else conditional statement, will the code ever go further to retry?

lapp0 commented 3 years ago

@smuotoe Yeah, that entire function needs to be cleaned up. It's slowly built technical debt as I've handled more and more edge cases.

Good catch, pushing a fix.
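The code from the screenshot is not reproduced here, so this is only a generic sketch of the pattern under discussion: if both branches of an if...else return, nothing after them can run, so a retry has to live in a loop that returns only on success. fetch_with_retry and fetch are hypothetical names.

```python
def fetch_with_retry(fetch, retries=3):
    """Call fetch() until it returns something other than None.

    fetch is any zero-argument callable that returns None on failure.
    """
    for _attempt in range(retries):
        result = fetch()
        if result is not None:
            return result  # success: stop retrying
        # No return on failure, so the loop moves on to the next attempt.
    return None  # every attempt failed
```

Returning only in the success branch is what lets the remaining attempts run.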

webcoderz commented 3 years ago

https://blog.mozilla.org/addons/2019/10/31/firefox-to-discontinue-sideloaded-extensions/

lapp0 commented 3 years ago

@webcoderz that's too bad. Do you know if chromium has these capabilities?