micahflee commented 1 year ago

I'm working on completely reworking the open source semiphemeral project to have all of the functionality of the hosted app Semiphemeral.com. Rather than refactoring the open source version, I'm basically starting over and copying chunks of the Semiphemeral.com code into it and then updating it to work for just a single user.

This is a work-in-progress PR.

Set up a dev environment

There's now a BUILD.md with instructions on getting started. You need Python 3 and Node.js installed. Install dependencies like this:

poetry install
cd semiphemeral/frontend
npm install
cd ../..

All the frontend code is in semiphemeral/frontend and you can build it by running this (you need to re-run it each time you edit any frontend code):

poetry run build

You can start the Semiphemeral server by running:

poetry run semiphemeral

During development I just run poetry run build && poetry run semiphemeral. When I make changes to any of the code I press CTRL-C and just run the same thing again.

All data is stored in ~/.semiphemeral. The settings (including Twitter API creds) are stored in settings.json, and then everything else is in a sqlite3 database called data.db.
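For orientation, reading that data directory from a script might look roughly like the sketch below. The file names and the `twitter_api_key` setting come from this PR; the helper names `load_settings` and `open_db` are made up for illustration and are not functions in the codebase.

```python
import json
import sqlite3
from pathlib import Path

# Location described above; pass a different directory to override it.
DATA_DIR = Path.home() / ".semiphemeral"


def load_settings(data_dir: Path = DATA_DIR) -> dict:
    """Return the contents of settings.json, or {} if it doesn't exist yet."""
    path = data_dir / "settings.json"
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def open_db(data_dir: Path = DATA_DIR) -> sqlite3.Connection:
    """Open (creating if necessary) the sqlite3 database data.db."""
    data_dir.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(data_dir / "data.db")
```

Something like `load_settings().get("twitter_api_key")` would then return `None` until the credentials have been configured.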

How it works so far

When you first run, before you have configured it with your API credentials, it gives you step-by-step instructions on how to get those credentials from the Twitter Developer Portal. You can't proceed until you provide the correct API creds and test that they work:

Screenshot 2023-06-11 at 2 02 37 PM Screenshot 2023-06-11 at 2 03 25 PM

After you give valid API credentials, the rest of the app is unlocked and it redirects you to the Settings page:

Screenshot 2023-06-11 at 2 04 48 PM

You can then choose your settings, and go to the Dashboard.

Screenshot 2023-06-11 at 2 06 46 PM

I've started implementing logic that requires you to download your Twitter data first and delete later. I'm not done with this part yet, and this is where I want help from others. First let me explain a few things.

How jobs work

This is a Flask app (with socketio), but there's also a background thread that runs the run_jobs function:

def main():
    # Start the function in a background thread
    thread = threading.Thread(target=run_jobs, args=(socketio, db_session, settings,))
    thread.start()

    # Start the web server
    print("Use Semiphemeral at: http://localhost:8080/")
    print("Press CTRL-C to quit")
    socketio.run(app, host='0.0.0.0', port=8080)

    # Wait for the background thread to finish
    thread.join()

run_jobs is an infinite loop that selects all of the pending jobs from the database and runs them one at a time. During a job, it pushes progress updates to the client via socketio.
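A stripped-down sketch of that loop, using only the standard library. The `jobs` table layout, the status strings other than "pending", and the `handlers` mapping are assumptions for illustration. The real run_jobs takes socketio, db_session and settings and pushes progress over socketio, which is replaced here by a plain `notify` callback.

```python
import sqlite3
import time
from typing import Callable

# Hypothetical schema: one row per job, processed in insertion order.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""


def add_job(conn: sqlite3.Connection, job_type: str) -> int:
    """What a route handler would do: just record a pending job."""
    cur = conn.execute("INSERT INTO jobs (job_type) VALUES (?)", (job_type,))
    conn.commit()
    return cur.lastrowid


def run_pending_jobs(conn, handlers: dict, notify: Callable[[str], None]) -> int:
    """Run every pending job once, one at a time; return how many ran."""
    rows = conn.execute(
        "SELECT id, job_type FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, job_type in rows:
        conn.execute("UPDATE jobs SET status = 'active' WHERE id = ?", (job_id,))
        conn.commit()
        try:
            handlers[job_type](notify)
            status = "finished"
        except Exception as e:  # a failing job must not kill the runner
            notify(f"Job {job_id} failed: {e}")
            status = "failed"
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
        conn.commit()
    return len(rows)


def run_jobs(conn, handlers, notify, poll_seconds: float = 1.0):
    """Infinite loop, meant to run in the background thread."""
    while True:
        if run_pending_jobs(conn, handlers, notify) == 0:
            time.sleep(poll_seconds)  # nothing to do; avoid a busy loop
```

One thing to watch with this particular sketch: by default a sqlite3 connection can only be used from the thread that created it, so the connection would need to be opened inside the background thread.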

When you click "Download Twitter Data", the Flask route just adds a pending download job to the database. The job runner thread then selects that job and runs it. Here's the download job function: https://github.com/micahflee/semiphemeral/blob/modernize/semiphemeral/jobs.py#L78-L290

It starts by creating a Twitter API v1.1 client and verifying that the Twitter creds are valid, then goes on to download the entire history of tweets and likes, pushing updates to the client the whole time:

# Download job
def download(socketio, db_session, settings, job):
    print(f"Starting download job: {job}")
    start_job(socketio, db_session, job, "Download started")

    # Create Twitter API client
    api = create_tweepy_client_v1_1(
        settings.get("twitter_api_key"),
        settings.get("twitter_api_secret"),
        settings.get("twitter_access_token"),
        settings.get("twitter_access_token_secret"),
    )
    try:
        response = api.verify_credentials()
    except Exception as e:
        fail_job(socketio, db_session, job, f"Failed to verify credentials: {e}")
        socketio.emit("fail", {"message": f"Failed to verify credentials: {e}"})
        return

    since_id = settings.get("since_id")

    print("Download started")

    # Start downloading the data
    if since_id:
        update_progress(socketio, db_session, job, "Downloading recent tweets")
    else:
        update_progress(socketio, db_session, job, "Downloading all tweets, this first run may take a long time")

However, when I run it, it hits an exception on tweepy.API.user_timeline:

Forbidden error: 403 Forbidden 453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

Screenshot 2023-06-11 at 2 16 40 PM

To explain this, lemme go into the difference between the Twitter API v1.1 and v2.

Twitter API v1.1 and v2

Twitter has two different APIs: the modern v2 API and the older legacy v1.1 API. Semiphemeral at the moment only uses the v1.1 API, but there's code in common.py that lets you create a client for whichever API you choose: create_tweepy_client_v2 makes a v2 client, and create_tweepy_client_v1_1 makes a v1.1 client.

All the Twitter API code uses Tweepy. Check out the Tweepy docs -- on the left there are two separate sections for "Twitter API v1.1 Reference" and "Twitter API v2 Reference". Make sure you look at the correct docs for the API that you're using.

API v2 is a bit nicer and more modern than v1.1, but Twitter started aggressively rate limiting everything in the v2 API. A while back I had actually updated all of the Semiphemeral.com code to use the v2 API, but I quickly hit my limit on the number of tweets I could download, so I switched it all back to the v1.1 API since that has no such limit.

However, it looks like when you create a brand new Twitter app and get new free API credentials, these creds don't support the API v1.1 endpoints that we need to download tweets. (I think the Semiphemeral.com API creds are kind of grandfathered in and still somewhat work, but not new ones.) So the only way forward is to use API v2.
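As a starting point for that conversion, here is an untested sketch of paging through a user's tweets with Tweepy's v2 `Client.get_users_tweets`. The function name `iter_user_tweets_v2` and the `budget` parameter are made up for illustration, and I'm assuming the client is a `tweepy.Client` created with user-context credentials (which is presumably what create_tweepy_client_v2 returns). It has not been run against the real API.

```python
from typing import Iterator, Optional


def iter_user_tweets_v2(client, user_id, since_id: Optional[int] = None,
                        budget: int = 1500) -> Iterator:
    """Yield a user's tweets via API v2, stopping after `budget` tweets.

    `client` is expected to be a tweepy.Client built with user-context
    (OAuth 1.0a) credentials. `budget` is a hypothetical safety limit so a
    single run doesn't burn through the whole monthly pull cap.
    """
    fetched = 0
    token = None
    while fetched < budget:
        # The v2 user timeline endpoint accepts 5-100 results per page, so
        # the last page may request up to 4 tweets more than the budget.
        page_size = max(5, min(100, budget - fetched))
        response = client.get_users_tweets(
            user_id,
            max_results=page_size,
            since_id=since_id,
            pagination_token=token,
            tweet_fields=["created_at", "public_metrics"],
            user_auth=True,
        )
        for tweet in response.data or []:  # data is None on an empty page
            if fetched >= budget:
                return
            fetched += 1
            yield tweet
        token = (response.meta or {}).get("next_token")
        if not token:
            return  # no more pages
```

With a real client, `client.get_me().data.id` should give the `user_id` to pass in. Taking the client as a parameter also means the paging logic can be exercised with a fake client, which helps given that none of my accounts have tweets left to test against.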

This is where I need help, and also is it worth it at all?

I've already deleted all of my tweets, and none of my burner Twitter accounts have any tweets in them. So I actually don't have access to an account to very thoroughly test the download code even if I did rewrite it to use API v2.

And when I go to https://developer.twitter.com/en/portal/product (logged into the Twitter account I made my API creds with) I can see that API v2 has a limit of pulling 1,500 tweets a month. That's like, a very small number.

Screenshot 2023-06-11 at 2 27 19 PM

Let's say @torproject wanted to delete its old tweets. That account has 12.7K tweets, which honestly is on the low end for a Twitter account that's been around for a long time. That would take over 8 months just to download the tweets in order to delete them, lol.
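The arithmetic behind that estimate, as a tiny sketch (the function name is made up; the 1,500 cap and the 12.7K tweet count are the numbers quoted above):

```python
import math


def months_to_download(tweet_count: int, monthly_cap: int = 1500) -> int:
    """Whole months needed to pull `tweet_count` tweets at `monthly_cap` per month."""
    return math.ceil(tweet_count / monthly_cap)


print(months_to_download(12_700))  # 9 (12,700 / 1,500 is about 8.5, rounded up)
```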

Is this project, or making a self-hosted only Semiphemeral, even worth it anymore?

Anyway, I think this is where I've hit my limit on what I can do. I'd like help with maybe trying to convert this download job code from API v1.1 to API v2 and test it to make sure it works. But with such a small monthly cap, maybe this is all a waste of time.

My only other idea is relying on the downloadable Twitter archive: the user can download their archive and then upload it to Semiphemeral to delete those tweets and likes. I don't know if API v2 would hit other deletion caps though, and again this isn't something I can easily do without access to a real Twitter account full of data.
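Getting tweet IDs out of an extracted archive could look something like this sketch. It assumes the archive stores tweets in `data/tweets.js` as a JavaScript assignment (`window.YTD.tweets.part0 = [...]`) wrapping a JSON array of `{"tweet": {...}}` objects; I haven't checked that against a current archive, and the function names are made up for illustration.

```python
import json
from pathlib import Path


def parse_archive_js(text: str) -> list:
    """Parse a Twitter-archive .js data file into a list of dicts.

    Assumes the file is a JavaScript assignment whose right-hand side is a
    JSON array, e.g. `window.YTD.tweets.part0 = [ ... ]`.
    """
    start = text.index("[")  # skip the "window.YTD... =" prefix
    # raw_decode tolerates trailing characters after the array
    data, _end = json.JSONDecoder().raw_decode(text, start)
    return data


def tweet_ids_from_archive(archive_dir: Path) -> list:
    """Return the IDs of all tweets in an extracted archive (assumed layout)."""
    path = archive_dir / "data" / "tweets.js"
    entries = parse_archive_js(path.read_text(encoding="utf-8"))
    return [entry["tweet"]["id_str"] for entry in entries]
```

The IDs could then be queued up as delete jobs, so the API would only be needed for the deletions themselves and not for downloading.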

JohnVeness commented 1 year ago

Thanks for this Micah. There's already a merged PR (#23) to import a downloaded Twitter archive, if you're unaware.

Also, I don't know if it's still the case, but about a year ago it was still possible to request access to the 1.1 API, even if they only gave you access to 2.0 by default.

SaptakS commented 1 year ago

I think maybe it's worth testing the archive workflow (but we need to verify whether they have a deletion limit). I don't think 1500 tweets really helps anyone. But also, maybe instead of downloading all tweets, we can download only tweets that match the settings for deletion. 🤔 A part of me also wonders if there is some way to web scrape the tweets instead of using the API, but I feel Twitter will stop that very quickly.

The other option is basically just to convert the code to use API v2, and then kind of leave the project, mentioning that anyone who wants to use can still use it (in case someone thinks that 1500 limit is fine for them), but we are not adding any new features or maintaining it, but can probably review PRs if others want to contribute changes to keep up with the API definitions.

I feel if API v2 is the only way ahead, then unless the code to use v2 is overly different and complicated, we can just do that and then merge and leave the project (probably archive?). But if downloading the archive and importing it into the platform and then doing deletion works, I think that is probably a fine workflow as well.

dstN commented 1 year ago

I could test with an archive of about 17k tweets. I want to delete most of it anyway, but I've never worked with Python and I'm on Windows.

atlcell commented 1 year ago

@micahflee will re-visit this soon to help all of us get a working copy running locally. Will learn Python as we go. FYI, the old standalone version was working on the old API, but would/could only unretweet - couldn't debug deleting tweets. I want to contribute, but I would like/need to work with you/pairing for maximum effectiveness.

micahflee / semiphemeral

Merge all the latest from semiphemeral.com into open source project #125
