mispy-archive / twitter_ebooks

Better twitterbots for all your friends~
MIT License
972 stars 140 forks

Utility to convert twitter archive .js files to twitter_ebooks compatible json #59

Closed elad661 closed 9 years ago

elad661 commented 9 years ago

It would be awesome if there was a built-in utility to convert Twitter's archive .js files to a JSON file that twitter_ebooks can parse.

elad661 commented 9 years ago

I've made something with Python and it seems to be working, but I guess you'd want to rewrite it in Ruby if you want to include this in twitter_ebooks.

Here's the code if anyone is interested:

#!/usr/bin/env python3
# coding=utf8
import argparse
import json
import os
import os.path
from operator import itemgetter

def main():
    parser = argparse.ArgumentParser(description='Parse twitter-generated tweet archive to a format twitter_ebooks can understand')
    parser.add_argument('path', metavar='path', type=str,
                        help='path to the archive')
    args = parser.parse_args()
    args.path = os.path.expanduser(args.path)
    tweets_dir = os.path.join(args.path, 'data', 'js', 'tweets')
    all_tweets = []
    for month in sorted(os.listdir(tweets_dir)):
        with open(os.path.join(tweets_dir, month), 'r', encoding='utf-8') as f:
            contents = f.read()
            if not contents.startswith('['):
                # Remove the js variable assignment line, if it exists
                contents = contents[contents.index('\n')+1:]
            this_month_tweets = json.loads(contents)
            for tweet in this_month_tweets:
                if 'retweeted_status' not in tweet:  # remove retweets
                    all_tweets.append(tweet)

    # Sort approximately the same way `ebooks archive` would sort
    # (close enough, at least)
    all_tweets.sort(key=itemgetter('created_at'), reverse=True)

    # Write the result
    with open('archive.json', 'w', encoding='utf-8') as f:
        json.dump(all_tweets, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()
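For reference, each monthly file in the archive wraps the JSON array in a JavaScript variable assignment, which is why the script above strips the first line before parsing. A minimal sketch of that step (the sample contents below are made up; real files carry a per-month variable name and many tweets):

```python
import json

# Made-up sample of a monthly archive file: a JS variable assignment
# on the first line, followed by the JSON array itself.
contents = 'Grailbird.data.tweets_2014_01 = \n[\n  {"id_str": "1", "text": "hello"}\n]\n'

# Drop the assignment line if present, as the script above does,
# leaving plain JSON that json.loads can handle.
if not contents.startswith('['):
    contents = contents[contents.index('\n') + 1:]

tweets = json.loads(contents)
print(tweets[0]['text'])  # hello
```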

felinira commented 9 years ago

It crashes for me when trying to consume the archive. Any idea how I can find out why? It works for another account.

    └[~/var/ebooks/farthen_ebooks]> ebooks consume corpus/farthen.json
    Reading json corpus from corpus/farthen.json
    Removing commented lines and sorting mentions
    [1]    22214 terminated  ebooks consume corpus/farthen.json

elad661 commented 9 years ago

I don't know, I only tried it on one account and it worked.

daveschumaker commented 9 years ago

twitter_ebooks should actually be able to read the csv file that's included with your Twitter archive. Copy the tweets.csv file into whatever folder you're working in (or just make sure you properly point to it) and then run:

    ebooks consume tweets.csv

Tada! No need to try to parse all the individual month files found in the /data/js/tweets/ directory of your Twitter archive.

brighid commented 9 years ago

@rockbandit's solution worked for me. Also: the README.txt that comes with your Twitter archive suggests "To consume the export in a generic JSON parser in any language, strip the first and last lines of each file." So you can push each of the js files through cat myfile | sed '$d' | perl -ne 'print if $. != 1' > newfile and wind up with a usable JSON file.
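For anyone who'd rather stay in Python, the same first-and-last-line stripping can be sketched like this (the function name and file paths are placeholders, not part of twitter_ebooks):

```python
# Python equivalent of: cat myfile | sed '$d' | perl -ne 'print if $. != 1' > newfile
# i.e. drop the first and last lines of each archive .js file.
def strip_first_and_last(in_path, out_path):
    with open(in_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    with open(out_path, 'w', encoding='utf-8') as f:
        f.writelines(lines[1:-1])
```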

ghost commented 9 years ago

As rockbandit notes, you can consume the csv from twitter archives directly. If you want to convert the csv file to json, "ebooks jsonify" in 3.0.9 will do that too :)