Closed · elad661 closed this issue 9 years ago
I've made something with Python and it seems to be working, but I guess you'd want to rewrite it in Ruby if you want to include this in twitter_ebooks.
Here's the code if anyone is interested:
```python
#!/usr/bin/env python3
# coding=utf8
import argparse
import json
import os
import os.path
from operator import itemgetter


def main():
    parser = argparse.ArgumentParser(description='Parse twitter-generated tweet archive to a format twitter_ebooks can understand')
    parser.add_argument('path', metavar='path', type=str,
                        help='path to the archive')
    args = parser.parse_args()

    args.path = os.path.expanduser(args.path)
    tweets_dir = os.path.join(args.path, 'data', 'js', 'tweets')
    all_tweets = []
    for month in sorted(os.listdir(tweets_dir)):
        with open(os.path.join(tweets_dir, month), 'r', encoding='utf-8') as f:
            contents = f.read()
        if not contents.startswith('['):
            # Remove the JS variable assignment line, if it exists
            contents = contents[contents.index('\n') + 1:]
        this_month_tweets = json.loads(contents)
        for tweet in this_month_tweets:
            if 'retweeted_status' not in tweet:  # skip retweets
                all_tweets.append(tweet)

    # Sort approximately the same way `ebooks archive` would sort
    # (close enough, at least)
    all_tweets.sort(key=itemgetter('created_at'))
    all_tweets.reverse()

    # Write the result (explicit utf-8 so ensure_ascii=False is safe everywhere)
    with open('archive.json', 'w', encoding='utf-8') as f:
        json.dump(all_tweets, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    main()
```
It crashes for me when I try to consume the archive. Any idea how I can find out why? It works for another account.

```
└[~/var/ebooks/farthen_ebooks]> ebooks consume corpus/farthen.json
Reading json corpus from corpus/farthen.json
Removing commented lines and sorting mentions
[1] 22214 terminated  ebooks consume corpus/farthen.json
```
I don't know, I only tried it on one account and it worked.
twitter_ebooks should actually be able to read the CSV file that's included with your Twitter archive. Copy the `tweets.csv` file into whatever folder you're working in (or just make sure you properly point to it) and then run:

```
ebooks consume tweets.csv
```

Ta-da! No need to try to parse all the individual months found in the `/data/js/tweets/` directory of your Twitter archive.
@rockbandit's solution worked for me. Also: the README.txt that comes with your Twitter archive suggests "To consume the export in a generic JSON parser in any language, strip the first and last lines of each file." So you can push each of the js files through

```
cat myfile | sed '$d' | perl -ne 'print if $. != 1' > newfile
```

(note the quotes around `'$d'` so the shell doesn't try to expand it as a variable) and wind up with a usable JSON file.
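If you'd rather do that strip in Python, the same check the script above uses can stand on its own. This is just a sketch; the sample string below is synthetic and only illustrates the general shape of the monthly `.js` files:

```python
import json


def strip_js_assignment(contents):
    """Drop the leading JS variable assignment line, if present,
    so the rest of the file parses as plain JSON (same check as
    the script above)."""
    if not contents.startswith('['):
        contents = contents[contents.index('\n') + 1:]
    return contents


# Synthetic example resembling one of the monthly .js files
raw = 'Grailbird.data.tweets_2014_01 =\n[ {"text": "hello"} ]'
tweets = json.loads(strip_js_assignment(raw))
print(tweets[0]['text'])  # hello
```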
As @rockbandit notes, you can consume the CSV from Twitter archives directly. If you want to convert the CSV file to JSON, `ebooks jsonify` in 3.0.9 will do that too :)
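Conceptually the conversion is just CSV rows to a JSON list of objects. This sketch uses made-up sample data and is not how `ebooks jsonify` is actually implemented:

```python
import csv
import io
import json

# Made-up two-row sample standing in for a real tweets.csv
sample = io.StringIO('tweet_id,text\n1,hello world\n2,goodbye')
rows = list(csv.DictReader(sample))
print(json.dumps(rows, indent=2))
```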
It would be awesome if there were a built-in utility to convert Twitter's archive .js files into a JSON file that twitter_ebooks can parse.