The subreddit archiver
BSD 3-Clause "New" or "Revised" License

timesearch

NEWS (2023 06 25):

Pushshift's API is currently offline. Without the timestamp search parameter or Pushshift access, timesearch is not able to get historical data. You can continue to use the livestream module to collect new posts and comments as they are made.

You can still download the Pushshift archives, though. https://the-eye.eu/redarcs/ is one source.

I have added a module for ingesting these JSON files into a timesearch database, so that you can continue to use offline_reading, or simply work with the data in SQLite format if you prefer. You need to extract the .zst file with an archive tool like 7-Zip before giving it to timesearch.

python timesearch.py ingest_jsonfile subredditname_submissions -r subredditname

python timesearch.py ingest_jsonfile subredditname_comments -r subredditname
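Before ingesting, it can help to peek at an extracted file and confirm it looks the way you expect. The following is a minimal sketch, not part of timesearch: it assumes the extracted dump is newline-delimited JSON (one object per line), and the function name `iter_json_lines` is made up for this illustration.

```python
import json

def iter_json_lines(path, limit=None):
    """
    Yield one dict per non-empty line of a newline-delimited JSON file.
    Assumption: the extracted dump stores one JSON object per line.
    """
    count = 0
    with open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Skip blank lines rather than failing on them.
                continue
            yield json.loads(line)
            count += 1
            if limit is not None and count >= limit:
                break
```

For example, `for post in iter_json_lines('subredditname_submissions', limit=5): print(sorted(post.keys()))` would show the field names of the first five records without loading the whole file into memory.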

NEWS (2023 05 01):

Reddit has revoked Pushshift's API access, so pushshift.io may not be able to continue ingesting reddit content.

NEWS (2018 04 09):

Reddit has removed the timestamp search feature which timesearch was built off of (original). Please message the admins by sending a PM to /r/reddit.com. Let them know that this feature is important to you, and you would like them to restore it on the new search stack.

Thankfully, Jason Baumgartner aka /u/Stuck_in_the_Matrix, owner of Pushshift.io, has made it easy to interact with his dataset. Timesearch now queries his API to get post data, and then uses reddit's /api/info to get up-to-date information about those posts (scores, edited text bodies, ...). While we're at it, this also gives us the ability to speed up get_comments. In addition, we can get all of a user's comments, which was not possible through reddit alone.
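To give a rough idea of the second step, here is a sketch of how post IDs from a dataset could be turned into fullnames and grouped into batches for /api/info, which accepts up to 100 fullnames per request. This is an illustration only and not timesearch's actual code; the helper names `to_fullnames` and `chunked` are hypothetical, and no network requests are made.

```python
def to_fullnames(base36_ids, kind='t3'):
    """
    Prefix bare base36 IDs with a reddit type prefix.
    't3' is the prefix for submissions and 't1' is the prefix for comments.
    """
    return [f'{kind}_{item_id}' for item_id in base36_ids]

def chunked(items, size=100):
    """
    Yield consecutive slices of at most `size` items.
    The default of 100 matches the per-request limit of /api/info.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be passed to whatever reddit client you use to refresh scores and edited bodies for those posts.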

NOTE: Because Pushshift is an independent dataset run by a regular person, it does not contain posts from private subreddits. Without the timestamp search parameter, scanning private subreddits is now impossible. I urge once again that you contact the admins to have this feature restored.


I don't have a test suite. You're my test suite! Messages go to /u/GoldenSights.

Timesearch is a collection of utilities for archiving subreddits.

Make sure you have:

This package consists of:

To use it

When you download this project, the main file that you will execute is timesearch.py here in the root directory. It will load the appropriate module to run your command from the modules folder.

You can view a summarized version of all the help text by running timesearch.py, and you can view a specific help text by running a command with no arguments, like timesearch.py livestream, etc.

I recommend sqlitebrowser if you want to inspect the database yourself.
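If you would rather take a quick look from Python, the standard library's sqlite3 module is enough for a first inspection. This is a minimal sketch that makes no assumptions about timesearch's schema: it only asks SQLite which tables exist and how many rows each one has. The function name `list_tables` is made up for this example.

```python
import sqlite3

def list_tables(db_path):
    """
    Return a dict mapping each table name in the database to its row count.
    Note: sqlite3.connect creates an empty file if db_path does not exist.
    """
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        counts = {}
        for (name,) in rows:
            # Table names come from sqlite_master, so quoting them is sufficient here.
            query = f'SELECT COUNT(*) FROM "{name}"'
            counts[name] = connection.execute(query).fetchone()[0]
        return counts
    finally:
        connection.close()
```

Point it at the database file that timesearch created for your subreddit to see which tables are there before writing your own queries.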

Changelog


I want to live in a future where everyone uses UTC and agrees on daylight savings.

Mirrors

https://git.voussoir.net/voussoir/timesearch

https://github.com/voussoir/timesearch

https://gitlab.com/voussoir/timesearch

https://codeberg.org/voussoir/timesearch