* Content Collector

Use this awesome tool to download

- Images
- Gifs
- Image Albums
- Videos
- Comments

...which you've marked as Liked, Hearted, or Saved from

- Reddit
- Tumblr
- +Pixiv+
- +Pinterest+

...to disk! You can then browse the results.

/Crossed out = These sites do not have official APIs, so breakage can happen much more easily. As of 2021-07-04, both are non-functioning./

** Content Collector is moving!

*UPDATE:* For the near to mid-term, Content Collector will remain on GitHub.

Content Collector will be moving off of GitHub in the future.

+Please use ~https://macoy.me/code/macoy/Content-Collector.git~ instead.+

** Project status

I use Content Collector nearly every day, so it is still a neat and useful project. You can expect very good (though not perfect) Reddit support, a good offline media browser, and a local media scanning setup.

However, sites which lack suitable download APIs cause this type of project to be a constant battle or arms race against the site updating its security measures.

As a software developer, this kind of battle isn't one I want to fight. As such, you shouldn't expect any more updates to this project in regards to new site support or major features.

I think the ultimate solution to the media downloading problem is a web browser with strong automatic image/video downloading integrations.

Content Collector may still be valuable in that world by offering a good way to browse the content you've downloaded, but its automatic downloading is likely to fall into disrepair.

I'm still quite proud of this project, and hope that the other users have gotten good value out of it.

** Directions

*** 0. Check Releases

Check the [[https://github.com/makuto/Liked-Saved-Image-Downloader/releases][Releases]] page for a ready-to-use version of Content Collector. If you find a release for your system that works, you can skip straight to the Usage section.

*** 1. Clone this repository

#+BEGIN_SRC sh
git clone https://github.com/makuto/Liked-Saved-Image-Downloader
#+END_SRC

*** 2. Install python dependencies

**** Using Poetry

[[https://python-poetry.org/][Poetry]] can be used to automatically install the proper dependencies:

#+BEGIN_SRC sh
sudo pip3 install poetry
# If you are going to do -E security, you may have to:
sudo apt install libffi-dev
poetry install -E security
#+END_SRC

Note: ~poetry install -E security~ is only necessary if you want to use authentication and SSL encryption (which is recommended). ~poetry install~ is sufficient if you do not want those features.

**** The old way

This is the manual way to install the dependencies. Dependencies will be installed to your system, and virtual environments will not be used.

The following dependencies are required:

#+BEGIN_SRC sh
pip install praw pytumblr ImgurPython jsonpickle tornado youtube-dl git+https://github.com/ankeshanand/py-gfycat@master git+https://github.com/upbit/pixivpy py3-pinterest
#+END_SRC

You'll want to use Python 3, which for your environment may require you to specify ~pip3~ instead of just ~pip~.
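After installing, a quick import check can confirm that each package is visible to the interpreter you plan to run the server with. This is a minimal sketch, not part of Content Collector; the import names listed are assumptions about what each pip package exposes (for example ~youtube_dl~ for ~youtube-dl~), so adjust them if an import name differs.

```python
import importlib.util

# Assumed import names for the pip packages listed above; verify against
# each package's documentation if one is reported missing unexpectedly.
ASSUMED_MODULES = [
    "praw", "pytumblr", "imgurpython", "jsonpickle", "tornado",
    "youtube_dl", "gfycat", "pixivpy3", "py3pin",
]


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it with the same ~python3~ you will use for the server, since a different interpreter may see a different set of packages.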

***** Login-Protected Server

If you want to require the user to log in before they can interact with the server, you must install passlib and its hashing backends:

#+BEGIN_SRC sh
pip install passlib bcrypt argon2_cffi
#+END_SRC

*** 3. Generate SSL keys

#+BEGIN_SRC sh
cd Liked-Saved-Image-Downloader/
./Generate_Certificates.sh
#+END_SRC

This step is only required if you want to use SSL, which ensures you have an encrypted connection to the server. To disable SSL, open ~LikedSavedDownloaderServer.py~ and set ~useSSL = False~.
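Conceptually, a flag like ~useSSL~ decides whether the server is handed a TLS context at startup. The sketch below illustrates that idea with Python's standard ~ssl~ module; ~build_ssl_context~ and its certificate path parameters are hypothetical and do not come from ~LikedSavedDownloaderServer.py~.

```python
import ssl
from typing import Optional


def build_ssl_context(use_ssl: bool, cert_file: str, key_file: str) -> Optional[ssl.SSLContext]:
    """Return a server-side TLS context, or None when SSL is disabled.

    cert_file and key_file are hypothetical paths to the generated
    certificate and private key.
    """
    if not use_ssl:
        return None  # the server would then speak plain HTTP
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

Tornado's ~HTTPServer~ accepts such a context through its ~ssl_options~ argument; passing ~None~ leaves the server on plain HTTP.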

*** 4. Run the server

#+BEGIN_SRC sh
poetry run python3 LikedSavedDownloaderServer.py
#+END_SRC

**** Starting the server on boot

If you want to use systemd, do the following:

#+BEGIN_SRC sh
sudo cp content-collector.service /etc/systemd/system/content-collector.service
sudo systemctl enable content-collector
sudo systemctl start content-collector
#+END_SRC

When updating the server, you can use the following command to restart it:

#+BEGIN_SRC sh
sudo systemctl restart content-collector
#+END_SRC

***** Using ~cron~ instead

You can also use cron, but it's more of a hassle to stop/restart the server:

#+BEGIN_SRC sh
# Must be root account for access to port 443 (or 80 for unsecured servers)
sudo crontab -e
#+END_SRC

Add this to the file that opens for editing (customize path to your liking), then save that file:

#+BEGIN_SRC sh
@reboot cd /home/pi/ContentCollector && sudo poetry run python3 LikedSavedDownloaderServer.py 2>&1 | tee LikedSavedServer.log
#+END_SRC

Reboot your system to start the server.

**** If you did not use poetry

#+BEGIN_SRC sh
python3 LikedSavedDownloaderServer.py
#+END_SRC

** Usage

*** Access the server

Open [[https://localhost:8888][localhost:8888]] in any web browser.

If your web browser complains about the certificate, you may have to click ~Advanced~ and add the certificate as trustworthy, because you've signed the certificate and trust yourself :).

/(Explanation: this certificate isn't trusted by your browser because you created it. It will still protect your traffic from people snooping on your LAN)./

If you want to get rid of this warning, you'll need to get a signing authority like ~LetsEncrypt~ to generate your certificate, and host the server under a proper domain.

*** Set password

When first running the server, you will be prompted to set a password.

/If you forget your password, simply delete passwords.txt/.

*** Home page

The home page provides access to all server features:

[[file:images/Homepage.png]]

*** Set up accounts

Use Settings to configure the script:

[[file:images/LikedSavedSettings.png]]

Make sure to click "Save Settings" before closing the page.

You don't have to fill in every field, only the accounts you want.

*** Download content

Go to the Download Content page and click "Download new content":

[[file:images/DownloadContent.png]]

Wait until the downloader finishes (it will say "Finished" at the bottom of the page). While the downloader is running, the "Download new content" button will disappear.

*** Browse content

Enjoy! Use Browse Content to jump to random content you've downloaded, or browse your output directory:

[[file:images/LikedSavedBrowser.png]]

The browser should scale nicely to work on both mobile and desktop.

** Login management

The server requires you to log in before running the script, changing settings, or browsing downloaded content.

If you host Content Collector on the internet, you should rely on a more robust authentication scheme (e.g. use a reverse proxy which won't proxy requests to Content Collector until you have authenticated with the proxy server). Content Collector was designed for LAN use.

Note that all login cookies will be invalidated each time you restart the server. If you don't restart the server, your browser should remember login indefinitely.
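One common reason for this behavior, assumed here rather than confirmed from the source, is that the cookie-signing secret is generated fresh each time the server starts. The sketch below uses only the standard library to show the effect; ~new_secret~, ~sign~, and ~is_valid~ are hypothetical helpers, not Content Collector functions.

```python
import hashlib
import hmac
import secrets


def new_secret() -> bytes:
    """Generate a random signing secret, as a server might do at startup."""
    return secrets.token_bytes(32)


def sign(secret: bytes, value: str) -> str:
    """Return 'value|signature', where the signature is an HMAC-SHA256 hex digest."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}|{digest}"


def is_valid(secret: bytes, cookie: str) -> bool:
    """Check that the cookie was signed with this secret."""
    value, _, digest = cookie.rpartition("|")
    expected = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, expected)


first_run = new_secret()
cookie = sign(first_run, "logged-in")
after_restart = new_secret()  # a restart produces a different secret
```

A cookie signed during the first run verifies against ~first_run~ but not against ~after_restart~, so the browser has to log in again.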

*** Managing password(s)

The web interface will automatically prompt for a new password when first starting up.

You can also use ~PasswordManager.py~ to generate a file ~passwords.txt~ with your hashed (and salted) passwords:

#+BEGIN_SRC sh
python3 PasswordManager.py "Your Password Here"
#+END_SRC

You can create multiple valid passwords, if desired. There are no separate accounts, however.

If you want to reset all passwords, simply delete ~passwords.txt~.
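To illustrate what "hashed (and salted)" means, the sketch below stores one salted hash per password and accepts a login if any stored entry matches. It uses ~hashlib.pbkdf2_hmac~ from the standard library purely for illustration; ~PasswordManager.py~ may use a different algorithm and file format, and the function names and the ~salt$hash~ layout are hypothetical.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # illustrative work factor, not taken from the project


def hash_password(password: str) -> str:
    """Return 'salt_hex$hash_hex' using PBKDF2-HMAC-SHA256 with a random salt."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored_entries) -> bool:
    """Return True if the password matches any stored 'salt$hash' entry."""
    for entry in stored_entries:
        salt_hex, _, hash_hex = entry.partition("$")
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), ITERATIONS
        )
        if hmac.compare_digest(digest.hex(), hash_hex):
            return True
    return False


# Two valid passwords, no separate accounts: either one grants access.
entries = [hash_password("first password"), hash_password("second password")]
```

Because each entry gets its own random salt, hashing the same password twice yields different stored values, which is why the file cannot simply be compared against a table of known hashes.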

*** Disabling Login

Open ~LikedSavedDownloaderServer.py~ and find ~enable_authentication~. Set it equal to ~False~. You must restart the server for this to take effect.

** Running the script only

This is deprecated. You should use the web server to configure and run the script instead.

1. Copy ~settings_template.txt~ into a new file called ~settings.txt~
2. Open ~settings.txt~
3. Fill in your username and password
4. Set ~SHOULD_SOFT_RETRIEVE~ to ~False~ if you are sure you want to do this
5. Run the script: ~python redditUserImageScraper.py~
6. Wait for a while
7. Check your output directory (the default is ~output~ relative to where you ran the script) for all your images!

If you want more images, set ~Reddit_Total_Requests~ and/or ~Tumblr_Total_Requests~ to a higher value. The maximum is 1000. Unfortunately, Reddit does not allow you to get more than 1000 submissions of a single type (1000 liked, 1000 saved).

Not actually getting images downloaded, but seeing the console say it downloaded images? Make sure ~SHOULD_SOFT_RETRIEVE=False~ in ~settings.txt~

~settings.txt~ has several additional features. Read the comments to know how to use them.
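If you are unsure whether soft retrieval is still enabled, a small check like the one below can help. It is a sketch that assumes ~settings.txt~ holds one ~KEY=value~ pair per line and that lines starting with ~#~ are comments; confirm both against ~settings_template.txt~ before relying on it. ~read_settings~ and ~soft_retrieve_enabled~ are hypothetical helpers.

```python
def read_settings(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and '#' comments (assumed format)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def soft_retrieve_enabled(settings: dict) -> bool:
    """Treat anything other than an explicit 'False' as enabled, to stay cautious."""
    return settings.get("SHOULD_SOFT_RETRIEVE", "True").lower() != "false"


# Example contents for illustration only, not the real template.
example = """
# comment line
SHOULD_SOFT_RETRIEVE=False
Reddit_Total_Requests=500
"""
parsed = read_settings(example)
```

To check a real file, pass the result of ~open("settings.txt").read()~ to ~read_settings~.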

** OSX Python issues

On OSX, running the downloader from the Content Collector server may cause this error:

#+BEGIN_SRC sh
Output: output objc[29889]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.
#+END_SRC

This is a problem with Python and OSX's security model clashing. See [[https://github.com/ansible/ansible/issues/32499][this issue]] for an explanation.

To work around it, you need to first run

#+BEGIN_SRC sh
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
#+END_SRC

...before running the Content Collector server in that same terminal.

Alternatively, add that line to your bash profile, as suggested in [[https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr][this answer]].
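If you would rather not export the variable in every terminal, a small launcher can set it only for the server process. This is a sketch, not a script shipped with Content Collector; ~run_with_fork_safety_disabled~ is a hypothetical name.

```python
import os
import subprocess
import sys


def run_with_fork_safety_disabled(command, capture=False):
    """Run a command with OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES in its environment.

    The current environment is copied, so the variable is set only for the
    child process and not for the calling shell.
    """
    env = dict(os.environ)
    env["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
    return subprocess.run(command, env=env, capture_output=capture, text=True)


# Demonstration: a child Python process sees the variable.
result = run_with_fork_safety_disabled(
    [sys.executable, "-c",
     "import os; print(os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'])"],
    capture=True,
)
```

To start the server this way, call ~run_with_fork_safety_disabled(["python3", "LikedSavedDownloaderServer.py"])~ from the repository directory.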

** Issues

Feel free to create Issues on this repo if you need help. I'm friendly so don't be shy.
