kanishka-linux / reminiscence

Self-Hosted Bookmark And Archive Manager
GNU Affero General Public License v3.0
1.74k stars 85 forks source link
archive bookmark bookmark-manager bookmarks django self-hosted selfhosted

Reminiscence

Self-hosted Bookmark and Archive manager

Table of Contents

Features

Installation

  1. First make sure that python 3.9+ (recommended version is 3.10+) is installed on system and install following packages using native package manager.

    1. virtualenv
    
    2. ~wkhtmltopdf (for html to pdf/png conversion)~ deprecated from v4.0+ due to security vulnerability.
    
        * [hlspy](https://github.com/kanishka-linux/hlspy) is now default headless browser which is based on QTWebEngine.
    
    3. hlspy (mandatory from v4.0+)
    
    4. redis-server
    
    5. chromium (optional from v0.2+)
    
    6. PyQt5
    
    7. PyQtWebEngine
  2. Installation of above dependencies in Arch or Arch based distros

    $ sudo pacman -S python-virtualenv redis chromium python-pyqt5 qt5-webengine python-pyqtwebengine
  3. Installation of above dependencies in Debian or Ubuntu based distros

    $ sudo apt install virtualenv redis-server chromium-browser python3-pyqt5 python3-pyqt5.qtwebengine
  4. Install hlspy

    $ sudo pip3 install git+https://github.com/kanishka-linux/hlspy

Note: Name of above dependencies may change depending on distro or OS, so install accordingly. Once above dependencies are installed, execute following commands, which are distro/platform independent.

Now execute following commands in terminal.

$ mkdir reminiscence

$ cd reminiscence

$ virtualenv -p python3 venv

$ python3 -m venv venv (for python3.10+)

$ source venv/bin/activate

$ cd venv

$ git clone https://github.com/kanishka-linux/reminiscence.git

$ cd reminiscence

$ source hlspy.env

$ pip install -r requirements.txt

$ mkdir logs archive tmp

$ python manage.py generatesecretkey

$ python manage.py nltkdownload

$ python manage.py migrate

$ python manage.py createsuperuser

$ python manage.py runserver 0.0.0.0:8000

open 0.0.0.0:8000 using any browser, login and start adding links

**Note:** replace localhost address with local ip address of your server

          to access web-interface from anywhere on the local network

Admin interface available at: /admin/

Setting up Celery (mandatory from v0.4 onwards):

  1. Generating PDFs and PNGs are resource intesive and time consuming. We can delegate these tasks to celery, in order to execute them in the background.

    Edit reminiscence/settings.py file and set `USE_CELERY = True`
  2. Now open another terminal in the same topmost project directory and execute following commands:

    $ sudo systemctl start redis-server
    
    $ cd venv
    
    $ source bin/activate
    
    $ cd venv/reminiscence
    
    $ source hlspy.env
    
    $ celery -A reminiscence worker --loglevel=info -c 4 --detach

Using Docker

Note: Following procedure may not work exactly from v4.0+. The dockerfiles have been updated but it is possible that users may still face some issues, so they are advised to make changes in respective Dockerfile or docker-compose as required.

Using docker is convenient compared to normal installation method described above. It will take care of configuration and setting up of gunicorn, nginx and also postgresql database along with redis and worker. (Setting and running up these three things can be a bit cumbersome, if done manually, which is described below in separate section.) It will also automatically download headless browser hlspy and nltk data set, apart from installing python based dependencies.

Note: from v4.0+, wkhtmltopdf is replaced with hlspy. Users are advised to migrate to v4.0 due to security vulnerability in wkhtmltopdf. If users are finding it difficult to migrate then they should atleast disable automatic pdf/png generation of a web-page for older reminiscence version and use chromium instead manually for pdf generation.

  1. Install docker and docker-compose

  2. Enable/start docker service. Instructions for enabling docker might be different in different distros. Sample instruction for enabling/starting docker will look like

    $ systemctl enable/start docker.service
  3. clone github repository and enter directory

    $ git clone https://github.com/kanishka-linux/reminiscence.git
    
    $ cd reminiscence
  4. build and start

    $ sudo docker-compose up --build
    
    Note: Above instruction will take some time when executed for the first time.
  5. Above step will also create default user: 'admin' with default password: 'changepassword'

  6. If IP address of server is '192.168.1.2' then admin interface will be available at

    192.168.1.2/admin/
    
    Note: In this method, there is no need to
          attach port number to IP address.
  7. Change default admin password from admin interface and create new regular user. After that logout, and open '192.168.1.2'. Now login with regular user for regular activity.

  8. For custom configuration, modify nginx.conf and dockerfiles available in the repository. After that execute step 4 again.

Note: If Windows users are facing problem in mounting data volume for Postgres, they are advised to refer this issue.

Note: Ubuntu 16.04 users might have to modify docker-compose.yml file and need to change version 3 to 2. issue

Note: For setting celery inside docker follow these instruction. Sometimes gunicorn doesn't work properly with default background task handler inside docker. In such cases users can enable celery.

Documentation

Adding Directories And Links

Automatic Tagging and Summarization

This feature has been implemented using NLTK library. The library has been used for proper tokenization and removing stopwords from sentence. Once stopwords are removed, top K high frequency words (where value of K is decided by user) are used as tags. In order to generate summary of HTML content, score is alloted to a sentence based on frequency of non-stopwords contained in it. After that highests score sentences (forming 1/3'rd of total content) are used to generate summary. It is one of the simplest methods for automatic tagging and summarization, hence not perfect. It can't tag group of meaningful words. e.g. It will not consider 'data structure' as a single tag. Supporting multi-word tags is in TODO list of the project.

About summarization, there are many advance methods which may give even more better results, which users can find in this paper. Both these feature needs to be activated from Settings box. It is off by default.

reminiscence

Reader mode

Once user will open link using inbuilt reader, the application will try to present text content, properly formatted for mobile devices whenever possible. In reader mode user will also find options Original, PDF and PNG, at the top header. These options will be available only when user has archived the link in those formats. Options for selecting archive file format is available in every user's Settings box. If Original, format is selected then users can see the text content along with original stylesheet and linked images. Javascript will be removed from original file format due to security reasons. If page can't be displayed due to lack of javascript then users have to depend on either PDF or full-page PNG formats.

reminiscence

Generating PDF and PNG

PDF and full-page screenshot in PNG format of HTML page will be generated using wkhtmltopdf. It is headless tool but in some distro it might not be packaged with headless feature. In such cases, users have to run it using Xvfb. In order to use it headlessly using Xvfb, set USE_XVFB = True in reminiscence/settings.py file and then install xvfb using command line.

Note: Use Xvfb, only when wkhtmltopdf is not packaged with headless feature.

Note: Alternatively Users can also download official headless wkhtmltopdf for their resepctive distro/OS from here. Only problem is that, users will have to update the package manually on their own for every new update.

Why not use Headless Chromium?

Currently headless chromium doesn't support full page screenshot, otherwise I might have used it blindly. There is another headless browser hlspy, based on QtWebEngine, which I built for my personal use. hlspy can generate entire html content, pdf document and full page screenshot in one single request and that too using just one single process. In both chromium and wkhtmltopdf, one has to execute atleast two separate processes for doing the same thing. The main problem with hlspy is that it is not completely headless, it can't run without X. It requires xvfb for running in headless environment.

In future, I'll try to provide a way to choose between different backends (i.e. chromium, wkhtmltopdf or hlspy) for performing these tasks.

Note: From v0.2+ onwards, support for headless Chromium has been added for generating HTML and PDF content. Users can use this feature if default archived content has some discrepancies. Users need to install Chromium to use this feature.

Archiving Media Elements

Note: This feature is available from v0.2+ onwards

  1. In settings.py file add your favourite download manager to DOWNLOAD_MANAGERS_ALLOWED list. Default are curl and wget. In case of docker based method users have to make corresponding changes in dockersettings.py file. For large arbitrary files with direct download links, curl and wget are good enough. For complex use cases users will need something like youtube-dl, which they have to install and manage on their own and needs to be added to the DOWNLOAD_MANAGERS_ALLOWED list.

  2. open web-interface settings box and add command to Download Manager Field:

    ex: wget {iurl} -O {output}
    
    iurl -> input url
    output -> output path
    
    OR
    
    ex: youtube-dl {iurl} -o {output}
  3. Users should not substitute anything for {iurl} and {output} field, they should be kept as it is. In short, users should just write regular command with parameters and leave the {iurl} and {output} field untouched. (Note: do not even remove curly brackets).

  4. Reminiscence server will take care of setting up of input url i.e. {iurl} and output path field i.e. {output}.

  5. If user is using youtube-dl as a download manager, then it is advisable to install ffmpeg along with it. In this case user has to take care of regular updating of youtube-dl on their own. In docker based installation, users have to add installation instructions for ffmpeg in Dockerfile; and then need to modify requirements.txt and add youtube_dl as dependency.

  6. Web-interface settings box also contains, streaming option. If this option is enabled, then HTML5 compliant media files can be played inside browsers, otherwise they will be available for download on the client machine.

  7. If users are upgrading from older version then they are advised to apply database migration using following commands, before using new features:

    python manage.py makemigrations
    
    python manage.py migrate
  8. Finally, when adding url to any directory just prepend md: to url, so that the particular entry will be recognized by custom download manager.

    ex=> md:https://some-website-with-media-link.org/media-link
    
    Every entry added by this way will be treated as containing media
    
    or single arbitrary file with direct download link.
  9. Archived files are normally saved in archive folder. Users can change location of this folder via settings.py file. Users should note that in order to archive media files, the archive location should not contain any space. e.g. archive location '/home/user/my downloads/archive' is not allowed. However location without space '/home/user/my_downloads/archive' is allowed.

  10. By default, archived media links are not shared with anyone. However, users can create public links for some fixed time. Once a public link has been created, it will remain valid for 24 hours. Users can change this value by changing value of VIDEO_ID_EXPIRY_LIMIT in settings.py. These public links are also useful for playing non-HTML5 compliant archived media on regular media players like mpv/mplayer/vlc etc..It is also possible to generate a playlist in m3u format for a directory containing media links, which can be played by any popular media player.

Annotation And Read-it-later Feature

This is the latest feature and available from v0.3+ onwards. This feature allows addition, deletion and modification of annotation.

Annotation support works well on desktop browsers. On mobile, this feature works mostly on firefox (for annotating html/pdf/epub).

How to use this feature on desktop browsers?

How to use this feature on mobile firefox?

How these featues have been implemented?

REST API

Reminiscence uses Django Rest Framework for exposing few functionalities via REST endpoints (available from v0.3+ onwards).

Following are few examples of API usage using cURL

  1. Login and get auth token (POST) /restapi/login/

    $ curl -d username=mypy -d password=foobarbaz http://127.0.0.1:8000/restapi/login/

Token obtained with above request needs to be passed to every subsequent request header. We'll call the token as AUTHTOKEN for rest of the examples.

  1. Add url to Reminiscence instance in a specific directory (POST) /restapi/add-url/

    $ curl -H "Authorization: Token AUTHTOKEN" -d url="https://mr.wikipedia.org" -d media_link=no -d directory="/Wiki/Marathi" http://127.0.0.1:8000/restapi/add-url/
  2. List all urls added to a specific directory (POST) /restapi/list-added-urls/

    $ curl -H "Authorization: Token AUTHTOKEN" -d directory="/Wiki" http://127.0.0.1:8000/restapi/list-added-urls/
  3. List all directories (GET) /restapi/list-directories/

    $ curl -H "Authorization: Token AUTHTOKEN" http://127.0.0.1:8000/restapi/list-directories/
  4. Logout and remove token (GET) /restapi/logout/

    $ curl -H "Authorization: Token AUTHTOKEN" http://127.0.0.1:8000/restapi/logout/

Running Tests

Note Taking

From v0.3 onwards, users can add arbitrary note to their collection. Support for note taking is rudimentary and provides note taking using simple WYSIWYG editor.

For adding note use following command in the input url box:

    note:New note

    above command will create *New note* in the current folder

Public-Private-Group directories

By default, all directories and all links are private and are not shared with anyone. However, users can select one public directory and one group directory from all available directories for sharing links. User can set public and group directory via settings. Links placed in public directory will be available for public viewing and links placed in group directory will be available for pre-determined list of users selected by account holder.

Public links of a user can be accesed at the url:

    /username/profile/public

Group links of a user can be accesed by pre-determined group of users at the url:

    /username/profile/group

Searching Bookmarks

Bookmarks can be searched according to title, url, tag or summary by using search menu available at the top most navigation bar. By default bookmarks will be searched according to title. In order to search according to url, tag or summary, users have to prefix url:, tag:, or sum: to the search term, in the search box.

Note: Special search prefix tag-wall: will display all available tags.

About Database

By default, reminiscence uses sqlite database, but users can replace it with any database supported by django ORM like postgresql. Some simple instructions for using postgresql with django are available here . Users can also take a look at this wiki, for proper postgresql database setup. There might be some changes in the instructions depending on the OS and distributions you are using.

Understanding Settings Files

reminiscence folder contains three settings files

1. settings.py

2. defaultsettings.py

3. dockersettings.py

Gunicorn plus Nginx setup

(optional)

to it. After that collect staticfiles using command

    $ python manage.py collectstatic

Once staticfiles of admin have been collected in the static folder, users should remove STATIC_ROOT from settings.py, before running the web server.

Motivation

Till few years back, I used to think that once something has been published on the web, it is going to remain there forever in some form or other. But web of today is different. Now we never know, when some valuable web resource (like web-pages, images, text, pdf etc...) will disappear from the web completely. There might be variety of reasons for disappearance (e.g. author of resource lost interest in maintaining it, low traffic or some other political-economic reasons). I don't want to go into details, but there are plenty of reasons due to which web-resource that we savoured in the past, might become un-available in the future. If we are lucky, then we may find mirrors of popular sites of the past, archived by volunteers. But, the same can't be said true of obscure and rare web content. So, I decided to develop Reminiscence for saving personal memories of the web effectively and in a well organized manner, which somehow turned into a hybrid of bookmark manager and personal wayback machine.