seanbreckenridge / url_cache

A file system cache which saves URL metadata and summarizes content
https://pypi.org/project/url-cache/
Apache License 2.0
10 stars 1 forks source link
cache metadata opengraph subtitles url youtube

PyPi version Python3.8|3.9|3.10|3.11 PRs Welcome

This is currently very alpha and in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data for it to be useful, but no so much that it takes enormous amounts of space.

As it stands I'm sort of pessimistic this would ever be a silver bullet, getting useful info out of arbitrary HTML is hard, so you're sort of stuck writing parsers for each website you're interested in. However, I still use this frequently, especially as a cache for API information like described below

Current TODOs:

Installation

Requires python3.8+

To install with pip, run:

python3 -m pip install url_cache

As this is still in development, for the latest changes install from git: python3 -m pip install git+https://github.com/seanbreckenridge/url_cache

Rationale

A file system cache which saves URL metadata and summarizes content

This is meant to provide more context to any of my tools which use URLs. If I watched some youtube video and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I read an article, I want the article text! This requests, parses and abstracts away that data for me locally, so I can do something like:

>>> from url_cache.core import URLCache
>>> u = URLCache()
>>> data = u.get("https://sean.fish/")
>>> data.metadata["images"][-1]
{'src': 'https://raw.githubusercontent.com/seanbreckenridge/glue/master/assets/screenshot.png', 'alt': 'screenshot', 'type': 'body_image', 'width': 600}
>>> data.metadata["description"]
"sean.fish; Sean Breckenridge's Home Page"

If I ever request that URL again, the information is grabbed from a local cache instead.

Generally, this uses:

Site-Specific Extractors:

This is meant to be extendible -- so its possible for you to write your own extractors/file loaders/dumpers (for new formats (e.g. srt)) for sites you use commonly and pass those to url_cache.core.URLCache to extract richer data for those sites. Otherwise, it saves the information from lassie and the summarized HTML using readability for each URL.

To avoid scope creep, this probably won't support:

Usage:

In Python, this can be configured by using the url_cache.core.URLCache class: For example:

import logging
from url_cache.core import URLCache

# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/Documents/urldata")
c = cache.get("https://github.com/seanbreckenridge")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")

For more information, see the docs

The CLI interface provides some utility commands to get/list information from the cache.

Usage: url_cache [OPTIONS] COMMAND [ARGS]...

Options:
  --cache-dir PATH                Override default cache directory location
  --debug / --no-debug            Increase log verbosity
  --sleep-time INTEGER            How long to sleep between requests
  --summarize-html / --no-summarize-html
                                  Use readability to summarize html. Otherwise
                                  saves the entire HTML document

  --skip-subtitles / --no-skip-subtitles
                                  Skip downloading Youtube Subtitles
  --subtitle-language TEXT        Subtitle language for Youtube Subtitles
  --help                          Show this message and exit.

Commands:
  cachedir  Prints the location of the local cache directory
  export    Print all cached information as JSON
  get       Get information for one or more URLs Prints results as JSON
  in-cache  Prints if a URL is already cached
  list      List all cached URLs

An environment variable URL_CACHE_DIR can be set, which changes the default cache directory.

API Cache Examples

I've also successfully used this to cache responses from API results in some of my projects, by subclassing and overriding the request_data function. I just make a request and return a summary, and it transparently caches the rest. See:

CLI Examples

The get command emits JSON, so it could with other tools (e.g. jq) used like:

$ url_cache get "https://click.palletsprojects.com/en/7.x/arguments/" | \
  jq -r '.[] | .html_summary' | lynx -stdin -dump | head -n 5
Arguments

   Arguments work similarly to [1]options but are positional. They also
   only support a subset of the features of options due to their
   syntactical nature. Click will also not attempt to document arguments
$ url_cache export | jq -r '.[] | .metadata | .title'
seanbreckenridge - Overview
Arguments — Click Documentation (7.x)
url_cache list --location
/home/sean/.local/share/url_cache/data/2/c/7/6284b2f664f381372fab3276449b2/000
/home/sean/.local/share/url_cache/data/7/5/1/70fc230cd88f32e475ff4087f81d9/000
# to make a backup of the cache directory
$ tar -cvzf url_cache.tar.gz "$(url_cache cachedir)"

Accessible through the url_cache script and python3 -m url_cache

Implementation Notes

This stores all of this information as individual files in a cache directory. In particular, it MD5 hashes the URL and stores information like:

.
└── a
    └── a
        └── e
            └── cf0118bb22340e18fff20f2db8abd
                └── 000
                    ├── data
                    │   └── subtitles.srt
                    ├── key
                    ├── metadata.json
                    └── timestamp.datetime.txt

In other words, this is a file system hash table which implements separate chaining.

You're free to delete any of the directories in the cache if you want, this doesn't maintain a strict index, it uses a hash of the URL and then searches for a matching key file.

By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.

Originally created for HPI.


Testing

git clone 'https://github.com/seanbreckenridge/url_cache'
cd ./url_cache
pip install '.[testing]'
mypy ./src/url_cache
flake8 ./src/url_cache
pytest