This package extracts the domain, title, publication date, text, and language from the URL or text of an online news story. The methods for each are taken from the larger Media Cloud project, and also build on numerous third-party libraries.
Other often-reused methods and configuration related to the mediacloud service also live in this package.
pip install mediacloud-metadata
If you pass in a URL, it will follow redirects and fetch the HTML for you.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")
You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
html_text="<html><head><title>my webpage ... </html>")
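Here is a minimal sketch of how a caller might pick a few fields out of the result, covering both the URL-only and the URL-plus-HTML cases above. It assumes that extract returns a dict and that the key names used below (canonical_domain, article_title, publication_date, language, text_content) are present; those names are assumptions for illustration, so check them against the dict you actually get back. The summarize_story helper and its extractor parameter are hypothetical and are not part of this package.

```python
from typing import Any, Callable, Dict, Optional

# Assumed key names; verify against the dict that extract() really returns.
WANTED_KEYS = (
    "canonical_domain",
    "article_title",
    "publication_date",
    "language",
    "text_content",
)


def summarize_story(
    url: str,
    html_text: Optional[str] = None,
    extractor: Optional[Callable[..., Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: run an extractor and keep only a few fields."""
    if extractor is None:
        # Imported lazily so the helper can be exercised with a stub extractor.
        from mcmetadata import extract as extractor

    if html_text is None:
        # URL only: the library fetches the HTML itself.
        metadata = extractor(url=url)
    else:
        # HTML already on hand: still pass the URL alongside it.
        metadata = extractor(url=url, html_text=html_text)

    # .get() yields None for any assumed key that is not actually present.
    return {key: metadata.get(key) for key in WANTED_KEYS}
```

Calling summarize_story("https://my.awesome.news/story-path") would then fetch the page through the library, while passing html_text as well reuses content you already hold. Using .get means a wrong guess about a key name shows up as None rather than as a KeyError.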
If you are interested in adding code to this module, first clone the GitHub repository.
flit install
pre-commit install
pytest
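Run pytest to make sure all the tests pass
Update the version number in pyproject.toml
Update CHANGELOG.md with a note about what changed
Tag the commit with a version number of the form v*.*.*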
Tests are run against fixtures by default. To change this, pass '--use-cache=False' when running tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.
Created as part of the Media Cloud Project. Contributors include: