The MetaLookup service extracts and suggests metadata related to OER from a given URL. To achieve this, it uses the Playwright python package to fetch the content of the URL, then analyzes the issued requests and responses as well as the final rendered HTML document. It further uses Lighthouse to assess the accessibility of the content. Due to the ever-changing nature of URL content, the service requires iterative optimization and continuous maintenance.
The documentation describing which metadata fields are extracted and how can be found here and here. It can be rendered as HTML with the following command:

```shell
mkdocs serve
```
The different extractors are implemented here.
MetaLookup requires Python 3.9 or later and Poetry. To test the application without mocking external services, Docker is required as well.
With Poetry installed, running the following commands will set up a poetry environment and install the application and development dependencies:

```shell
poetry env use 3.9
poetry install
```
The code style is enforced via tools configured in `.pre-commit-config.yaml` and `pyproject.toml`. After setting up the environment (which will install the python `pre-commit` script), one should install the hook via:

```shell
pre-commit install
```
This makes sure that commits follow the defined code style and that the respective CI pipeline step will pass.
Unit tests are divided into two categories. All tests make use of mocks and static (checked-in) resources, so they can run independently of any docker container or other environment constraints. FIXME: Issue #163
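To illustrate the mock-based approach described above, here is a minimal sketch of the pattern (not actual MetaLookup test code; the function names are hypothetical) in which an external fetch is replaced by static content so the test needs no docker container:

```python
# Illustrative sketch of mock-based testing; fetch_html and extract_title
# are hypothetical stand-ins, not real MetaLookup functions.
from unittest.mock import patch

def fetch_html(url: str) -> str:
    """Stand-in for a function that would drive a real browser."""
    raise RuntimeError("network access is not available in unit tests")

def extract_title(url: str) -> str:
    """Toy 'extractor' that parses the fetched document."""
    html = fetch_html(url)
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]

# In a test, the external call is patched with static, checked-in content:
with patch(f"{__name__}.fetch_html", return_value="<title>Example</title>"):
    title = extract_title("https://example.org")
```

The same pattern scales to mocking the Playwright and Lighthouse services, which is what allows the test suite to run without any containers.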
This repository follows the GitHub Flow branching model:
- Anything in the `main` branch is deployable
- To work on something new, create a descriptively named branch off of `main` (e.g. `new-oauth2-scopes`)
- Commit to that branch locally and regularly push your work to the same named branch on the server
- When you need feedback or help, or you think the branch is ready for merging, open a pull request
- After someone else has reviewed and signed off on the feature, you can merge it into `main`
- Once it is merged and pushed to `main`, you can and should deploy immediately
The pipeline is launched by creating or updating a branch and has to succeed before any pull request can be merged. Docker images built by the pipeline are tagged `latest` for the `main` branch and with a `vX.Y.Z` tag for release tags of the form `v{MAJOR}.{MINOR}.{PATCH}`.
MetaLookup requires the Playwright and Lighthouse containers to be running. To deploy them together, the two docker-compose files for dev and prod are used. These compose files also contain the configuration to run MetaLookup with a persistent cache via a postgres container.
This container contains the main API and implementation of MetaLookup. Other services should only communicate with this container; the other containers are considered internal and should not be exposed to the outside world.
Lighthouse is a software tool that analyses the accessibility of a website. This repository contains a custom dockerfile where this tool is packaged together with a browser (a google-chrome-headless image). The accessibility command line tool is exposed via a minimal custom-made HTTP API.
Playwright is a toolkit that allows remote-controlling a browser. We use the `browserless/chrome` image, which essentially provides "chrome as a service", and MetaLookup talks to it via a websocket using the playwright python package.
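As a rough sketch of how such a websocket connection can look with the playwright python package (the host name and port `3000` are assumptions matching a typical browserless compose setup, not values taken from this repository):

```python
# Sketch: talking to a browserless/chrome container via playwright.
# Host and port are illustrative assumptions, not repository configuration.

def ws_endpoint(host: str = "localhost", port: int = 3000) -> str:
    """Build the websocket URL under which browserless exposes chrome."""
    return f"ws://{host}:{port}"

def fetch_rendered_html(url: str) -> str:
    # Imported lazily so this sketch stays importable without playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Attach to the remote chrome via its CDP websocket endpoint.
        browser = p.chromium.connect_over_cdp(ws_endpoint())
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # the final rendered document
        browser.close()
        return html
```

This mirrors the division of labor described above: the browser lives in its own container, and the application only ships the thin playwright client.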
The postgres container provides a way to persist the cache for the Extractor container. It is optional, as caching can also be done via sqlite (for a single instance of MetaLookup) or disabled completely. Alternatively, a dedicated external postgres database can be used and configured. See `settings.py`.
The application can be configured via environment variables or a `.env` file. See `settings.py` for the available options.
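The precedence implied by this setup (environment variables override `.env` defaults) can be sketched as follows; the option name `CACHE_BACKEND` is purely illustrative, the real names live in `settings.py`:

```python
# Hypothetical configuration sketch; the real option names are defined
# in settings.py and may differ.
import os

def get_setting(name: str, default: str) -> str:
    """Environment variables take precedence over a built-in default."""
    return os.environ.get(name, default)

# Example: selecting the cache backend (name and values are illustrative).
os.environ["CACHE_BACKEND"] = "postgres"
backend = get_setting("CACHE_BACKEND", "sqlite")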
The service provides a `/_ping` endpoint which will respond with a `200 OK` (body: `{"status": "ok"}`) immediately (within a couple of milliseconds). If the request takes too long, or answers differently, then the service is in an inconsistent state and should be restarted. This endpoint is also configured in the docker-compose files to be used for health checks by the docker daemon, triggering automated restarts in case of unhealthy containers.
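The same check can be performed from python with only the standard library; this is a minimal sketch, and the port `8000` in the commented call is an assumption:

```python
# Minimal liveness probe against the /_ping endpoint (stdlib only).
import json
from urllib.request import urlopen

def parse_status(body: bytes) -> str:
    """Extract the status field from a /_ping response body."""
    return json.loads(body)["status"]

def is_healthy(base_url: str, timeout: float = 1.0) -> bool:
    # A healthy service answers 200 with {"status": "ok"} within the timeout.
    try:
        with urlopen(f"{base_url}/_ping", timeout=timeout) as response:
            return response.status == 200 and parse_status(response.read()) == "ok"
    except Exception:
        return False  # a timeout or malformed answer means: restart the service

# With a running container (the port is an assumption), one could check:
#   print(is_healthy("http://localhost:8000"))
```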
When the container is running, the endpoints can be tested e.g. with `curl`:

```shell
curl --location --request POST 'localhost:{your-port}/extract' \
--header 'Content-Type: application/json' \
--data-raw '{"url": "https://some-domain.org/index.html"}'
```
See the files `api.py` and `models.py` for the API and model documentation, or alternatively visit the Swagger-UI of the locally running service.
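The curl request above can equally be issued from python; this sketch uses only the standard library, and the base URL `http://localhost:8000` is an assumption standing in for `localhost:{your-port}`:

```python
# Calling the /extract endpoint from python instead of curl (stdlib only).
# The base URL below is an assumption; substitute your actual port.
import json
from urllib.request import Request, urlopen

def build_extract_request(base_url: str, url: str) -> Request:
    """POST /extract with a JSON body, mirroring the curl example."""
    return Request(
        f"{base_url}/extract",
        data=json.dumps({"url": url}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_extract_request(
    "http://localhost:8000", "https://some-domain.org/index.html"
)
# To actually send it (requires a running MetaLookup container):
#   with urlopen(request) as response:
#       print(json.loads(response.read()))
```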