ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0
4 stars 5 forks source link

Integrate reporting systems into W3ACT stack #39

Closed anjackson closed 2 years ago

anjackson commented 3 years ago

See the old Ingest Stack ideas: https://github.com/anjackson/ukwa-services/tree/intranet-2020/manage/intranet

These should be part of the core W3ACT stack, using the new auth mechanism, etc.

anjackson commented 3 years ago

The main downside of using Metabase is that it only really support SQL, not e.g. Solr. But given none of these things support Solr all that well, that's not that big a problem. It just means extending the current systems to push some data into a PostgreSQL or compatible DB to make analysis easy. This is likely where we were headed anyway.

It can be used as-is to generate reports on W3ACT. The data model is a bit clumsy, and it needs a read-only user to be set up, but it's not a bad idea.

anjackson commented 2 years ago

Ended up deciding to move away from Metabase as it is a big new component and it wasn't really doing anything we don't already do with Grafana. It can talk to Postgres and ElasticSearch.

PostgreSQL is not as easy to get started with, but e.g. for Documents on the W3ACT DB:

SELECT
  $__timeGroup(updated_at, '24h', 0) as time,
  count(id) as submitted
FROM
  document
WHERE
  $__timeFilter(updated_at) AND status = 2
GROUP BY
  time
ORDER BY time

REF:

    public enum Status {
        NEW,
        SAVED,
        SUBMITTED,
        IGNORED,
        DELETED;
    }
anjackson commented 2 years ago

This fancier version converts the status to a string, so Grafana knows to plot the Status as separate series, and does it using a temporary mapping table rather than ::text casting.

WITH m (k, v) as (values (0, 'New'),(1, 'Saved'),(2, 'Submitted'),(3, 'Ignored'),(4, 'Deleted'))

SELECT
  $__timeGroup(updated_at, '24h', 0) as time,
  m.v as kind,
  count(id)
FROM
  document
JOIN m
  ON status = m.k
WHERE
  $__timeFilter(updated_at)
GROUP BY
  time, kind
ORDER BY 1
anjackson commented 2 years ago

This, followed by a rows-to-series transformation, give a Pie chartable overview of documents by status

WITH m (k, v) as (values (0, 'New'),(1, 'Saved'),(2, 'Submitted'),(3, 'Ignored'),(4, 'Deleted'))

SELECT
  m.v as state,
  count(id)
FROM document
JOIN m ON status = m.k
WHERE
  $__timeFilter(updated_at)
GROUP BY state
ORDER BY 1
anjackson commented 2 years ago

Similarly, but long-windedly, pulling in the Target Titles:

SELECT
  target.title,
  count(document.id)
FROM document
JOIN watched_target ON watched_target.id = document.id_watched_target
JOIN target ON target.id = watched_target.id_target
WHERE
  $__timeFilter(document.updated_at)
GROUP BY target.title
ORDER BY count(*) DESC
LIMIT 20
anjackson commented 2 years ago

With a good deal of help from all over, I've got a mostly-working SolrWayback integration, and a not-really-working Blacklight integration.

With SolrWayback, the issue is down to the dependence on url_norm for playback, so some bits don't work because our indexes won't have url_norm unless we reindex.

With Blacklight, the issue is that I've been trying to run two controllers in one Rails app, each pointing at one of two different Solr services. This does not seem to work, as various parts of the system expect a single CatalogController.

anjackson commented 2 years ago

Switched to just searching the WARCs with Blacklight, would be better based on Warclight as this has more build-in functionality. It'll do for now.

anjackson commented 2 years ago

Updated deployment on BETA. Note that we're running the ukwa/solrwayback:docker-hub-action version at present.

anjackson commented 2 years ago

Switched to ghcr.io/ukwa/solrwayback:master.

anjackson commented 2 years ago

Recent Screenshots isn't exactly elegant, but working at https://www.webarchive.org.uk/act/nbapps/voila/render/crawler/recent-screenshots.ipynb