Closed anjackson closed 2 years ago
The main downside of using Metabase is that it only really support SQL, not e.g. Solr. But given none of these things support Solr all that well, that's not that big a problem. It just means extending the current systems to push some data into a PostgreSQL or compatible DB to make analysis easy. This is likely where we were headed anyway.
It can be used as-is to generate reports on W3ACT. The data model is a bit clumsy, and it needs a read-only user to be set up, but it's not a bad idea.
Ended up deciding to move away from Metabase as it is a big new component and it wasn't really doing anything we don't already do with Grafana. It can talk to Postgres and ElasticSearch.
PostgreSQL is not as easy to get started with, but e.g. for Documents on the W3ACT DB:
SELECT
$__timeGroup(updated_at, '24h', 0) as time,
count(id) as submitted
FROM
document
WHERE
$__timeFilter(updated_at) AND status = 2
GROUP BY
time
ORDER BY time
REF:
public enum Status {
NEW,
SAVED,
SUBMITTED,
IGNORED,
DELETED;
}
This fancier version converts the status to a string, so Grafana knows to plot the Status as separate series, and does it using a temporary mapping table rather than ::text
casting.
WITH m (k, v) as (values (0, 'New'),(1, 'Saved'),(2, 'Submitted'),(3, 'Ignored'),(4, 'Deleted'))
SELECT
$__timeGroup(updated_at, '24h', 0) as time,
m.v as kind,
count(id)
FROM
document
JOIN m
ON status = m.k
WHERE
$__timeFilter(updated_at)
GROUP BY
time, kind
ORDER BY 1
This, followed by a rows-to-series transformation, give a Pie chartable overview of documents by status
WITH m (k, v) as (values (0, 'New'),(1, 'Saved'),(2, 'Submitted'),(3, 'Ignored'),(4, 'Deleted'))
SELECT
m.v as state,
count(id)
FROM document
JOIN m ON status = m.k
WHERE
$__timeFilter(updated_at)
GROUP BY state
ORDER BY 1
Similarly, but long-windedly, pulling in the Target Titles:
SELECT
target.title,
count(document.id)
FROM document
JOIN watched_target ON watched_target.id = document.id_watched_target
JOIN target ON target.id = watched_target.id_target
WHERE
$__timeFilter(document.updated_at)
GROUP BY target.title
ORDER BY count(*) DESC
LIMIT 20
With a good deal of help from all over, I've got a mostly-working SolrWayback integration, and a not-really-working Blacklight integration.
With SolrWayback, the issue is down to the dependence on url_norm
for playback, so some bits don't work because our indexes won't have url_norm
unless we reindex.
With Blacklight, the issue is that I've been trying to run two controllers in one Rails app, each pointing at one of two different Solr services. This does not seem to work, as various parts of the system expect a single CatalogController
.
Switched to just searching the WARCs with Blacklight, would be better based on Warclight as this has more build-in functionality. It'll do for now.
Updated deployment on BETA. Note that we're running the ukwa/solrwayback:docker-hub-action
version at present.
Switched to ghcr.io/ukwa/solrwayback:master
.
Recent Screenshots isn't exactly elegant, but working at https://www.webarchive.org.uk/act/nbapps/voila/render/crawler/recent-screenshots.ipynb
See the old Ingest Stack ideas: https://github.com/anjackson/ukwa-services/tree/intranet-2020/manage/intranet
These should be part of the core W3ACT stack, using the new auth mechanism, etc.
Use Metabase as main report system, rather than a faceted browser. Fall back on Voila notebooks if necessary.[ ] TrackDB browser ???[ ] Airflow Dashboard ?