mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Design query performance metrics around real queries #298

Closed: rahulbot closed this issue 2 hours ago

rahulbot commented 4 weeks ago

One idea for monitoring the performance of our system is to measure how long some characteristic queries take to run, and then try to improve their speed (split off from #277). Our research team shared some initial ideas about what those characteristic queries might include. I'm combining those here with some of my previous ideas and some queries from our Data Against Feminicide project. How can we design something benchmark-y that uses the performance of these types of queries as one metric for system performance?

Simple "demo style" queries:

MEAG Research Queries:

Anti-feminicide project queries (get run every night):

pgulley commented 5 days ago

What collections are the anti-feminicide queries run against? And what timeframe?

rahulbot commented 5 days ago

Feminicide queries could be run with a "last two weeks" timeline against the following collections:

(In reality they run with varying dates, using indexed_date filters to fetch stories that have appeared since the last run.)
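
For illustration, the "last two weeks" benchmark described above could be timed roughly as follows. This is only a sketch under stated assumptions, not the code that actually runs: `run_query` is a hypothetical callable standing in for whatever client executes a search, and the query names and strings are placeholders.

```python
import time
from datetime import date, timedelta
from typing import Callable, Dict, Tuple


def last_two_weeks(today: date) -> Tuple[date, date]:
    """Return the (start, end) dates of a two-week window ending today."""
    return today - timedelta(days=14), today


def time_queries(
    queries: Dict[str, str],
    run_query: Callable[[str, date, date], object],
    today: date,
) -> Dict[str, float]:
    """Run each named query once and record wall-clock seconds.

    `run_query` is a hypothetical stand-in for the real search client;
    it receives the query string plus the start and end dates.
    """
    start, end = last_two_weeks(today)
    timings: Dict[str, float] = {}
    for name, query in queries.items():
        t0 = time.perf_counter()
        run_query(query, start, end)  # result discarded, only timing kept
        timings[name] = time.perf_counter() - t0
    return timings


if __name__ == "__main__":
    # Placeholder runner that does no real work, for demonstration only.
    def fake_runner(query: str, start: date, end: date) -> None:
        return None

    print(time_queries({"demo": "placeholder query"}, fake_runner, date.today()))
```

Timing a single run per query keeps the daily job cheap; repeating each query and reporting a median would reduce noise, at the cost of more load on the system.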

pgulley commented 2 hours ago

This is now running with all the above queries in a little daily Prefect task here: https://github.com/mediacloud/system-metrics. Metrics are in a dashboard titled "Daily Performance Benchmarks" on Grafana.

philbudne commented 34 minutes ago

I see:

stats_directory = "system-metrics.query-benchmark"

The convention I've been using for mediacloud stats (documented in story-indexers/doc/choices.md) is:

mc.REALM.PROJECT.PROGRAM.NAME[.LABEL=VALUE....]

where "mc" keeps all mediacloud stats together and "REALM" allows keeping prod/staging/user stats in the same tree.
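
As a rough illustration of that naming convention, a small helper could assemble the dotted path. This is a sketch, not code from story-indexer: the function name `stat_name`, the metric name `elapsed`, and the `query=demo` label are all hypothetical, and emitting labels in insertion order is an assumption.

```python
from typing import Mapping, Optional


def stat_name(
    realm: str,
    project: str,
    program: str,
    name: str,
    labels: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a name of the form mc.REALM.PROJECT.PROGRAM.NAME[.LABEL=VALUE...]."""
    parts = ["mc", realm, project, program, name]
    # Assumption: labels are appended in the order given.
    for label, value in (labels or {}).items():
        parts.append(f"{label}={value}")
    return ".".join(parts)


if __name__ == "__main__":
    # Hypothetical metric name and label, for illustration only.
    print(stat_name("prod", "system-metrics", "query-benchmark",
                    "elapsed", {"query": "demo"}))
```

With those example arguments the helper yields `mc.prod.system-metrics.query-benchmark.elapsed.query=demo`, which keeps the benchmark stats under the same "mc" tree as the other mediacloud stats.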