Long comment, important things are bold.
As I worked on this, some changes to the plan popped up.
Instead of updating the old stack and making it work again, I decided to pretty much start from scratch and re-build the entire stack. The reasons:
The new solution is based on a collection of relatively independent components that handle the entire process in multiple steps. The new stack has a couple of nice properties. It allows us to:
The code for this is done and available to the public, the stack is set up and running, and all issue data has been imported. This is currently pending a team-internal announcement, which I'll probably pre-record so that Guillaume and Kate have access to it as well.
A possible improvement here is to extend the indexer task to extract the URL from the report data, so that both the full URL and the base domain are available as their own fields in the JSON. This would allow some more advanced statistics that are hard to do based on full-text searches alone.
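In case it helps, here is a minimal sketch of what that indexer extension could look like, in Python. The `**URL**:` marker is an assumption about how web-bugs reports are formatted, and the base-domain logic is deliberately naive:

```python
import re
from urllib.parse import urlsplit

# Assumed report format: the issue body contains a line like "**URL**: https://...".
URL_RE = re.compile(r"\*\*URL\*\*:\s*(\S+)", re.IGNORECASE)

def extract_url_fields(issue_body):
    """Pull the reported URL out of an issue body and derive the base domain."""
    match = URL_RE.search(issue_body or "")
    if not match:
        return {}
    url = match.group(1)
    host = urlsplit(url).hostname or ""
    # Naive base-domain guess: the last two labels of the hostname. A real
    # implementation should use the Public Suffix List to handle e.g. .co.uk.
    base_domain = ".".join(host.split(".")[-2:]) if host else ""
    return {"url": url, "base_domain": base_domain}
```

The indexer could then merge these two fields into the JSON document before handing it to ElasticSearch.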
All the other fields, like the Operating System or the Browser Version, are also available as labels, but since the indexing step is separate, we can make adjustments as we deem necessary.
The internal presentation is done. I also ported over the significant portions of Adam's dashboard. Because the data is there anyway, we can adjust and expand this stuff at any time without having to "start from scratch".
I will not have time to work on more dashboards and more data-based answers this year. I partly blame the fact that this project started a bit later than I would have hoped - which was not in my control. But I'm still very happy with the state we're in now.
The initial plan had "Build read-only JSON endpoints" as an optional feature, but since I've re-built the whole stack from the ground up, this turned out to be a central feature of everything. The new stack basically downloads the JSON responses from the GitHub API, stores them on disk, and then indexes those JSON files into ES. This means we have a full dump of all .json files for all web-bugs available on disk in real-time. I can also just serve that over HTTP (which I'm doing) and keep daily snapshots (which I'm also doing). This is good insurance for business continuity if GitHub decides to disable the repo again. :)
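To make the shape of that pipeline a bit more concrete, here is a minimal sketch of the download-store-index step in Python. The dump directory, the ES host, and the index name are placeholders, not the actual configuration:

```python
import json
from pathlib import Path

import requests
from elasticsearch import Elasticsearch

DUMP_DIR = Path("/srv/web-bugs-dump")        # placeholder path
ES = Elasticsearch("http://localhost:9200")  # placeholder ES host

def fetch_and_index(issue_number):
    """Download one issue's JSON from the GitHub API, keep the raw file on
    disk (the part that is served over HTTP and snapshotted), then index it."""
    resp = requests.get(
        f"https://api.github.com/repos/webcompat/web-bugs/issues/{issue_number}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    issue = resp.json()

    # Keep the raw JSON on disk; this is the real-time dump.
    DUMP_DIR.mkdir(parents=True, exist_ok=True)
    (DUMP_DIR / f"{issue_number}.json").write_text(json.dumps(issue, indent=2))

    # Index the same document into ElasticSearch (`document=` is the
    # elasticsearch-py 8.x spelling; older clients use `body=`).
    ES.index(index="web-bugs", id=issue_number, document=issue)
```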
I also already have the first project based on the exposed ElasticSearch API: the Softvision team currently does by hand something that we can completely automate using the data we have now. Unfortunately, this idea only came up this week, so I have not had a chance to work on it yet, but it's exciting nonetheless.
We currently have some blind spots in terms of data and analysis that prevent us from effectively answering a couple of questions we have. Luckily, the data is not too hard to gather, and pretty much all of the web-bug data has already been imported into ElasticSearch; we just have to make that more usable.
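As one illustration of what "more usable" could mean in practice, here is a minimal query sketch against the indexed data. The host, index name, and `base_domain` field are assumptions (the field would come from the indexer extension described above, and would need a keyword mapping):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder ES host

# Top ten most-reported base domains, assuming a keyword-mapped
# `base_domain` field in a "web-bugs" index.
response = es.search(
    index="web-bugs",
    size=0,
    aggs={"top_domains": {"terms": {"field": "base_domain", "size": 10}}},
)
for bucket in response["aggregations"]["top_domains"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```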
To end up in a better state, we should
- Update the software stack in use to recent versions.
- Investigate and fix the current authentication issue, or replace it with something simpler.
- Re-format the already existing data from all web-bugs into a format that allows us to query individual issue events, not just the issues themselves (see the sketch at the end of this comment).

When that's done, we can
Optionally, if there is time left
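For the event re-formatting item above, here is a minimal sketch of how the already-stored issue JSON could be split into per-event documents. The `events` field name and the dump layout are assumptions about the data, not the actual format:

```python
import json
from pathlib import Path

def issue_to_event_docs(issue_path):
    """Split one stored web-bugs issue JSON into per-event documents,
    so that individual issue events become directly queryable."""
    issue = json.loads(Path(issue_path).read_text())
    for event in issue.get("events", []):      # assumed field name
        yield {
            "issue_number": issue["number"],
            "event": event.get("event"),       # e.g. "labeled", "closed"
            "actor": (event.get("actor") or {}).get("login"),
            "created_at": event.get("created_at"),
            "label": (event.get("label") or {}).get("name"),
        }
```

Each yielded document would then go into a separate events index, which makes questions like "how many issues were closed per week" a simple aggregation instead of a full-text search.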