thegreenwebfoundation / greencheck-api

The green web foundation API
https://www.thegreenwebfoundation.org/
Apache License 2.0

Set up release of data from the green web DB #9

Open mrchrisadams opened 5 years ago

mrchrisadams commented 5 years ago

We need to make a release of the content in the greenweb database, which would ideally be the same as the API results, to make documentation easier.

For reference, responses look like this (abridged, to remove the bits for the browser extension we don't use):

{
  "green": true,
  "url": "www.thegreenwebfoundation.org",
  "data": true,
  "hostedby": "LeaseWeb",
  "hostedbyid": 156,
  "hostedbywebsite": "www.leaseweb.com",
}

We'd need to query the greencheck table, then, for each url we have, fetch the result of the latest check.

Todo

arendjantetteroo commented 5 years ago

Not sure if querying this big table is the best way to do this, as it might be really slow. Did you already try it?

A better way might be storing the latest result for each url in another table (or Redis), depending on how much data this is.

mrchrisadams commented 5 years ago

Yeah, it's slow.

To get an idea of how long it might take, I ran this query to count how many unique urls we have:

  SELECT
    count(DISTINCT url)
  FROM
    `greencheck`;

The result was something like 20m records, so I figure the export would be a similar size. If the payload above is 166 characters and we have 20m of them, that works out to something like the following in bytes:

166 * 20_000_000 = 3,320,000,000

I think that would be around 3.3 gig. You could fit that in memory, but I'm not sure how much space we have set aside for Redis.

Running the query

Even if we kept this in Redis, I think we'd still need to run some hairy SQL query once to populate the datastructure, and I figure we'd need to re-run it once every month or so anyway.
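For reference, a minimal sketch of what that one-off query and export might look like, assuming the greencheck table has an auto-increment id column we can use to find the latest check per url (the DSN and column names here are guesses):

$pdo = new PDO('mysql:host=localhost;dbname=greenweb', 'user', 'pass');

// groupwise-max: join each url's highest id back to the full row,
// so each url only contributes its most recent check
$sql = "
  SELECT g.url, g.green, g.hostedby
  FROM greencheck g
  JOIN (
    SELECT url, MAX(id) AS max_id
    FROM greencheck
    GROUP BY url
  ) latest ON latest.max_id = g.id
";

foreach ($pdo->query($sql, PDO::FETCH_ASSOC) as $row) {
    fputcsv(STDOUT, $row);
}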

mrchrisadams commented 5 years ago

Okay, just checked the stats on Redis.

From what I can see, we're using less than half a gig of memory to store between 300k and 400k keys.

It looks like we do have a decent amount of headroom memory-wise on that box, so I'm gonna look at whether it's possible to store the latest result in Redis for a longer period of time than the current default of a few hours.

That way, cutting a release would be a case of dumping the contents of these keys to a csv file, which would be:

a) easy to understand and write code for
b) no real load on MySQL when we run this export

Looks like I'll have to dive into the greencheck code in more detail to see what's going on for this though.

arendjantetteroo commented 5 years ago

Check this class: https://github.com/thegreenwebfoundation/thegreenwebfoundation/blob/ff452d5e3bfa5ecd18f621e320d56edbd1ce419e/apps/api/src/Greencheck/Logger.php#L103

It already pushes the latest results to redis for the homepage. Adding another redis call here should be easy.

mrchrisadams commented 5 years ago

OK, this part looks fairly straightforward:

We'd typically use HSET for this, something like so:

$latest = new LatestResult();
$latest->setResult($result);

// keep a short rolling list of recent checks for the homepage
$this->redis->lpush('latest_checks', json_encode($latest));
$this->redis->ltrim('latest_checks', 0, 999);

// also set it into our longer-lasting 'domains' hash, keyed by url, which
// we generate CSVs from, or use as a caching layer in the API
$this->redis->hset('domains', $latest->url, json_encode($latest));

Just adding this will create the datastructure we'd need to query when fetching data from now on.

If we run the greenchecks against the current list of unique urls, we could populate the datastructure, and end up with a nice caching layer in the API too.

How the API might use this:

If the API had the option of checking only against Redis synchronously, while making the full greencheck async, we'd end up with a decoupling that means we could take MySQL offline if we needed to.

Also, because Redis persists to disk, we'd be able to rebuild a full cache quickly if we needed to serve results.

[diagram: Green Web Foundation components, API server caching]

This would give us the decoupling from MySQL described above.

For now though, I'm just going to focus on creating the datastructure. Updating the API would be a separate ticket, as it likely involves some CI setup of its own.

mrchrisadams commented 5 years ago

hey @arendjantetteroo, if we have this, I'm assuming the way we'd put this into the app would be to put a list of urls onto RabbitMQ, using enqueue, and then let the workers pull the requests off.

This would give us all the extra bits like tracking checks for stats, and it would generate pretty much the same responses the API generates. Presumably putting a big list of urls onto the queue is also faster than running a single process that could crash at any point and lose track of where we were in the list.

The downside is that we might need to put them on a separate queue, with separate workers, as this is lower priority than new incoming requests from actual users.

I'll have a look around to see how to set up queues and workers myself, but does this approach sound okay to you?
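As a rough sketch of the producer side, assuming enqueue's amqp-lib transport and a dedicated queue name (the DSN, queue name, and url list file are all guesses):

use Enqueue\AmqpLib\AmqpConnectionFactory;

// connect, and declare a dedicated low-priority queue for the backfill
$context = (new AmqpConnectionFactory('amqp://guest:guest@localhost:5672/%2f'))->createContext();
$queue = $context->createQueue('greencheck_backfill');
$context->declareQueue($queue);

$producer = $context->createProducer();

// one message per url; if this script crashes, the queue keeps whatever
// was already sent, so we don't lose our place
foreach (file('unique_urls.txt', FILE_IGNORE_NEW_LINES) as $url) {
    $producer->send($queue, $context->createMessage(json_encode(['url' => $url])));
}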

And I'm assuming we'd call the check function in the Checker class, the one with this signature:

public function check($url, $ip, $browser, $source, $blind)
{
    // do stuff to check the result and log it, which puts it into our redis hash
}
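On the worker side, reusing the $context and $queue from the sketch above, a hypothetical consume loop might look like this (the check() arguments beyond $url are guesses):

$consumer = $context->createConsumer($queue);

while (true) {
    // block for up to a second waiting for the next url to check
    if ($message = $consumer->receive(1000)) {
        $payload = json_decode($message->getBody(), true);
        // check() logs the result, which writes it into our redis hash
        $checker->check($payload['url'], null, 'api', 'backfill', false);
        $consumer->acknowledge($message);
    }
}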
arendjantetteroo commented 5 years ago

If we let the redis result pushing you indicated above with HSET run for about a week, we would already have a pretty big result set there. I think at that point it would be acceptable to always return from redis, even if there is no result: the next query to the api would return one, as the worker queue would by then have done a lookup.

I think there are so many variables/caches involved right now that eventual consistency is already at play, and the first lookup having a wrong result or returning an "unknown" state would be fine. The extension already shows a ? for the case where we don't have enough data (not sure on the specifics, need to look at the code).

In that case we can avoid the whole async waiting-for-the-queue setup: each api call just does an HMGET with the requested urls, then puts a message onto the queue but doesn't wait for an answer, as it no longer needs one. This would decouple the whole greencheck lookup code from the api part, which would make the api a lot faster as well.

So as I see it this would need the following changes:

  1. logResult has the HSET added
  2. the api code gets the result from redis and returns it, and pushes a new message onto the queue as we currently do, but without waiting for the result (see the sketch below)
  3. to fill the redis cache, we create a script to grab all unique urls from mysql and push them all into the queue

Bonus: we can make this configurable, or treat it like a write-through cache where, if redis is empty, we still fall back to the async queue method we currently have.
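A minimal sketch of step 2, assuming a Predis-style client in $this->redis, the 'domains' hash from the logger change above, and an injected enqueue context/producer (all the property and method names here are made up):

public function checkUrls(array $urls): array
{
    // one round trip: HMGET returns the stored latest check per url, or null
    $cached = $this->redis->hmget('domains', $urls);

    $results = [];
    foreach ($urls as $i => $url) {
        // fire and forget: queue a fresh greencheck, but don't wait for it
        $this->producer->send($this->queue, $this->context->createMessage(json_encode(['url' => $url])));

        if ($cached[$i] !== null) {
            $results[$url] = json_decode($cached[$i], true);
        } else {
            // nothing cached yet: return an "unknown" state; by the next
            // lookup the worker will have filled the hash in
            $results[$url] = ['green' => null, 'url' => $url, 'data' => false];
        }
    }

    return $results;
}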
mrchrisadams commented 5 years ago

Cool @arendjantetteroo, thanks for the extra background - that's really useful to understand.

This sounds like something we can deploy in stages: we can check that the datastructure is building up as we expect before making the changes to the API.

Even without the API changes, once the datastructure is in Redis, making the CSV generation work seems pretty straightforward. I'll make a PR for these changes.
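A sketch of that export, assuming the single 'domains' hash from above (field = url, value = the JSON-encoded latest check) and a Predis-style client:

$redis = new Predis\Client();

$out = fopen('greencheck-export.csv', 'w');
fputcsv($out, ['url', 'green', 'hostedby', 'hostedbyid', 'hostedbywebsite']);

// HSCAN walks the hash incrementally, so the export never blocks redis
$cursor = 0;
do {
    [$cursor, $fields] = $redis->hscan('domains', $cursor, ['COUNT' => 1000]);
    foreach ($fields as $url => $json) {
        $check = json_decode($json, true);
        fputcsv($out, [
            $url,
            $check['green'] ? 'true' : 'false',
            $check['hostedby'],
            $check['hostedbyid'],
            $check['hostedbywebsite'],
        ]);
    }
} while ((int) $cursor !== 0);

fclose($out);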

arendjantetteroo commented 5 years ago

Yep exactly, we can leave the api parts intact for now and start with just the redis store/csv part :)

Might be wise anyway to see what results end up in redis and how big it gets.

arendjantetteroo commented 4 years ago

Shall we close this one @mrchrisadams ?