lobsters / lobsters

Computing-focused community centered around link aggregation and discussion
https://lobste.rs

Redo traffic level algorithm #536

Open pushcx opened 6 years ago

pushcx commented 6 years ago

I was in production.log this morning as I banned a returning spammer and decided to check on a hunch that's been growing in my mind about the traffic level that determines the color of the site logo (see ApplicationController#increase_traffic_counter):

$ grep -ho 'Traffic level: .*' log/production.log* | sort -bn --key=3 | uniq -c
      2 Traffic level: 7
      6 Traffic level: 8
     10 Traffic level: 9
      6 Traffic level: 10
      4 Traffic level: 11
      1 Traffic level: 12
      2 Traffic level: 13
      1 Traffic level: 14
      1 Traffic level: 15
      4 Traffic level: 16
      5 Traffic level: 17
      1 Traffic level: 18
      2 Traffic level: 19
      2 Traffic level: 20
      1 Traffic level: 21
      1 Traffic level: 22
      2 Traffic level: 23
      2 Traffic level: 24
      7 Traffic level: 25
      9 Traffic level: 26
     10 Traffic level: 27
      6 Traffic level: 28
     10 Traffic level: 29
      8 Traffic level: 30
      4 Traffic level: 31
     13 Traffic level: 32
      5 Traffic level: 33
      1 Traffic level: 34
      3 Traffic level: 35
      6 Traffic level: 36
      8 Traffic level: 37
     12 Traffic level: 38
      6 Traffic level: 39
     11 Traffic level: 40
     15 Traffic level: 41
      9 Traffic level: 42
      4 Traffic level: 43
      3 Traffic level: 44
      5 Traffic level: 45
      1 Traffic level: 46
      1 Traffic level: 47
      7 Traffic level: 48
      9 Traffic level: 49
     14 Traffic level: 50
     16 Traffic level: 51
     19 Traffic level: 52
      5 Traffic level: 53
      3 Traffic level: 54
      3 Traffic level: 55
      3 Traffic level: 56
      3 Traffic level: 57
      4 Traffic level: 58
      1 Traffic level: 59
     12 Traffic level: 60
     19 Traffic level: 61
     15 Traffic level: 62
      9 Traffic level: 63
      9 Traffic level: 64
      8 Traffic level: 65
      7 Traffic level: 66
      8 Traffic level: 67
     15 Traffic level: 68
      7 Traffic level: 69
      2 Traffic level: 70
      3 Traffic level: 71
      7 Traffic level: 72
     17 Traffic level: 73
     12 Traffic level: 74
      5 Traffic level: 75
      3 Traffic level: 76
      5 Traffic level: 77
     15 Traffic level: 78
     24 Traffic level: 79
     43 Traffic level: 80
     31 Traffic level: 81
     37 Traffic level: 82
     19 Traffic level: 83
     11 Traffic level: 84
     27 Traffic level: 85
     49 Traffic level: 86
     82 Traffic level: 87
    131 Traffic level: 88
    173 Traffic level: 89
    292 Traffic level: 90
    471 Traffic level: 91
    766 Traffic level: 92
   1222 Traffic level: 93
   1945 Traffic level: 94
   3211 Traffic level: 95
   5529 Traffic level: 96
  10516 Traffic level: 97
  21634 Traffic level: 98
  57246 Traffic level: 99
 119360 Traffic level: 100

This is not a useful distribution; we're spending nearly all day at 97-100%. Weekday traffic follows the American workday, with pretty sharp divisions as the east coast wakes up and the west coast falls asleep. The intensity algorithm doesn't reflect this, let alone account for days where traffic is genuinely higher because twitter/yc news sent a flood our way. Maybe this should even be based only on logged-in users, to spare the db hit on every visit? Maybe there are thoughts in the git history?

I haven't thought at all about a better approach, just wanted to toss this up in the hopes someone wants a fun puzzle.

pushcx commented 6 years ago

quick graph of how extreme this distribution is:

[graph of the traffic level distribution]

pushcx commented 6 years ago

To be explicit: I think skipping the SELECT entirely and doing a single UPDATE on the current traffic counter that RETURNs the new value, pushing all the work to the db, is probably a mandatory element of this. It's OK to drop down to raw queries for such a hot-path piece of code that's called on every hit!
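
Roughly what I have in mind, as a sketch only: it assumes a database that supports UPDATE ... RETURNING (e.g. PostgreSQL) and a hypothetical keystores key/value table, so the real schema and adapter may differ.

    def increment_traffic_counter!
      # Sketch: bump the counter and read the new value back in one round trip.
      # Assumes UPDATE ... RETURNING support and a hypothetical keystores table
      # with key/value columns; not the app's actual schema.
      result = ActiveRecord::Base.connection.exec_query(<<~SQL)
        UPDATE keystores
           SET value = value + 1
         WHERE key = 'traffic:hits'
        RETURNING value
      SQL
      result.first["value"]
    end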

ngoldbaum commented 6 years ago

I bet if you colored the icon by log10(traffic) instead of just the raw traffic number you'd get a more evenly distributed curve.
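Something like this, just as a sketch; the 0-100 scale and the MAX_EXPECTED_HITS calibration constant are made up for illustration, not real settings:

    # Illustrative only: map raw hits onto 0-100 logarithmically, so each order
    # of magnitude of traffic moves the level by the same amount.
    MAX_EXPECTED_HITS = 100_000.0  # made-up calibration constant

    def log_traffic_level(hits)
      return 0 if hits < 1
      level = 100 * Math.log10(hits.to_f) / Math.log10(MAX_EXPECTED_HITS)
      level.clamp(0, 100).round
    end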

alanpost commented 6 years ago

I have noticed the traffic graph is almost always the same, between ~97 and 100. Given the distribution, you could decile by frequency such that each bucket has the same number of observations. That would have the effect of putting the outliers in decile 10 and otherwise uniformly distributing the traffic.
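
A rough sketch of the idea, assuming we had a recent sample of raw readings to rank against; where that sample comes from is left open:

    # Sketch: rank the current reading against a recent sample so each decile
    # holds roughly the same number of observations. recent_counts is a
    # hypothetical array of recent raw traffic readings.
    def frequency_decile(current, recent_counts)
      return 1 if recent_counts.empty?
      sorted = recent_counts.sort
      rank   = sorted.count { |c| c <= current }    # readings at or below ours
      decile = (10.0 * rank / sorted.size).ceil     # 1..10
      decile.clamp(1, 10)
    end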

zenzora commented 5 years ago

Hey all, long-time lurker, first-time contributor. Saw the "good first issue" tag on this and thought it might be fun to take a crack at it.

I'm thinking a possible solution would be to compare the traffic in the last arbitrary time period (maybe an hour) to the one before that.

The DB would have 4 keys:

current_period_expiration: Time when the current period expires
current_period_traffic: Counter for how much traffic is coming in this period
last_period_traffic: Counter for how much traffic there was last period
traffic_intensity: Intensity based on the relationship between traffic from the current and last periods at time of expiration

On each request the server will

  1. Check if the current period has expired. If it has:
     1a) compare last_period_traffic to current_period_traffic to calculate traffic_intensity
     1b) set last_period_traffic to current_period_traffic
     1c) increase the current_period_expiration by 1 period

  2. Increment current_period_traffic by 1

  3. Call set_traffic_style with traffic_intensity

Steps 1 and 2 can be skipped if the user agent is a bot or if the server is in read-only mode. We can also do random sampling for step 2 to reduce the number of writes, so maybe only 1 out of every 100 requests increments the counter.
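
A rough sketch of that flow; the Keystore helpers, the bot/read-only predicates, and the intensity formula are all placeholders rather than the app's real API:

    PERIOD = 3600       # one hour, in seconds
    SAMPLE_RATE = 100   # only ~1 in 100 requests actually writes

    def track_traffic!
      unless request_from_bot? || read_only_mode?   # hypothetical predicates
        now = Time.now.to_i

        # Step 1: roll the period over once it has expired.
        if now >= Keystore.value_for("traffic:current_period_expiration").to_i
          current = Keystore.value_for("traffic:current_period_traffic").to_i
          last    = Keystore.value_for("traffic:last_period_traffic").to_i
          # One arbitrary way to squeeze two counts into 0-100: the current
          # period's share of the last two periods' combined traffic.
          intensity = (current + last).zero? ? 50 : (100.0 * current / (current + last)).round
          Keystore.put("traffic:traffic_intensity", intensity)
          Keystore.put("traffic:last_period_traffic", current)
          Keystore.put("traffic:current_period_traffic", 0)
          Keystore.put("traffic:current_period_expiration", now + PERIOD)
        end

        # Step 2: count this request, randomly sampled to reduce writes.
        if rand(SAMPLE_RATE).zero?
          hits = Keystore.value_for("traffic:current_period_traffic").to_i
          Keystore.put("traffic:current_period_traffic", hits + 1)
        end
      end

      # Step 3: style the logo from whatever intensity is stored.
      set_traffic_style(Keystore.value_for("traffic:traffic_intensity").to_i)
    end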

Things I would need input/help on

pushcx commented 5 years ago

Step 1 is going to need a read lock to avoid races, which is pretty expensive. Why would we compare one hour to the previous? How do we derive a 0-100 traffic_intensity from two data points, and would that cause a discontinuity at the boundary of the hour?

Random sampling in step 2 is a great idea.

alanpost commented 5 years ago

I would rather see a moving or rolling average for traffic than quantized time periods, myself. You'd be able to determine which direction it was going, though on its own we'd still have the problem of how extreme the distribution is.
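
For example, an exponentially weighted moving average avoids hard period boundaries; a sketch only, with an illustrative decay constant and the same hypothetical Keystore helpers:

    ALPHA = 0.1   # illustrative decay: higher reacts faster, lower is smoother

    # Sketch: fold the latest requests-per-minute reading into an exponentially
    # weighted moving average kept in the (hypothetical) Keystore.
    def update_rolling_average(requests_this_minute)
      previous = Keystore.value_for("traffic:ema").to_f
      ema = ALPHA * requests_this_minute + (1 - ALPHA) * previous
      Keystore.put("traffic:ema", ema)
      ema
    end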

zenzora commented 5 years ago

Good point about the read lock; it might be too expensive if we have short periods. My thought was that it would just give a quick-and-dirty figure based on how much traffic increased or decreased from the previous period. It would result in a delay of one period, so maybe the periods should be shorter than an hour.

I guess I should take a step back and ask whether you already have some sort of metrics monitoring going on. Ideally, if you had something like Prometheus already running, we could just ask it where the current period ranks in relation to previous ones.

pushcx commented 5 years ago

We don't have Prometheus. The full tech stack is over in the ansible repo: https://github.com/lobsters/lobsters-ansible

...which points to a much better shape of solution, like a cron job that runs every 10m, greps the nginx log for 'yyyy-mm-dd hh:m.:..', and does a single insert.
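
Roughly like this, as a sketch: it assumes the script is invoked through rails runner so ActiveRecord is loaded, and the log path, timestamp format, and TrafficSample model are all assumptions about the deployment.

    # Sketch only: count access-log lines stamped in the previous ten-minute
    # bucket and record a single row. Log path, timestamp format, and the
    # TrafficSample model are placeholders, not the real setup.
    log_path = "/var/log/nginx/access.log"
    window   = Time.now - 600                             # ten minutes ago
    stamp    = window.strftime("%Y-%m-%d %H:%M")[0..-2]   # "yyyy-mm-dd hh:m" matches one 10-minute bucket

    hits = File.foreach(log_path).count { |line| line.include?(stamp) }

    TrafficSample.create!(recorded_at: window, hits: hits) # the single insert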

zenzora commented 5 years ago

Awesome, that's more my speed. What's the deal with Elasticsearch? I see it installed via ansible, but I'm not finding any references to it in the application. Would it be useful to have nginx logs in ES? If so, a cron could run queries directly against it.

EDIT: Looking at the ansible repo, it seems there was some talk about installing netdata, which would work great too. Is that option still on the table? I wouldn't mind giving it a shot.

alanpost commented 5 years ago

zenzora, your suggestion of Prometheus (or, I presume, some equivalent tool) is solid, and it makes pushcx's suggestion of running cron an effective means of getting the same information.

You'll find the ElasticSearch code in https://github.com/lobsters/lobsters/pull/579

Adding this middleware has some ongoing maintenance issues that I'm still sorting out. I am anxious to have it done but haven't made the time yet.

pushcx commented 5 years ago

Before you go too far down this road: I have no idea what prometheus/netdata are, and I would like to hear the case for adding a moving part to our deployment over Keystore/grepping logs.

zenzora commented 5 years ago

Hey Pushcx,

Prometheus is a metrics-gathering tool with a time-series db, which might be overkill considering that your setup doesn't involve that much infrastructure. I was only asking if you had something similar already set up.

Netdata is a lightweight application that gives you easy monitoring of servers with integrated dashboards; it also contains a time-series DB and an API, which we could poll to get the relevant info. The discussion I'm referring to is here:

https://github.com/lobsters/lobsters-ansible/issues/17

I'll hang around the #lobsters IRC channel if you want to discuss.

pushcx commented 5 years ago

Took a stab at this in cc7e535. I'm going to leave this issue open for a week or so to see if it feels good.