m-lab / traceroute-caller

A sidecar service which runs traceroute after a connection closes
Apache License 2.0
18 stars 5 forks source link

Add hop annotation support #120

Closed SaiedKazemi closed 3 years ago

SaiedKazemi commented 3 years ago

This commit adds hop annotation support (hopannotation1 datatype) to traceroute-caller.

For details, run the following command:

go doc -all hopannotation

The contents of a typical hop annotation file looks like the following:

  {
    "ID": "20210824_2d638011386a_91.189.91.39",
    "Timestamp": "2021-08-24T15:27:59Z",
    "Annotations": {
      "Geo": {
    "ContinentCode": "NA",
    "CountryCode": "US",
    "CountryName": "United States",
    "Subdivision1ISOCode": "MA",
    "Subdivision1Name": "Massachusetts",
    "MetroCode": 506,
    "City": "Boston",
    "PostalCode": "02112",
    "Latitude": 42.3562,
    "Longitude": -71.0631,
    "AccuracyRadiusKm": 200
      },
      "Network": {
    "CIDR": "91.189.88.0/21",
    "ASNumber": 41231,
    "ASName": "Canonical Group Limited",
    "Systems": [
      {
        "ASNs": [
          41231
        ]
      }
    ]
      }
    }
  }

The changes were successfully tested with:

go test ./... go test -race ./... docker-compose up


This change is Reviewable

coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 365


Changes Missing Coverage Covered Lines Changed/Added Lines %
connectionlistener/connectionlistener.go 13 15 86.67%
hopannotation/hopannotation.go 112 118 94.92%
<!-- Total: 125 133 93.98% -->
Files with Coverage Reduction New Missed Lines %
tracer/scamper.go 2 94.29%
<!-- Total: 2 -->
Totals Coverage Status
Change from base Build 355: -0.7%
Covered Lines: 612
Relevant Lines: 644

💛 - Coveralls
gfr10598 commented 3 years ago

hopannotation/hopannotation.go, line 168 at r1 (raw file):

      hc.hopsLock.Lock()
      if _, ok := hc.hops[hop]; !ok {
          hc.hops[hop] = inserted

Please add a metric. I'm inclined to think we need to be better about adding metrics before pushing to prod. Since this is a candidate for prod, it should have prom metrics.

SaiedKazemi commented 3 years ago

.gitignore, line 1 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Not clear to me that .gitignore should be in source code control. Prefer that you remove it.

.gitignore is tracked to share the rule list with other users. There are more than 100 M-Lab repos that track .gitignore.

SaiedKazemi commented 3 years ago

connectionlistener/connectionlistener.go, line 34 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
I think the hopCaches for yesterday and today should be managed here. There should be a single hopcache reference, which should be replaced with a new empty HopCache at the time of the first event after midnight. The HopCache should include a date field (or method) that indicates what date it covers.

After reading through your comments I realized that the "sanity check" code that I had added during debugging has caused confusion.

There is technically no need to keep yesterday's cache. It can be orphaned by Clear() at midnight and garbage collected.

I have now removed oldHops and setState().

SaiedKazemi commented 3 years ago

connectionlistener/connectionlistener.go, line 84 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
At this point, grab a reference to the HopCache. If time.Now is not the same date as the HopCache, then replace cl.HopCache with a new one. hopCache := cl.GetTodaysHopCache() // thread safe - replaces HopCache with new one if the date has just changed.

Per my earlier comment, there's no need to do this.

SaiedKazemi commented 3 years ago

connectionlistener/connectionlistener.go, line 100 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
At this point, we can determine whether to use yesterday's HopCache, or today's. We can now check if the value of priorHopCache is correct, or not. if output.CycleStart.StartTime different date than priorHopCache { hopCache = cl.GetTodaysHopCache() // Which should be the new one. }

Per my earlier comment, there is no need to do this.

SaiedKazemi commented 3 years ago

connectionlistener/connectionlistener.go, line 105 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Also check for non-zero length?

If err is nil, annotations is not zero length. Even if it is zero length, it doesn't cause any issues. But I added the check for non-zero length as you have suggested.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 46 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
I like these states, but it would be more useful to have metrics than to keep these here. Both is fine, of course.

These for sanity checks. I have removed them and added metrics instead.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 74 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
This structure makes it hard to reason about concurrency, and makes the code complicated. Instead, HopCache should handle only one map, and there should be more than one HopCache. Ideally, yesterday's HopCache should only have references from traces in flight, and once all the traces complete and the hops are annotated, it could be garbage collected. However, since we don't know until we get the traceroute json back what the cyclestart time will be, it is hard to hold a reference to the correct cache. The traceroute might be launched at 23:59.99999, and cyclestart might be 00:00.000001. Thinking out loud: If we provide Trace with an indirect reference to the currentHopCache (with a Get method) and a direct reference to the recentHopCache that is determined prior to calling Trace, then trace can determine which one to use based on cyclestart time. Most of the time, the currentHopCache.Get() will return the same value as recentHopCache, but near midnight they might be different. Then Trace uses the cyclestart time to determine which cache should be used for the hop annotations.

Per my earlier comments, I have removed oldHops and entryState. The cache is now map[string]bool.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 76 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Does this need to be maintained, or can the old hop map just be orphaned? If we have to keep it around, please document why.

There's no oldHops anymore.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 80 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
The time checking can be done by every call to the cache.

I have reworked the code after removing sanity checks.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 116 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Should probably use a swap rather than a load, to completely avoid races.

As the comment says, this is solely needed for better code coverage via package testing and can be removed if it's an issue.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 168 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Please add a metric. I'm inclined to think we need to be better about adding metrics before pushing to prod. Since this is a candidate for prod, it should have prom metrics.

Done.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 249 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
It would be much better if the state handling is done by the handler using a reference on the entry. It shouldn't be necessary to access this through the cache.

All of this is gone.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 269 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
This is unnecessarily complicated. old vs new map should be determined based on cyclestart time, not by searching both caches. It would be best if the annotateHop function worked on a single cache, either yesterdays or todays. I'll add comments elsewhere to clarify this.

All of it is gone.

SaiedKazemi commented 3 years ago

PTAL.

SaiedKazemi commented 3 years ago

I can respond early.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 51 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
BTW, I'm a fan of good labels, but IIUC, they add to the metrics footprint multiplicatively, so it is good to be cautious about the dimensionality. In other words, will we get good value from both labels here? I've added labels in etl pipeline metrics that I later ended up regretting.

This is my first experience with Prometheus metrics, so let's discuss interactively when we meet.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 64 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
You treat this elsewhere as a time.Duration, which is in nanoseconds.

Made it more clear.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 108 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Probably better to pass tickerDuration into the goroutine as a parameter. I think that avoids the need for atomic access. (Probably moot due to design flaw).

It's not moot because package testing code needs to work with much shorter intervals in order to avoid long waits for midnight. I am now passing it in.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 109 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Really confusing to treat a Duration as something other than a nanosecond time.Duration value. Please do the scaling where you create the Duration above, and rename tickerIntervalMilliseconds. (Probably moot due to design flaw).

It's not moot. Per my earlier comment, I made it clear.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 116 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Is this backwards? Seems like it should clear the cache if time.Now() > BTW, I've had some success testing this kind of code using monkey patch to patch the time.Now function.

No, it's not backwards. Hours go from 0 to 23 and then back to 0 again. Therefore, when the current hour is less than the previous hour, we have passed midnight.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 146 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Since this list of errors will be returned to caller, and method is exported, the error types should also be exported.

Done.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 156 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
I think there is a fundamental race between this loop, and the goroutine that replaces the map. It will occasionally result in some of the hops being registered in yesterday's map, and others being registered in today's map. If this results in duplicate entries, it is no big deal, but I think it can also result in a hop from yesterday's trace being recorded in today's map, so there won't be a hop recorded in the correct date partition, and the JOIN will later fail in the ETL pipeline. So, I think you want to replace the map on demand, in this Annotate method, when the trace cyclestart time does not correspond to the date for the map. This will then result in the same problem I mentioned in previous comment

If the cache is reset while we are iterating this loop, it means that midnight has passed. Therefore, the remaining hops in the hops slice will be annotated which is exactly the behavior we want.

I will add a comment explaining this in the code.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 156 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Probably better to hold the lock outside the loop.

We don't have to lock the map for the whole duration. Please see my comment below.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 116 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Ok - had to think about what you meant for a bit. Now I'm curious why there is a race in testing. It seems like the only goroutine that should access this field is the go func here. So, IIUC, 'hour' could be a local variable within this function.

You've mentioned in a future comment that it's clear how hour is used to fake passage of midnight.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation_test.go, line 31 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
But this is making my head explode a little, because I've always thought of the tests in a table as independent, rather than as a sequence of things that can alter state. I think I like it though. However, the use of the hop as a test directive seems a bit of a stretch. Instead, I'd suggest adding a boolean 'clear_cache' indicating whether the hopcache should be cleared before proceeding. BTW: https://dave.cheney.net/2019/05/07/prefer-table-driven-tests suggests using a map from name to test struct, which I quite like. (I'm still trying to adjust to table-driven being the preferred structure).

Done.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 116 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Just put this in a comment in the code or at the field definition.

Done.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 156 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
To be clear, this is a nice to have - not required.

Done.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 156 at r2 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
We discussed this in further detail on the VC.

Made the hop cache time-aware, so we should never miss a hop annotation for today even if the traceroute ran at 23:59:59 and finished after midnight.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 170 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
So, IIUC, this means that should yesterday's hops sneak into today's cache, they won't matter - they will be ignored when new traces with today's timestamp are checked. Seems like this should address the issue. TNHIWHDI Thanks!

What does TNHIWHDI mean?!

SaiedKazemi commented 3 years ago

hopannotation/hopannotation_test.go, line 108 at r1 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Ugh. Now I see the race problem. This would be much better written with monkey patch. Then you can control the progression of time explicitly, or implicitly. There are some subtleties, but it would avoid having to leak 'hour' outside the module, and would make the test code much simpler. While testability often makes code more complicated, I'd really like to avoid additional complexity just to support testing, whenever possible. I'm not crazy about monkey patch, but it has the advantage of keeping the complexity within the test code, and avoiding mucking up the production code. Again, I'd be ok adding a TODO and issue, and fixing this later, particularly as it is intertwined with the hopcache race condition we talked about on the VC call earlier.

Ack'ed. I have added a TODO to use money patch in a later PR.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 111 at r3 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Can this parameter be a time.Duration?

Done.

SaiedKazemi commented 3 years ago

Thanks for your time and review.

SaiedKazemi commented 3 years ago

hopannotation/hopannotation.go, line 170 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
That's not how I would have done it. I think it is from Extreme Programming. I didn't mean to leave that in - got distracted and sent by mistake.

LOL :). I googled it, but nothing came up!