internet-equity / netrics

The network measurements scheduling framework

built-in measurements #3

Open jesteria opened 2 years ago

jesteria commented 2 years ago

As a user of Netrics – either a "legacy" user or "new" – I expect the software to ship with a collection of built-in measurements (at least those available in the previous software). As such, existing measurement code, of potential use to us or to others, must be ported to the new Netrics framework.

This work is largely one of deletion rather than addition – measurements should be boiled down to their essential functionality, (without retry logic, etc.).


Under the Netrics framework, all measurements are simple executables.

Builtins will be installed with names prefixed by netrics-, e.g. netrics-ping. This convention will make their purpose clear and have the effect of "bundling" them despite their separate existence; these names will also be given special treatment in configuration: references to the measurement ping will imply the executable netrics-ping, whether or not netrics-ping is a built-in.

However, builtins needn't be named this way in the codebase – (it is trivial to apply this convention upon installation) – within the codebase, ping is fine.

The "ideal" or most simple measurement is a standalone script: Python or otherwise. (It must handle its own execution via shebang, etc.). Installation will copy/link each built-in measurement script to the appropriate path and under the appropriate name.

That said, Python-based measurements are free to be more sophisticated than standalone scripts. (So long as there is a module with a single function to invoke without arguments – e.g. main() – then installation may trivially construct a script to invoke it.) This might allow our builtins to share code (to be "DRY"), etc. Such a package of built-in measurements might be contained within the netrics package (at src/netrics/measurement/builtin/ or otherwise). (How this looks remains to be seen and is up to the implementer.)
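
(For instance – assuming, purely for illustration, a module netrics.measurement.builtin.ping exposing main() – the installed netrics-ping script could be generated as little more than:)

#!/usr/bin/env python3
from netrics.measurement.builtin import ping

ping.main()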

Upon execution, the measurement's contract with the framework is as follows:

(I'm really split on the best data format for this. JSON input and output make sense; and, if it's JSON, then you'd just: json.load(sys.stdin). Obviously this is more natural if the main measurements config is itself JSON or YAML. But, I think TOML is friendlier than YAML for such files. Regardless, the framework can send the measurement config in whatever format, and even let it be configurable. To the framework code, there's nothing exceptional about reading TOML and writing JSON; though, to the human beings involved on either end, it might be a little odd.)
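
(Sketching the JSON flavor of that stdin/stdout exchange – the field names below are invented purely for illustration – a measurement's I/O could be as bare as:)

import json
import sys

params = json.load(sys.stdin)    # measurement configuration, sent by the framework

# ... perform the measurement here ...
results = {'target': params.get('target')}    # hypothetical result structure

json.dump(results, sys.stdout)   # structured results back to the framework
sys.exit(0)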

kyle-macmillan commented 2 years ago

I will work on porting the current Netrics tests to individual Python executables. I've ported the basic ping command (need to fix how it loads the custom command; the config is currently hard-coded). The Python executable can be found under /src/netrics/measurements/builtin/ as netrics-ping.py. @jesteria can you take a look at it and let me know if this is roughly what you had in mind? Thanks.

kyle-macmillan commented 2 years ago

The following list of tests should be included in the /builtin/ (or stdlib) measurements:

  • netrics-ping (simple ping to specified destinations)
  • netrics-traceroute (traceroute to specified destinations)
  • netrics-ookla (runs Ookla speed test)
  • netrics-ndt7 (runs NDT7 speed test)
  • netrics-lml (measures the last mile latency)
  • netrics-dns-latency (measures DNS response time)
  • netrics-devs (counts the number of devices)

Many of these will likely get their own issue as we develop / port them from Netrics v1.

jesteria commented 2 years ago

Awesome 😎


jesteria commented 2 years ago

@kyle-macmillan – Yes looks totally reasonable.

Brief comments on the ping module that might be helpful:

jesteria commented 2 years ago

(Re: retries: Perhaps we'll want to require measurements to explicitly request a retry rather than doing it by default. That likely makes more sense, and better matches what we were doing before. I'm not sure ping has any need of retries. But e.g. a speedtest might get a "busy" response, and request an immediate-to-soon retry. Say, by default – sys.exit(42) – and the framework will put the measurement into the retry queue. This could also be made configurable for any measurement with the setting retry_on: [42].)
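
(From the measurement's side, that could be as little as the following sketch – the exit status 42 and the "busy" condition are just the examples floated above, not a settled interface:)

import sys

RETRY_STATUS = 42  # the exit code proposed above to request a retry


def request_retry_if_busy(server_busy: bool) -> None:
    """Exit with the retry status if the upstream test server reported "busy"."""
    if server_busy:
        sys.exit(RETRY_STATUS)

The framework would then map that status back onto its retry queue, e.g. via the retry_on: [42] setting mentioned above.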

kyle-macmillan commented 2 years ago

Thanks for this great feedback! I will look into your suggestions; I basically agree with you on all points. A few comments/questions:

jesteria commented 2 years ago

Cool 😎

> I think parallelizing ping should be totally fine. I can work on that as well. I have used the multiprocessing module before. Do you have any suggestions?

So long as we're doing a subprocess of the ping program (rather than a Python process-internal implementation), the multiprocessing module isn't really necessary, (tho it certainly can be used) – the "multiple processes" are the parallel pings we invoke; they don't each need their own copy of our module program to manage them. (For that matter we could use a thread pool; but, I'm willing to bet we don't need it here. Indeed we just have to manage a subprocess pool.)

(I will say, I'd be curious how much faster this sort of implementation actually makes the module … 🤔)

Anyway, keep subprocess.run in mind! But if we're parallelizing here then we might go lower-level (kind of pseudo-code but which might nearly work):

import itertools
import subprocess

MAX_PROCESSES = 5  # maximum concurrency (placeholder value)

pings_complete = []  # IFF we want a cumulative rather than iterative report

# batch arguments by max concurrency
# (iterator so we don't have to track indices, etc.)
target_stream = iter(targets)  # from config or whatever

while processing := {
    subprocess.Popen(
        ['ping', target],  # …plus whatever ping options
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    for target in itertools.islice(target_stream, MAX_PROCESSES)
}:
    while processing:
        # wait for (at least) one to complete
        proc0 = next(iter(processing))
        proc0.wait()

        # collect everything that has finished by now
        complete = {proc for proc in processing if proc.poll() is not None}

        processing -= complete

        # if reporting individually, we could just write to stdout here;
        # if instead it's a single cumulative report, we'll collect results
        # here and write them out later.
        pings_complete.extend(complete)

> Want to make sure I have these things straight: Upon completion, the module prints results to stdout and errors to stderr (both as JSON objects) as well as exiting with the appropriate exit code.

Right. (Neither has to be JSON but that makes sense as a default. And writes can happen at any point during execution but at the end is fine.)

So precisely how the framework handles these module outputs and exit codes is not set in stone. But obviously this will inform how modules are written. My current thinking is:

• Exit code zero: The module ran without error and has meaningful results to publish. Publish those results written to stdout. (Anything written to stderr should almost certainly be handled as well – sent wherever that will go. This might include warnings or other non-erroneous informational logging for the user.) Any subprocess, such as ping, may have exited with some non-zero code; but apparently the module interpreted this as a meaningful result – say the target host is unavailable – (and ping's exit code is none of the framework's business).

• Exit code 42 (or something): The module requests to be rerun according to whatever logic we set up for that. (Likely a configurable cadence.)

• Exit code non-zero (not 42): The module encountered an error (and should not be retried except as otherwise scheduled). Maybe a subprocess had a real error that we need to flag. Maybe the module threw an uncaught exception (which I believe will generally cause it to exit with code 1).

What I'm unsure of is what to do with results printed to stdout in the event of a non-zero exit code. This might just be a bit nuanced at this point (when we're still drafting the framework). My initial thinking was that non-zero means "we're screwed" and we shouldn't even trust any results written to stdout. But I wonder if that should be left up to the module – any results it wrote were real results, and that's why it wrote them – it just so happens that it also had a real error that needs special reporting. (And perhaps the latter contract takes better advantage of the fact that we have these three degrees of freedom.) But I am curious what others might think and how this will play out. (And certainly this is something else that can be made configurable; though we at least want a reasonable default.)
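
(To make that three-way contract concrete, here's a rough framework-side sketch. The function names, logger name, and the choice to still persist stdout on error are all assumptions, the last being exactly the open question above:)

import logging
import subprocess

RETRY_STATUS = 42  # the retry code assumed above

log = logging.getLogger('netrics')  # logger name is an assumption


def persist(stdout: str) -> None:
    """Placeholder for however results ultimately get stored."""
    print(stdout)


def handle_completion(proc: subprocess.CompletedProcess, retry_queue: list) -> None:
    """Dispatch on a measurement's exit status per the (draft) contract above."""
    if proc.returncode == 0:
        log.info('%s succeeded', proc.args)
        persist(proc.stdout)
    elif proc.returncode == RETRY_STATUS:
        log.warning('%s requested retry', proc.args)
        retry_queue.append(proc.args)
    else:
        log.error('%s failed with status %s: %.100s', proc.args, proc.returncode, proc.stderr)
        if proc.stdout:
            persist(proc.stdout)  # the open question: trust results despite the error?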

As for that "special reporting" – particularly if we are allowing stderr to be used as a general logging channel – I wonder what we might want to have, say:

• framework-level log file (defaults to ERROR only)
  • receives DEBUG messages of "I will run this thing"
  • … INFO messages of "this thing completed successfully with this (abbreviated) stdout and this stderr"
  • … (perhaps WARN messages regarding retries)
  • … ERROR messages of "this thing failed with this non-zero exit code, this (abbreviated) stdout and this stderr"

• stderr log file: basically just what modules write to stderr. Perhaps use logging (built-in or otherwise) with a configurable, default level-less signature like "{datetime} {module} {message}".

…with the two of course configurable/disableable, (paths or /dev/null, and log level filters); (and using a rotating logging handler because we're nice folks).
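
(A minimal sketch of that pair of logs with the standard-library logging module – the file paths, size limits, and logger names here are placeholders:)

import logging
import logging.handlers

# Framework-level log: rotating, ERROR-only by default.
framework_log = logging.getLogger('netrics.framework')
framework_handler = logging.handlers.RotatingFileHandler(
    'netrics.log', maxBytes=1_000_000, backupCount=5)
framework_handler.setLevel(logging.ERROR)
framework_log.addHandler(framework_handler)

# Module-stderr log: level-less "{datetime} {module} {message}" signature.
stderr_log = logging.getLogger('netrics.measurements')
stderr_handler = logging.handlers.RotatingFileHandler(
    'measurements.log', maxBytes=1_000_000, backupCount=5)
stderr_handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(message)s'))
stderr_log.addHandler(stderr_handler)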

And since I'm getting into log file formats, I'll say that these don't extend to the stdout results. Rather, for the stdout results we should probably try to respect their serialization in an individualized measurement results file – if we receive JSON then we write a .json file somewhere for the measurement result – just as we do now. (Really we should have a default option to .gz these as well.)
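
(Something along these lines, for instance – the file naming is invented for illustration:)

import gzip
import json


def persist_result(name: str, result: dict, compress: bool = True) -> None:
    """Write a single measurement result to its own JSON file, optionally gzipped."""
    path = f'{name}.json.gz' if compress else f'{name}.json'
    opener = gzip.open if compress else open
    with opener(path, 'wt') as fd:
        json.dump(result, fd)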

And we could do the same for what's written to stderr. I'm split on what would be the most useful. My impression is that what's written there is less likely to end up inserted into anything like a time-series database. Rather, these are really standard logs – which might end up in a fancier store (even a structured store) – but which belong in this sort of rotated file format by default.

Even that needn't mean throwing out the useful structure of JSON written to stderr. We would likely want to again detect this serialization and reproduce it, though perhaps print something more compact and more legible to a log file:

2022-05-05T01:23:01 ping {flux_capacitor=true warp_speed=8 target="8.8.8.8"}

(I believe the above message format could nearly be achieved with something like json.dumps(obj, separators=(' ', '=')). Not sure how to omit the superfluous quoting of the keys. Indeed I may have just written compact TOML 😅)
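
(For what it's worth, a couple of lines gets pretty close – this helper is just an illustration, not anything the framework has:)

import json


def compact(obj: dict) -> str:
    """Render {'flux_capacitor': True} as {flux_capacitor=true} for the log line above."""
    return '{%s}' % ' '.join(f'{key}={json.dumps(value)}' for key, value in obj.items())

# compact({'flux_capacitor': True, 'warp_speed': 8, 'target': '8.8.8.8'})
# -> '{flux_capacitor=true warp_speed=8 target="8.8.8.8"}'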

…Or something. …If that helps 😉

kyle-macmillan commented 2 years ago

Thanks for the feedback, I've opened a separate issue (#5) for netrics-ping discussion.

What exit code should be raised for measurements like netrics-ping and netrics-traceroute that effectively run multiple tests? One idea is that netrics-ping should ping a single destination and not take a list of destinations to ping. This effectively shifts responsibility for concurrency to the controller...

jesteria commented 2 years ago

Right:

Presumably, the results for all successful sub-tests will be written to stdout and persisted. (This is up in the air, but it seems like that's the direction we're headed, regardless of exit code.) But if the exit code is non-zero, the framework will also flag this exceptional result.


Separately, indeed, we could add support for concurrency to the controller :smile: I dunno! Should we?

(We might have a configuration key like parameters: […] and an optional key like concurrency: N? …For compatibility with generic measurement configuration, the parameters value could just be merged into the measurement configuration, and the framework would look for this.)

jesteria commented 2 years ago

I guess one important question regarding the possibility of pushing concurrency onto the framework: how should the results of concurrently-executed, parameterized tests be persisted?

If they're simply persisted as unrelated tests run around the same microsecond, I'm not sure users would be happy. (Though I'm not sure – would that suffice for you?)

On the other hand, is there a generic way for results to be joined that would make everyone happy? (And I say everyone because this might be a tough thing to make configurable.)

Say with the following measurements configuration:

ping:
  parameters:
    - google.com
    - wikipedia.org
  concurrency: 2

For the above, the framework might run the netrics-ping command concurrently for the arguments google.com and wikipedia.org, perhaps by setting their stdin to something like (for example with google.com):

{
  "parameter": "google.com",
  "parameters": [
    "google.com",
    "wikipedia.org"
  ]
}

And then perhaps the result persisted for the collection of test executions could be generically:

{
  "google.com": {
    …
  },
  "wikipedia.org": {
    …
  }
}

That might well work and it might well be worthwhile. But I'm unsure that it's sufficiently generic for all conceivable modules.


Another possibility perhaps, rather than asking people to somehow write a proper reducer function into their configuration, would be to again leverage Jinja templating, to construct the results (e.g. to recreate the above proposed default):

ping:
  persist: |
    {
      {% for name, result in results %}
      "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
      {% endfor %}
    }

(…But doesn't templating out JSON kind of stink?)
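
(For what it's worth, rendering such a template isn't much code – this sketch assumes the jinja2 library and that results arrive as (name, already-serialized JSON) pairs:)

import jinja2

TEMPLATE = '''\
{
  {% for name, result in results %}
  "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
  {% endfor %}
}
'''

# hypothetical results, each value already serialized as JSON
results = [
    ('google.com', '{"rtt_ms": 12.3}'),
    ('wikipedia.org', '{"rtt_ms": 45.6}'),
]

print(jinja2.Template(TEMPLATE).render(results=results))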