jesteria opened 2 years ago
I will work on porting the current netrics tests to individual Python executables. I've ported the basic ping command (I still need to fix how it loads custom configuration; currently the config is hard-coded). The Python executable can be found under /src/netrics/measurements/builtin/ as netrics-ping.py. @jesteria can you take a look at it and let me know if this is roughly what you had in mind? Thanks.
The following list of tests should be included in the /builtin/ (or stdlib) measurements:

- netrics-ping (simple ping to specified destinations)
- netrics-traceroute (traceroute to specified destinations)
- netrics-ookla (runs Ookla speed test)
- netrics-ndt7 (runs NDT7 speed test)
- netrics-lml (measures the last mile latency)
- netrics-dns-latency (measures DNS response time)
- netrics-devs (counts the number of devices)

Many of these will likely get their own issue as we develop / port them from Netrics v1.
Awesome 😎
@kyle-macmillan – Yes looks totally reasonable.
Brief comments on the ping module that might be helpful:
[ref] I'd suggest looking at subprocess.run. You could perhaps even eliminate this helper (since it's doing essentially the same thing). This would also grant you a higher-level API, complete with a CompletedProcess return object and exceptions for command failures with CalledProcessError. (This might even satisfy what you were hoping to have vis-a-vis exceptions underlying the ultimate communication with the framework via exit codes.) And it just might be easier (at least when reproduced across all the modules).
[ref] Not immediately important, but these modules will want to at least consider the case that the configuration is unspecified, missing, or malformed. I'd suggest, at least, and if appropriate to the module, having a set of defaults. This can be added very simply:
```python
PARAM_DEFAULTS = {…}

…

def main():
    params = dict(PARAM_DEFAULTS, **json.load(sys.stdin))
```
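For example, a more defensive variant might look like this (a sketch only – the default keys shown here are illustrative, not settled Netrics parameters):

```python
import json
import sys

PARAM_DEFAULTS = {'count': 10, 'targets': ['google.com']}  # illustrative defaults


def load_params():
    """Merge stdin parameters over defaults, tolerating empty or malformed input."""
    text = sys.stdin.read().strip()
    if not text:
        return dict(PARAM_DEFAULTS)

    try:
        supplied = json.loads(text)
    except json.JSONDecodeError:
        print('warning: malformed configuration ignored', file=sys.stderr)
        return dict(PARAM_DEFAULTS)

    return dict(PARAM_DEFAULTS, **supplied)
```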
[ref] Certainly no big deal, but if you specify this as an argument list (list or simply tuple) then there's no need for the subprocess to launch a shell – (you're not scripting but just executing ping with those arguments). So say ('ping', '-i', …).
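For instance (the particular flags and target here are just illustrative):

```python
import subprocess

# argument sequence: no shell is launched, and no quoting to worry about
subprocess.run(('ping', '-c', '3', '8.8.8.8'), capture_output=True)

# versus a shell string, which spawns /bin/sh just to re-parse the command
subprocess.run('ping -c 3 8.8.8.8', shell=True, capture_output=True)
```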
[ref] Nothing wrong with debugging output. Just keep in mind that ultimately, any such thing will need to be printed to stderr. Even with print this is trivial: print(…, file=sys.stderr).
[ref] Nothing bad here. Just keep in mind that on error we want to communicate it via exit code and whatever combination of stdout and stderr makes sense. I understand that here you're running ping multiple times and so perhaps you don't want to error out when ping fails once; though, I'm curious if an exit code of 2 can be a network-level error. (Perhaps it can/should be treated as a fatal error?) Anyway, with the higher-level subprocess API this could read a bit differently:
```python
try:
    ping_result = subprocess.run(['ping', '-i', …], capture_output=True, check=True)
except subprocess.CalledProcessError:
    …
```
And I'll keep the above for reference, though admittedly in the case of ping error codes are more meaningful than with other commands, so we might instead "check" it ourselves:
```python
ping_result = subprocess.run(['ping', '-i', …], capture_output=True)

if ping_result.returncode == 2:
    # we're done d00d!
    # tho we captured it in this case let's pass on the stderr ... maybe dress it up a little
    print("command 'ping' failed with exit code 2 and stderr <", ping_result.stderr, ">", file=sys.stderr)
    # and let's communicate the failure via exit code ... say with ping's exit code:
    sys.exit(2)

# otherwise let's send back our ping results

# this could certainly be much the same as is, though I wouldn't strictly call this an error
# handler or an error that we're reporting. It just looks like a friendly message we're
# sticking in the results JSON.
res[dst]['msg'] = error_handler(ping_result.returncode, ping_result.stderr)

# Extract results from ping output
res[dst]['results'] = parse_ping_output(ping_result.returncode, ping_result.stdout)
```
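And for reference, a hedged sketch of what a parse_ping_output helper might do against iputils-style ping output (the field names here are just a suggestion):

```python
import re


def parse_ping_output(returncode, stdout):
    """Extract packet loss and RTT statistics from (iputils-style) ping output."""
    text = stdout.decode() if isinstance(stdout, bytes) else stdout
    results = {'returncode': returncode}

    loss = re.search(r'([\d.]+)% packet loss', text)
    if loss:
        results['packet_loss_pct'] = float(loss.group(1))

    rtt = re.search(r'rtt [\w/]+ = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms', text)
    if rtt:
        (results['rtt_min_ms'], results['rtt_avg_ms'],
         results['rtt_max_ms'], results['rtt_mdev_ms']) = map(float, rtt.groups())

    return results
```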
Do bear in mind that these don't have to be entirely stand-alone scripts. If you find that there are useful utility functions you want across them, etc., they can import from the netrics package, (or even be written as modules in the package – so long as they expose some main function – as here, just no need in that case for the top-level if __name__ == '__main__'). But it would be good to establish what really is non-trivial shared functionality or boilerplate before deciding to go that route. (I can imagine some standard helper for reporting errors perhaps? Or even via a logging module. But I'm not really sure.)
I really don't have any preference for how results look, errors look, etc. Here you have errors in the proper results that are collected by the system. I'm suggesting an alternative since these aren't measurements; but if there's a desire for these to be included as "null results" that's certainly fine. (Though of course if we're retrying by default I'm not sure the null result is useful. In that case perhaps we really do want to consider what a fatal error that should be retried is … but I'm not yet sure.) We could also structure errors sent to stderr in much the same way (as JSON). Currently I believe we only collect results. Perhaps we should collect errors separately … or perhaps that's not useful 🤷
Final thought, not important, and back to ping specifically – any reason (not) to parallelize the pings? (This is more code but not difficult.) Just a question of whether this would be a problem for the measurements and whether it would be useful.
(Re: retries: Perhaps we'll want to require measurements to explicitly request a retry rather than doing it by default. That likely makes more sense, and better matches what we were doing before. I'm not sure ping has any need of retries. But e.g. a speedtest might get a "busy" response, and request an immediate-to-soon retry. Say, by default – sys.exit(42) – and the framework will put the measurement into the retry queue. This could also be made configurable for any measurement with the setting retry_on: [42].)
Thanks for this great feedback! I will look into your suggestions / I basically agree with you on all points. A few comments/questions:
Cool 😎
I think parallelizing ping should be totally fine. I can work on that as well. I have used the multiprocessing module before. Do you have any suggestions?
So long as we're doing a subprocess of the ping program (rather than a Python process-internal implementation), the multiprocessing module isn't really necessary, (tho it certainly can be used) – the "multiple processes" are the parallel pings we invoke; they don't each need their own copy of our module program to manage them. (For that matter we could use a thread pool; but, I'm willing to bet we don't need it here. Indeed we just have to manage a subprocess pool.)
(I will say, I'd be curious how much faster this sort of implementation actually makes the module … 🤔)
Anyway, keep subprocess.run in mind! But if we're parallelizing here then we might go lower-level (kind of pseudo-code but which might nearly work):
```python
import itertools
import subprocess

MAX_PROCESSES = 5  # maximum concurrent pings (placeholder)

pings_complete = []  # IFF we want a cumulative rather than iterative report

# batch arguments by max concurrency
# (iterator so we don't have to track indices, etc.)
target_stream = iter(targets)  # from config or whatever

while processing := {
    subprocess.Popen(['ping', …, target], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for target in itertools.islice(target_stream, MAX_PROCESSES)
}:
    while processing:
        # wait for at least one to complete
        proc0 = next(iter(processing))
        proc0.wait()

        complete = {proc for proc in processing if proc.poll() is not None}
        processing -= complete

        # if individualized reporting we could just write to stdout here; if instead it's a
        # single cumulative report we'll collect results here and write them out later.
        pings_complete.extend(complete)
```
Want to make sure I have these things straight: Upon completion, the module prints results to stdout and errors to stderr (both as JSON objects) as well as exiting with the appropriate exit code.
Right. (Neither has to be JSON but that makes sense as a default. And writes can happen at any point during execution but at the end is fine.)
…
So precisely how the framework handles these module outputs and exit codes is not set in stone. But obviously this will inform how modules are written. My current thinking is:
• Exit code zero: The module ran without error and has meaningful results to publish. Publish those results written to stdout. (Anything written to stderr should almost certainly be handled as well – sent wherever that will go. This might include warnings or other non-erroneous informational logging for the user.) Any subprocess, such as ping, may have exited with some non-zero code; but apparently the module interpreted this as a meaningful result – say the target host is unavailable – (and ping's exit code is none of the framework's business).
• Exit code 42 (or something): The module requests to be rerun according to whatever logic we set up for that. (Likely a configurable cadence.)
• Exit code non-zero (not 42): The module encountered an error (and should not be retried except as otherwise scheduled). Maybe a subprocess had a real error that we need to flag. Maybe the module threw an uncaught exception (which I believe will generally cause it to exit with code 1).
What I'm unsure of is what to do with results printed to stdout in the event of a non-zero exit code. This might just be a bit nuanced at this point (when we're still drafting the framework). My initial thinking was that non-zero means "we're screwed" and we shouldn't even trust any results written to stdout. But I wonder if that should be left up to the module – any results it wrote were real results, and that's why it wrote them – it just so happens that it also had a real error that needs special reporting. (And perhaps the latter contract takes better advantage of the fact that we have these three degrees of freedom.) But I am curious what others might think and how this will play out. (And certainly this is something else that can be made configurable; though we at least want a reasonable default.)
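To make that concrete, the framework-side dispatch might read roughly like this (a sketch only – the retry code and the handler names are placeholders, not settled API):

```python
import subprocess

RETRY_CODE = 42  # placeholder; the actual retry code is still to be decided


def run_measurement(command, config_text):
    proc = subprocess.run(command, input=config_text, capture_output=True, text=True)

    if proc.stderr:
        log_module_stderr(command, proc.stderr)   # hypothetical logging hook

    if proc.returncode == 0:
        publish_results(command, proc.stdout)     # hypothetical persistence hook
    elif proc.returncode == RETRY_CODE:
        schedule_retry(command, config_text)      # hypothetical retry queue
    else:
        record_error(command, proc)               # hypothetical error reporting
        # open question: whether to also publish whatever landed on stdout
```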
…
As for that "special reporting" – particularly if we are allowing stderr to be used as a general logging channel – I wonder what we might want to have, say:
• framework-level log file (defaults to ERROR only)
  • receives DEBUG messages of "I will run this thing"
  • … INFO messages of "this thing completed successfully with this (abbreviated) stdout and this stderr"
  • … (perhaps WARN messages regarding retries)
  • … ERROR messages of "this thing failed with this non-zero exit code, this (abbreviated) stdout and this stderr"
• stderr log file: basically just what modules write to stderr. perhaps use logging (built-in or otherwise) with a configurable, default level-less signature like "{datetime} {module} {message}".
…with the two of course configurable/disableable, (paths or /dev/null, and log level filters); (and using a rotating logging handler because we're nice folks).
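Something along these lines, for instance (a sketch – the paths, sizes, and format are made up and would be configurable):

```python
import logging
import logging.handlers


def make_logger(name, path, level=logging.ERROR):
    """Return a logger writing to a size-rotated file."""
    handler = logging.handlers.RotatingFileHandler(path, maxBytes=5_000_000, backupCount=5)
    handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(message)s'))
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger


framework_log = make_logger('netrics', '/var/log/netrics/framework.log')  # ERROR by default
module_stderr_log = make_logger('netrics.modules', '/var/log/netrics/modules.log', logging.DEBUG)
```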
And since I'm getting into log file formats, I'll say that these don't extend to the stdout results. Rather, for the stdout results we should probably try to respect their serialization in an individualized measurement results file – if we receive JSON then we write a .json file somewhere for the measurement result – just as we do now. (Really we should have a default option to .gz these as well.)
And we could do the same for what's written to stderr. I'm split on what would be the most useful. My impression is that what's written there is less likely to end up inserted into anything like a time-series database. Rather, these are really standard logs – which might end up in a fancier store (even a structured store) – but which belong in this sort of rotated file format by default.
Even that needn't mean throwing out the useful structure of JSON written to stderr. We would likely want to again detect this serialization and reproduce it, though perhaps printing something more compact and more legible to a log file:
2022-05-05T01:23:01 ping {flux_capacitor=true warp_speed=8 target="8.8.8.8"}
(I believe the above message format could nearly be achieved with something like json.dumps(obj, separators=(' ', '=')). Not sure how to omit the superfluous quoting of the keys. Indeed I may have just written compact TOML 😅)
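One way to get that key=value look without the quoted keys (a small sketch using the example fields above):

```python
import json


def compact_kv(obj):
    """Render a flat dict as key=value pairs, JSON-encoding only the values."""
    return '{%s}' % ' '.join(f'{key}={json.dumps(value)}' for key, value in obj.items())


# compact_kv({'flux_capacitor': True, 'warp_speed': 8, 'target': '8.8.8.8'})
# -> '{flux_capacitor=true warp_speed=8 target="8.8.8.8"}'
```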
…Or something. …If that helps 😉
Thanks for the feedback, I've opened a separate issue (#5) for netrics-ping discussion.
What exit code should be raised for measurements like netrics-ping and netrics-traceroute that effectively run multiple tests? One idea is that netrics-ping should ping a single destination and not take a list of destinations to ping. This effectively shifts responsibility for concurrency to the controller...
Right:

- for success, 0 makes sense
- to request a retry, the designated retry code (42)
- and for an error, 1.

Presumably, the results for all succeeding sub-tests will be written to stdout, and persisted. (This is up in the air but it seems like that's the direction we're headed, regardless of exit code.) But if the exit code is non-zero, the framework will also flag this exceptional result.
Separately, indeed, we could add support for concurrency to the controller :smile: I dunno! Should we?
(We might have a configuration key like parameters: […] and an optional key like concurrency: N? …For compatibility with generic measurement configuration, the parameters value could just be merged into the measurement configuration and the framework would look for this.)
I guess one important question regarding the possibility of pushing concurrency onto the framework: how should the results of concurrently-executed, parameterized tests be persisted?
If they're simply persisted as unrelated tests run around the same microsecond, I'm not sure users would be happy. (Though I'm not sure – would that suffice for you?)
On the other hand, is there a generic way for results to be joined that would make everyone happy? (And I say everyone because this might be a tough thing to make configurable.)
Say with the following measurements configuration:
```yaml
ping:
  parameters:
    - google.com
    - wikipedia.org
  concurrency: 2
```
For the above, the framework might run the netrics-ping command concurrently for the arguments google.com and wikipedia.org, perhaps by setting their stdin to something like (for example with google.com):
```json
{
  "parameter": "google.com",
  "parameters": [
    "google.com",
    "wikipedia.org"
  ]
}
```
And then perhaps the result persisted for the collection of test executions could be generically:
```json
{
  "google.com": {
    …
  },
  "wikipedia.org": {
    …
  }
}
```
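In framework terms that generic join could be as simple as keying each run's stdout by its parameter – a sketch, assuming the framework tracks finished processes per parameter:

```python
import json


def join_results(completed):
    """Combine per-parameter runs into one result object, keyed by parameter.

    `completed` is assumed to map parameter -> finished process
    (e.g. {'google.com': proc, 'wikipedia.org': proc}) whose stdout holds JSON.
    """
    return {
        parameter: json.loads(proc.stdout)
        for (parameter, proc) in completed.items()
        if proc.returncode == 0
    }
```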
That might well work and it might well be worthwhile. But I'm unsure that it's sufficiently generic for all conceivable modules.
Another possibility perhaps, rather than asking people to somehow write a proper reducer function into their configuration, would be to again leverage Jinja templating, to construct the results (e.g. to recreate the above proposed default):
```yaml
ping:
  persist: |
    {
      {% for name, result in results %}
      "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
      {% endfor %}
    }
```
(…But doesn't templating out JSON kind of stink?)
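For what it's worth, rendering such a template is straightforward – a sketch, assuming jinja2 and per-target results that have already been serialized:

```python
import jinja2

# the configured `persist` block from above
persist_template = """\
{
  {% for name, result in results %}
  "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
  {% endfor %}
}
"""

document = jinja2.Template(persist_template).render(results=[
    ('google.com', '{"rtt_avg_ms": 10.1}'),      # illustrative, pre-serialized results
    ('wikipedia.org', '{"rtt_avg_ms": 24.8}'),
])
```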
As a user of Netrics – either a "legacy" user or "new" – I expect the software to ship with a collection of built-in measurements (at least those available in the previous software). As such, existing measurement code, of potential use to us or to others, must be ported to the new Netrics framework.
This work is largely one of deletion rather than addition – measurements should be boiled down to their essential functionality, (without retry logic, etc.).
Under the Netrics framework, all measurements are simple executables.
- Builtins will be installed with names prepended by netrics-, e.g. netrics-ping. This convention will make their purpose clear, have the effect of "bundling" them despite their separate existence, and these names will be given special treatment in configuration: (references to the measurement ping will imply the executable netrics-ping, whether netrics-ping is a built-in or not).
- However, builtins needn't be named this way in the codebase – (it is trivial to apply this convention upon installation) – within the codebase, ping is fine.
- The "ideal" or most simple measurement is a standalone script: Python or otherwise. (It must handle its own execution via shebang, etc.) Installation will copy/link each built-in measurement script to the appropriate path and under the appropriate name.
- That said, Python-based measurements are free to be more sophisticated than standalone scripts. (So long as there is a module with a single function to invoke without arguments – e.g. main() – then installation may trivially construct a script to invoke it.) This might allow our builtins to share code (to be "DRY"), etc. Such a package of built-in measurements might be contained within the netrics package (at src/netrics/measurement/builtin/ or otherwise). (How this looks remains to be seen and is up to the implementer.)

Upon execution, the measurement's contract with the framework is as follows:
- The measurement performs its work (e.g. if it's ping, then run ping.) If there's an exception, the return code should reflect this.
- Results are written to stdout (via print, echo, etc.) – presumably in JSON format, but the framework does not (at this point) care.***

*** (I'm really split on the best data format for this. JSON input and output make sense; and, if it's JSON then you'd just: json.load(sys.stdin). Obviously this is more natural if the main measurements config is itself JSON or YAML. But, I think TOML is friendlier than YAML for such files. Regardless, the framework can send the measurement config in whatever format, and even let it be configurable. To the framework code, there's nothing exceptional about reading TOML and writing JSON; though, to human beings involved on either end, it might be a little odd.)
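Putting that contract together, a built-in could be as small as the following skeleton (a sketch only – the flags, defaults, and result fields are assumptions, not settled API):

```python
#!/usr/bin/env python3
"""Minimal measurement skeleton: config on stdin, results on stdout,
diagnostics on stderr, status via exit code."""
import json
import subprocess
import sys

PARAM_DEFAULTS = {'target': 'google.com', 'count': '3'}  # illustrative defaults


def main():
    # configuration arrives on stdin (real modules would guard against
    # missing/malformed input, as discussed above)
    params = dict(PARAM_DEFAULTS, **json.load(sys.stdin))

    proc = subprocess.run(('ping', '-c', params['count'], params['target']),
                          capture_output=True, text=True)

    if proc.returncode == 2:
        # ping itself failed: report on stderr and exit non-zero
        print(proc.stderr, file=sys.stderr)
        sys.exit(proc.returncode)

    # otherwise report results (JSON here, though the framework doesn't yet care)
    json.dump({'target': params['target'],
               'returncode': proc.returncode,
               'raw': proc.stdout},
              sys.stdout)


if __name__ == '__main__':
    main()
```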
json.load(sys.stdin)
. Obviously this is more natural if the main measurements config is itself JSON or YAML. But, I think TOML is friendlier than YAML for such files. Regardless, the framework can send the measurement config in whatever format, and even let it be configurable. To the framework code, there's nothing exceptional about reading TOML and writing JSON; though, to human beings involved on either end, it might be a little odd.)