ovis-hpc / ldms

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.
https://github.com/ovis-hpc/ovis-wiki/wiki
Other
98 stars 50 forks source link

Discussion of new sampler multi-instance plugin API #444

Closed morrone closed 1 year ago

morrone commented 4 years ago

If I understand correctly, sampler_base.[ch] is assuming that there is only one metric set per schema. This can be seen in base_set_new(), where it overwrites base->set without checking the current value, and in base_data_t where there is only storage for single schema and a single set.

LDMS itself does not make this assumption. There can be many sets that are based on the same schema. So this implementation in sampler_base can be very limiting. It will be an issue, for instance, for #362 if those plugins wish to use sampler_base. I was considering using sampler_base for some of my external samplers as well, and this limitation currently prevents its use.

I am thinking of fixing this problem. While it would probably be better to enrich the sampler plugin API to make sampler_base obsolete, that is probably too much of a change in the near term.

My first thought is to break up base_data_t into a structure more associated with the schema, and another more associated with sets. But I may think of another approach as I go along.

It also occurs to me that sampler_base should perhaps offer more functions that directly wrap underlying sampler API functions. Trying to combine too many into fewer functions also reduces the functionality.

If I've missed anything, let me know. Otherwise, I may proceed with the work to break the 1-to-1 schema-to-set assumption in the near future.

morrone commented 4 years ago

Ah, a simpler way to fix this is probably to just not store the set in the base_datat, and instead pass it to base* functions as needed. The data in base_data_t is largly common amongst all sets that use the same schema. So a new data type may not be necessary. I'll see if that pans out.

morrone commented 4 years ago

Bug that I'll fix along the way: sampler_base calls ldms_set_publish, but doesn't ever call ldms_set_unpublish().

tom95858 commented 4 years ago

@morrone this is all refactored in the multi-instance plugin logic and was specifically done to manage this kind of thing. This will be back-ported from master to OVIS-4. It was scheduled to be done later rather than sooner, but maybe we can stage parts of the implementation that are less disruptive sooner.

morrone commented 4 years ago

@tom95858 , is all of that code on master?

Looking at master, I do see a number of things to really like about that code. Some of the headers seem to have much better comments, and I like the clear sorting of the plugins by type in the source directory and API.

I do have some feedback before that code lands:

Since we are breaking the API, can we please, please not return a struct of function pointers from the plugins? (And calling the function that returns that struct "new()" is less than desirable, but that would go away if we follow this suggestion).

Instead, lets just have a set of well-documented function names that a plugins export. This is much better for at least one key reason: It is much easier to add and deprecate functions to the API if they aren't in a struct. If the functions are passed in a struct, then a deprecated function pretty much lives forever (which we saw with the previous API). The entire struct needs to be deprecated to clear out old functions. Not so with individual functions. Old functions, presuming they are optional, can simply no longer be dladd'ed when they are not longer needed. As long as the API change is backwards compatible, there is no breakage due to passing pointers in a struct.

Likewise new API functions can be added without changing the struct published in the API.

It looks like the "base" metrics are rolled into the normal API, which is great. That is definitely where I was hoping things would go. But I really think this needs a change before it can land:

I admit I may not understand the code enough yet, but it raises red flags for me to see "create_set" and "create_set_group" returning the same data type of ldms_set_t. Is ldms_set_t a set or a group of sets? Both? Why not be clearer about what they are?

Can we stop hiding pointers in typedefs? I tend to think that is slightly evil, and I'm not alone on that.

I don't see why we are creating "ldmsd_plugin_inst_t". This type doesn't seem to serve a purpose. The plugins should simply return a pointer to their instance state in a "void ", and the type passed to them should also be void . Let the plugin decide on the name it gives to its state structure. It is opaque to the calling ldmsd anyway. Making it explicitly void * makes this clear, whereas giving it a know type implies to the reader that ldmsd knows the structure's contents. That is misleading.

I haven't found whether this implementation handles base metrics yet. But I might have comments on that as well.

All in all, there are lots of things to like about the new code. But I do think that it needs some additional work still before we adopt it as the new API. I'd like to engage in that conversation when you guys are ready (I know that will be post-4.3.4).

In the mean time, I can see that improving sampler_base on OVIS-4 would be wasted effort. So I'll probably just implement job_id handling in my external plugins on my own for now.

I will retitle this issue to be a place to track the discussion on the new plugin API.

tom95858 commented 4 years ago

@tom95858 , is all of that code on master?

Yes

Looking at master, I do see a number of things to really like about that code. Some of the headers seem to have much better comments, and I like the clear sorting of the plugins by type in the source directory and API.

I do have some feedback before that code lands:

Since we are breaking the API, can we please, please not return a struct of function pointers from the plugins? (And calling the function that returns that struct "new()" is less than desirable, but that would go away if we follow this suggestion).

The idea is that each plugin has multiple instances that are configured differently. The 'new' function returns an 'instance' of the plugin in which it can store it's state. Please feel free to propose a different name.

Instead, lets just have a set of well-documented function names that a plugins export. This is much better for at least one key reason: It is much easier to add and deprecate functions to the API if they aren't in a struct. If the functions are passed in a struct, then a deprecated function pretty much lives forever (which we saw with the previous API). The entire struct needs to be deprecated to clear out old functions. Not so with individual functions. Old functions, presuming they are optional, can simply no longer be dladd'ed when they are not longer needed. As long as the API change is backwards compatible, there is no breakage due to passing pointers in a struct.

Likewise new API functions can be added without changing the struct published in the API.

Know that the plugin interface has been laying around for close to a decade, there is detritus there. Shame on us for not being more proactive about cleaning it up.

dlsym() is not particularly zippy and calling it six gazillion times is certainly sub-optimal. Ok, well cache the symbol address ... where ... in a struct! We could certainly build that struct dynamically, but it feels a bit overwrought to me.

I'm inclined to be more diligent about keeping the code clean, current and simple. It is a noble goal to support a dynamic infrastructure; but to what end and at what cost? I understand the desire to support a plugin in perpetuity, but I'm not convinced that it's a worthwhile task.

It looks like the "base" metrics are rolled into the normal API, which is great. That is definitely where I was hoping things would go. But I really think this needs a change before it can land:

Sure, this is not set in stone, and review and change recommendations to the plugin interface are welcome.

I admit I may not understand the code enough yet, but it raises red flags for me to see "create_set" and "create_set_group" returning the same data type of ldms_set_t. Is ldms_set_t a set or a group of sets? Both? Why not be clearer about what they are?

Can we stop hiding pointers in typedefs? I tend to think that is slightly evil, and I'm not alone on that.

Ehh...sure. Sometimes it's evil. If the intention is that that type is an opaque handle, then I think wrapping it is not just OK, but preferred.

If it's just a short hand, and the caller accesses the contents of the object, then it is reasonable to argue that this is obfuscation. On the other hand, typing 'struct ldms_metric_value_s *' six hundred times instead of ldms_value_t might get us sued for causing some poor soul carpel tunnel.

I don't see why we are creating "ldmsd_plugin_inst_t". This type doesn't seem to serve a purpose. The plugins should simply return a pointer to their instance state in a "void ", and the type passed to them should also be void . Let the plugin decide on the name it gives to its state structure. It is opaque to the calling ldmsd anyway. Making it explicitly void * makes this clear, whereas giving it a know type implies to the reader that ldmsd knows the structure's contents. That is misleading.

I think some of the contents of the plugin instance are public and others are private. The ldmsd_plugin_inst_t is the public part.

I haven't found whether this implementation handles base metrics yet. But I might have comments on that as well.

All in all, there are lots of things to like about the new code. But I do think that it needs some additional work still before we adopt it as the new API. I'd like to engage in that conversation when you guys are ready (I know that will be post-4.3.4).

agreed.

In the mean time, I can see that improving sampler_base on OVIS-4 would be wasted effort. So I'll probably just implement job_id handling in my external plugins on my own for now.

I will retitle this issue to be a place to track the discussion on the new plugin API.

thanks @morrone

morrone commented 4 years ago

dlsym() is not particularly zippy and calling it six gazillion times is certainly sub-optimal. Ok, well cache the symbol address ... where ... in a struct! We could certainly build that struct dynamically, but it feels a bit overwrought to me.

I don't think that is overwrought at all. It is simple and straight forward to implents, and Its how I've seen it implemented on probably a dozen other software projects with plugins. There is a big advantage to keeping the struct in ldmsd and keeping a clean abstraction layer: we can change and rearange the implentation at will without impacting the API. Which is why so many other projects do it this way. I'm happy to work on that when the time comes.

I'm inclined to be more diligent about keeping the code clean, current and simple. It is a noble goal to support a dynamic infrastructure; but to what end and at what cost?

What I described pretty much is the end. A cleaner simpler API for plugin authors. The struct is an unnecessary added step for plugin authors, and just adds one extra layer of obfuscation in their implenatation. Plugins would be easier to read (because all of the API entry points would have the same name between plugins, instead of having the extra lookup table to deal with) and implement. And the cost is pretty darn minor to put the struct in the caller instead of the callee. Like I said, I think I'd be happy ( (and have the time) to work on it if we start this work sometime in the next few months.

I am speaking as an outsider that came in trying to write plugins for ldms. It was harder than it needed to be. I'm trying to lower those barriers for other developers to engage in ldms plugin development. Having a cleaner API (and documenting its semantics better) is in pursuit of that goal.

And if you don't want to keep the current API in perpetuity, I would think that is even more argument for making it easy to change.

Ehh...sure. Sometimes it's evil. If the intention is that that type is an opaque handle, then I think wrapping it is not just OK, but preferred.

Actually, I think its precisely the case in which the opaque handle becomes a bit evil. The problem is that in order to pass that handle around safely, I either need to just assume that it isn't a pointer and always pass its address in my code (so passing a pointer to a pointer for little reason except that it is opaque), or I need to peak under the opaque abstraction to find out that it is a pointer. It can also cause problems with const, if I remember correctly.

If you are tied to the idea of a typedef containing a pointer, what projects typically do is adopt a naming convention to relate that fact the typedef hides a pointer. Usually it is something along the lines of the typedef ending _ptr_t, or _p, something like that.

I think some of the contents of the plugin instance are public and others are private. The ldmsd_plugin_inst_t is the public part.

I think that would be an especially egregious case of hiding a pointer under a typedef then. It isn't opaque, so there is no reason for that to exist.

It seems to be to be bad practice as well to create structures and just have the first value in the struct be another struct, and assume that different users will access the pointer as different structs. That is about as clear as mud. At the very least we need a union here. And frankly it would be much, much better for the struct to just have a "void *data" in it to let the plugins put their state data there than to go with this approach of extending structs on the sly. That just isn't easy for anyone to read or understand.

I'm not even sure that we are guaranteed for that sort of struct extension to be safe under all compilers. If the compiler doesn't know that it is an implicit union, it might pack the structs differently.

tom95858 commented 4 years ago

Hi @morrone, I think we're talking past each other. It's not a union, it's the top part of a structure that is extended by software that inherits it, like this:

struct biffle { int common_a; double common_b; ... };

struct biffle_sub { struct biffle base; / my stuff / };

The interface passes around struct biffle *, the plugin implementations of biffle re-cast it as their biffle_sub.

This is not some union magic.

morrone commented 4 years ago

No, I do understand that. But I'm saying it isn't terribly good practice to do that. That practice is not clear or explicit to the reader of the code. A "void *data" in the struct is clear and explicit. A mentioned a union because that would at least be a way to explicitly combine two structs, extending one. Unions aren't magic, they are explicit. This is magic, because what is going on is not at all clear from the declarations, nor is it particularly well documented.

"void *data" is used in so much code out there because it is a clear and explicit expression of what is going on. Let's avoid reinventing the wheel. When we get overly clever, it just makes it more difficult to understand the code.

morrone commented 4 years ago

@tom95858 , I rather suspect that the way that this extends structs, in your biffle example, is not C standards compliant.

Doing this is C standards compliant:

struct biffle {
  int common_a;
  double common_b;
}

struct biffle_sub {
  int common_a;
  double common_b;
  /* my stuff */
}

The best I can figure is that what ldms is currently doing requires the -fms-extensions compiler flag for gcc and clang to make the compilers actually understand what is going on. So that is a non-standards compliant extension. And if that is true, then what LDMS currently does with casts to trick the compiler is truly the "magic" approach.

I think, however, that we should head towards standards compliance rather than adding that compiler flag.

tom95858 commented 4 years ago

Hi @morrone,

Perhaps I missing your point, but this is not non-standard C. This model is used extensively all over the Linux kernel and throughout LDMS. The base structure encapsulates the interface, and the provider specific bits follow the common interface.

sys/queue.h does this kind of thing too. The link structure is declared in the list element and the macros cast the list element using the provided data type along with offsetof to locate the link structure. Our red-black-tree does the same thing.

The compiler knows what the base structure is, and has no issue placing the structure in memory consistently; the field alignment rules are well defined in the standard for all data types.

In general, I prefer a specific type to void * because the compiler can check this kind of thing:

extern my_type_t my_func();
...
my_type_t v = my_func()

But rather than going back and forth on generalities, let's put specific proposals on the table for consideration.

morrone commented 4 years ago

Hi @morrone,

Perhaps I missing your point, but this is not non-standard C. This model is used extensively all over the Linux kernel and throughout LDMS. The base structure encapsulates the interface, and the provider specific bits follow the common interface.

sys/queue.h does this kind of thing too. The link structure is declared in the list element and the macros cast the list element using the provided data type along with offsetof to locate the link structure. Our red-black-tree does the same thing.

That isn't the same thing at all. That doesn't assume two different structures with dissimilar declarations will have the same layout. They don't assume that two different strutures will be type compatible.

The compiler knows what the base structure is, and has no issue placing the structure in memory consistently; the field alignment rules are well defined in the standard for all data types.

The C standard only guarantees that structures will be type compatible (laid out the same in memory the same), when the names and field types within the two structures are identical. By putting struct A inside struct B, the names of the variables inside B no longer match the names in A (because they go through the intermediate name that you gave the sub-struct-A). Thats where I think things may be departing from the standard C compliance.

As far as I know, to get guarantees beyond that (and not just rely on implementation details), one needs to use compiler macros or compiler command line options.

In general, I prefer a specific type to void * because the compiler can check this kind of thing:

extern my_type_t my_func();
...
my_type_t v = my_func()

But the LDMS plugins appear to cast their structures to void * before returning, which sort of undermines that argument.

Not to mention that when the data is a completely opaque ball of memory relevant only to the plugin, a void is exactly representative to the reset of the world what it is. Its just a pointer to some memory, we don't know or care what types are being used inside it. This idea of a "void data" in an API to allow the user to stash state is a very common idiom.

But rather than going back and forth on generalities, let's put specific proposals on the table for consideration.

I think I was specific earlier, but maybe some examples would help. The ultimate proposal is something like the following:

lmds_plugin_t ldms_sampler_plugin_type_get()
{
  return LDMS_PLUGIN_SAMPLER;
}

/* First call into a new instance of this plugin. Pass in the configuration options.
    Returns a pointer to the new instance's state (or NULL on error) */
void *ldms_sampler_plugin_create(ldms_sampler_conf_t *arguments) {
  /* parse the arguments, create schemas and metric sets as appropriate, store all internal state in a new struct 
     that will be returned as a void pointer. */
}

/* destroys the state returned by ldms_sampler_plugin_create() */
void ldms_sampler_plugin_destroy(void *state) {}

/* The daemon calls this when approproate to tell the plugin instance to rescan its environment
   And adjust its metrics sets to what it finds. Some plugins will do nothing here. Others deal with
   things that come and go, and this could be the place for them to refresh. Just an idea. */
int ldms_sampler_plugin_refresh(void *state) {}

/* sample */
int ldms_sampler_plugin_sample(void *state) {}

This may very well not incorporate some new ideas from the master branch, but I'm pretty sure that it could.

oceandlr commented 3 years ago

@tom95858 Up above is the comment that multi instance plugins might be back ported to OVIS-4. Has that happened or is it in the cards anytime soon? Multi-instance for stores would be particularly beneficial as there is a lot of logic around handling that all in one store and might also make some performance tuning easier. (Adding @valleydlr )