ovis-hpc / ldms

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.
https://github.com/ovis-hpc/ovis-wiki/wiki

LDMS lustre support #20

Closed: morrone closed this issue 5 years ago

morrone commented 5 years ago

We would like to see LDMS updated to work with the recent generations of Lustre. I don't know the full extent of what we'll need yet, but I figure it is worthwhile to begin the conversation with what we have found so far.

In the past Lustre made most of its stats available through entries in /proc. At the time, that was pretty much the only option available under Linux. These days Linux provides /sys and debugfs. Lustre has been migrating from /proc to a combination of /sys and debugfs in recent major releases.

Rather than hard-code the various possible locations for the same information into something like LDMS, the recommended approach is to use the command "lctl get_param" to retrieve lustre values. For instance, in Lustre 2.8 the lustre version can be found in /proc/fs/lustre/version, but in 2.12 the version is found in /sys/fs/lustre/version. And who knows when exactly that transition happened.

The following command works in both versions:

quartz1$ lctl get_param version
version=lustre: 2.8.2_6.chaos
kernel: patchless_client
build:  2.8.2_6.chaos

opal1$ lctl get_param version
version=2.12.0_1.chaos

Note, however, that the "version" output did change at some point between 2.8 and 2.12.

In 2.12, many, but not all, of the files named "stats" have moved from /proc to debugfs. Of most immediate interest to us are the 'llite.*.stats' files. The lctl command unfortunately fails to find any of the entries that moved to debugfs, but let's assume that is a Lustre bug that will be fixed.

Next, LDMS seems to be collecting a great many values that probably don't have a great deal of interest to us. This isn't to say that someone, somewhere might not find them useful, but we would like to focus initially on gathering a smaller subset of information. I will break them down by Lustre node type.

On Lustre clients, we want to gather the information from the following commands:

lctl get_param version
lctl get_param jobid_name
lctl get_param jobid_var
lctl get_param 'llite.*.stats'

On Lustre MDS nodes we will want to gather:

lctl get_param version
lctl get_param 'mdt.*.md_stats'
lctl get_param 'mdt.*.job_stats'

On Lustre OSS nodes we will want to gather:

lctl get_param version
lctl get_param 'obdfilter.*.stats'
lctl get_param 'obdfilter.*.job_stats'

I can follow up with the details of what the output of each of these looks like.

morrone commented 5 years ago

The 'lctl get_param' command offers a couple of command line options that might make it easier to use from a program.

Here is the default output from the command:

lctl get_param 'llite.*.stats'

(I cut out many of the lines to make it more readable here; each place you see [cut] marks an omission.)

llite.ls1-ffff900dc5ae6800.stats=
snapshot_time             1550708127.963344 secs.usecs
read_bytes                32165924 samples [bytes] 1 52076544 702265596390
write_bytes               41662304 samples [bytes] 1 45015040 584881704533
[cut]
llite.lsh-ffff900dea997000.stats=
snapshot_time             1550708127.963544 secs.usecs
read_bytes                53008802 samples [bytes] 1 20611072 2475218690081
write_bytes               89288296 samples [bytes] 1 607523328 1099190079546
[cut]

We can see that the "*" matched two filesystems in this example. We could simply parse this output as-is. But if we found it easier, we might choose to first list the keys that match the pattern using "-N":

$ lctl get_param -N 'llite.*.stats'      
llite.ls1-ffff900dc5ae6800.stats
llite.lsh-ffff900dea997000.stats

And then we can follow up with individual lctl get_param commands for each key, and use "-n" to omit printing out the key that we already know:

$ lctl get_param -n llite.ls1-ffff900dc5ae6800.stats
snapshot_time             1550708450.764176 secs.usecs
read_bytes                32165928 samples [bytes] 1 52076544 702265598670
write_bytes               41662374 samples [bytes] 1 45015040 584881704718
[cut]
$ lctl get_param -n llite.lsh-ffff900dea997000.stats
snapshot_time             1550708461.987164 secs.usecs
read_bytes                53012485 samples [bytes] 1 20611072 2475245832397
write_bytes               89567767 samples [bytes] 1 607523328 1102279173709
[cut]
baallan commented 5 years ago

Hi Chris, @morrone thanks for the detailed input. To better document exactly what you want and where we can reliably get it in production lustre configurations (where we probably should not count on debugfs), please run

script -c 'strace -e trace=open,read lctl {the args needed to get all the things you want}' lustre.trace.1.txt

and attach the outputs to this issue. Multiple runs (and output file names) may be needed depending on the lctl syntax.

We are aware that the bulk of metrics from lustre are moving from /proc/sys to /sys per linux kernel standards. We do not yet have a patch that will update the existing lustre samplers to automatically adapt to whichever flavor of lustre they find themselves deployed with.

In the meantime, in the next release of ldms, there is a new sampler called 'filesingle' which can be configured to pull out arbitrary sysfs metrics, including lustre, based on a site-specific configuration file. This is handy for most lustre metrics as well as temperatures, voltages, fan speeds, and many other items of interest that do not currently have corresponding samplers.

morrone commented 5 years ago

Hi @baallan. It sounds like you are trying to avoid using lctl. Let's start there.

Lustre isn't just moving from /proc to /sys. It has also moved to debugfs for some of its most important metrics. For instance, the 'llite.*.stats' metrics that I listed above are vital for our collection requirements and they now live in debugfs in Lustre 2.12.

debugfs is in use on production systems here; it is not just used in testing environments. debugfs does have an issue that it is root-only access by default, which is obviously a problem. One solution in the works to address that issue is to make the Lustre information that is unsuitable for /sys available through netlink. But again, if we use lctl, that will all (in theory) be opaque to us and let us use the same command to look up a key regardless of the backend implementation.

The 'filesingle' sampler sounds nice and will be usable for some things in Lustre, but it will never cover some of the most important Lustre stats, because those stats cannot be easily represented in /sys. Even something seemingly simple like "lctl get_param version" looks in at least two different locations, and the output can be of two different forms.

Granted, it would be better if Lustre offered an API version of the lctl get_param/list_param commands. But it doesn't.

Is using lctl a deal breaker?

baallan commented 5 years ago

The ldms sampler plugin model could force-fit a system() call to lctl, but that's really not the way we like to do things. The 'L' in ldms is lightweight, but starting up a new process and loading libraries for every sample is decidedly likely to introduce application performance jitter at sampling frequencies in the neighborhood of 1Hz.

Traditionally, we take utilities like ibstat and lctl and refactor them into iterable plugin code. The strace approach bypasses the often awful top-down source code reading of layered utilities to discover what we're up against in terms of data source file locations and formats.

If there's a reasonable likelihood of stability in the lustre debugfs file formats (e.g. have they just split what used to be in /proc into /sys/fs/lustre and debugfs with some extensions?) then we can get the same data or new data from these kernel file systems without the lctl overheads. Creating a lustre210 or lustre212 plugin should be a mere matter of code if the debugfs and sysfs locations polled are stable.

Our worst case scenario, when all else fails, is to have a kernel module sampler that has direct access to the internal metrics instead of going through fs apis. But to be truly maintainable, that would really need to become part of the lustre code base.

The root access requirement for some metrics isn't a problem; infiniband and omnipath counters have similar restrictions. It would be useful to identify which lustre metrics can be obtained without root and allow users to sample those without root. A separate plugin, or separate plugin configuration option, would allow the root user to get all available metrics.

So if you have access to the various versions of lustre of evolving interest and can assemble the straces of the queries of interest, we can get a better estimate of what needs doing.

morrone commented 5 years ago

We would probably prefer not to have multiple plugins to configure. Unless perhaps they could all be active at the same time, and offer the exact same metrics. We just want to have Lustre support enabled and have it work regardless of which version of Lustre is installed.

I will start posting the underlying paths and file contents now.

baallan commented 5 years ago

The single-plugin route would be to have an option controlling what is collected; it would default to collect 'everything' and non-root users would set the option to collect only the public bits.

morrone commented 5 years ago

The following is all for Lustre 2.8. Later I will follow up with changes for other versions.

Lustre version:

$ cat /proc/fs/lustre/version
lustre: 2.8.2_6.chaos
kernel: patchless_client
build:  2.8.2_6.chaos

Note that the only value from this output that we actually wish to record is the value following "lustre:", in this case "2.8.2_6.chaos". It is especially reasonable to avoid recording the other two values because "kernel:" and "build:" are historical and have been removed from the output in later Lustre versions.
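
For illustration, here is a minimal C sketch of extracting just that value from the output of "lctl get_param -n version" (or from the version file itself). The function name is hypothetical; it handles both the 2.8 multi-line format and the 2.12 single-value format shown earlier:

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the bare Lustre version string from the
 * "version" parameter output. Handles the 2.8-style "lustre: <ver>"
 * first line and the 2.12-style bare value; the historical "kernel:"
 * and "build:" lines are skipped. */
static int parse_lustre_version(FILE *f, char *out, size_t outlen)
{
        char line[256];

        while (fgets(line, sizeof(line), f)) {
                char *val = line;

                if (strncmp(line, "lustre:", 7) == 0)
                        val = line + 7;          /* 2.8 format */
                else if (strchr(line, ':'))
                        continue;                /* kernel:/build: lines */

                while (*val == ' ' || *val == '\t')
                        val++;
                val[strcspn(val, "\n")] = '\0';
                snprintf(out, outlen, "%s", val);
                return 0;
        }
        return -1;
}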

I do not think that it is necessary for most people to poll the version very frequently. For our purposes, recording the version once per hour is likely more than enough. Once per day might even be reasonable.

Next, jobid_name and jobid_var. These are both simple single-line string values (possibly empty):

$ cat /proc/fs/lustre/jobid_name
$ cat /proc/fs/lustre/jobid_var  
CENTER_JOB_ID

Next, the llite metrics are found in paths following this general pattern:

/proc/fs/lustre/llite/*-*/stats

Examples:

/proc/fs/lustre/llite/ls1-ffff900dc5ae6800/stats
/proc/fs/lustre/llite/lsh-ffff900dea997000/stats

If LDMS does not do this already, I think we would want it to break up the directory name on the "-" (minus sign) and store the strings on either side as separate values in the metric. The string before the minus sign is the file system's name, and the hexadecimal string after the minus sign identifies the particular instance of llite mounting the filesystem on this node.
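
A minimal C sketch of that split (the helper name is hypothetical; it splits on the last "-" on the assumption that the hexadecimal instance suffix never contains one):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split an llite directory name such as
 * "ls1-ffff900dc5ae6800" into the filesystem name and the per-mount
 * instance identifier. */
static int split_llite_name(const char *dirname, char *fs, size_t fslen,
                            char *inst, size_t instlen)
{
        const char *dash = strrchr(dirname, '-');

        if (!dash || dash == dirname)
                return -1;      /* no "-" separator present */

        snprintf(fs, fslen, "%.*s", (int)(dash - dirname), dirname);
        snprintf(inst, instlen, "%s", dash + 1);
        return 0;
}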

Here is an example of the output from llite.*.stats:

$ cat /proc/fs/lustre/llite/ls1-ffff900dc5ae6800/stats
snapshot_time             1551296803.852516 secs.usecs
read_bytes                103002937 samples [bytes] 1 52076544 1725294659241
write_bytes               116376210 samples [bytes] 1 48603136 1612018314394
ioctl                     73879 samples [regs]
open                      198332 samples [regs]
close                     198332 samples [regs]
mmap                      6 samples [regs]
seek                      285340029 samples [regs]
fsync                     1038 samples [regs]
readdir                   6867 samples [regs]
setattr                   3575 samples [regs]
truncate                  13506 samples [regs]
flock                     1708 samples [regs]
getattr                   610950 samples [regs]
unlink                    62883 samples [regs]
symlink                   332 samples [regs]
mkdir                     1885 samples [regs]
rmdir                     627 samples [regs]
rename                    744 samples [regs]
statfs                    40106 samples [regs]
alloc_inode               117193 samples [regs]
setxattr                  7 samples [regs]
getxattr                  117781955 samples [regs]
getxattr_hits             8 samples [regs]
removexattr               14774 samples [regs]
inode_permission          8932747 samples [regs]

Note that this list is dynamic. This is not necessarily the exhaustive list of all possible values, nor will all of the values always be present.
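
For what it's worth, each line follows a simple "name count samples [unit] ..." shape, so a line parser could look roughly like the sketch below. The struct and function names are illustrative only, not taken from the LDMS tree; byte counters carry extra min/max/sum fields, counter-only entries do not.

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Illustrative parser for one line of an llite stats (or md_stats) file. */
struct stats_entry {
        char     name[64];
        char     units[16];
        uint64_t samples;
        uint64_t min, max, sum;   /* only filled for [bytes] entries */
};

static int parse_stats_line(const char *line, struct stats_entry *e)
{
        int n;

        memset(e, 0, sizeof(*e));
        n = sscanf(line,
                   "%63s %" SCNu64 " samples [%15[^]]] %" SCNu64 " %" SCNu64 " %" SCNu64,
                   e->name, &e->samples, e->units, &e->min, &e->max, &e->sum);
        if (n < 2 || strcmp(e->name, "snapshot_time") == 0)
                return -1;      /* skip the snapshot_time line and malformed input */
        return 0;
}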

morrone commented 5 years ago

Next, some MDS stats (again for Lustre 2.8):

The md_stats path pattern looks like:

/proc/fs/lustre/mdt/*-*/md_stats

Example:

/proc/fs/lustre/mdt/ls1-MDT0000/md_stats

This time there is no need to break up the "ls1" from the "MDT0000". The full "ls1-MDT0000" directory name is a unique identifier for the mdt, and can be stored along with the metrics gathered in the md_stats file.

The contents of md_stats look like:

$ cat /proc/fs/lustre/mdt/ls1-MDT0000/md_stats
snapshot_time             1551297474.287642 secs.usecs
open                      1677848510 samples [reqs]
close                     1663771686 samples [reqs]
mknod                     15275959 samples [reqs]
unlink                    162532 samples [reqs]
mkdir                     893715 samples [reqs]
rmdir                     1222 samples [reqs]
rename                    14078053 samples [reqs]
getattr                   92881694 samples [reqs]
setattr                   69249178 samples [reqs]
getxattr                  550515 samples [reqs]
setxattr                  481 samples [reqs]
statfs                    16118673 samples [reqs]
sync                      148 samples [reqs]
samedir_rename            14075412 samples [reqs]
crossdir_rename           2641 samples [reqs]

Again, like the stats from llite, this list is almost certainly dynamic. The values in the example may not be exhaustive, and they may not always exist when polled. (FYI, they probably only go away when the stats are cleared or when the service has been restarted).

Next I'll cover job_stats in another comment.

baallan commented 5 years ago

The stats files are what we currently collect, or at least an expected subset of the metrics that may possibly appear. Are not the ffff900dea997000 (instance) and version identifiers completely redundant with log file entries from lustre daemon activities? Looking forward to finding out where these files finally land in updated versions of lustre.

morrone commented 5 years ago

The "ffff900dea997000" is unique to (i.e. different on) every single mount on every single node. I'm not real clear on how LDMS presents these numbers, but including this value with the associated md_stats is pretty much a necessity if all possible configurations are going to be addressed. For instance, lets say that there is a filesystem named "ls1" on the server side. A single node might decided to first mount ls1 at /mnt/foo, and then also mount ls1 at /mnt/bar. In /proc, we would see something like:

/proc/fs/lustre/llite/ls1-ffff900dea997000/stats
/proc/fs/lustre/llite/ls1-ffff87880753dba0/stats

We might want to be able to tell one mount's stats from the other's. Granted, it is probably not the most common use case. But it has certainly happened around here accidentally, and we might want to be able to tell when it happens in the data.

These can probably not be easily correlated from log data.

The /proc/fs/lustre/version might be available from log data, but often the log data is going to be a completely different data collection path, and may or may not be easy to correlate. I suppose making the collection of /proc/fs/lustre/version optional would be reasonable if people do not wish to use it.

morrone commented 5 years ago

job_stats are a newer creation in Lustre and come with a different format. job_stats are output in YAML.

The path pattern for mdt job stats looks like:

/proc/fs/lustre/mdt/*/job_stats

Example:

/proc/fs/lustre/mdt/ls1-MDT0000/job_stats

And here is an example of the output (truncated to conserve space here):

$ cat /proc/fs/lustre/mdt/ls1-MDT0000/job_stats
job_stats:
- job_id:          lfs.56591
  snapshot_time:   1551301208
  open:            { samples:           0, unit:  reqs }
  close:           { samples:           0, unit:  reqs }
  mknod:           { samples:           0, unit:  reqs }
  link:            { samples:           0, unit:  reqs }
  unlink:          { samples:           0, unit:  reqs }
  mkdir:           { samples:           0, unit:  reqs }
  rmdir:           { samples:           0, unit:  reqs }
  rename:          { samples:           0, unit:  reqs }
  getattr:         { samples:       23903, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  getxattr:        { samples:         218, unit:  reqs }
  setxattr:        { samples:           0, unit:  reqs }
  statfs:          { samples:       26249, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  samedir_rename:  { samples:           0, unit:  reqs }
  crossdir_rename: { samples:           0, unit:  reqs }
- job_id:          df.43665
  snapshot_time:   1551301202
  open:            { samples:           0, unit:  reqs }
  close:           { samples:           0, unit:  reqs }
  mknod:           { samples:           0, unit:  reqs }
  link:            { samples:           0, unit:  reqs }
  unlink:          { samples:           0, unit:  reqs }
  mkdir:           { samples:           0, unit:  reqs }
  rmdir:           { samples:           0, unit:  reqs }
  rename:          { samples:           0, unit:  reqs }
  getattr:         { samples:        7232, unit:  reqs }
  setattr:         { samples:           0, unit:  reqs }
  getxattr:        { samples:         250, unit:  reqs }
  setxattr:        { samples:           0, unit:  reqs }
  statfs:          { samples:        7487, unit:  reqs }
  sync:            { samples:           0, unit:  reqs }
  samedir_rename:  { samples:           0, unit:  reqs }
  crossdir_rename: { samples:           0, unit:  reqs }

Fortunately, it appears that the "snapshot_time" stays fixed between multiple reads of the file if the data for that particular job_id has not changed. LDMS could potentially cache each job_id's snapshot_time and only dump new data when the time changes.
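
A minimal sketch of that caching idea in C (the fixed-size table and the names here are hypothetical, purely to illustrate the approach):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cache: remember the last snapshot_time seen for each
 * job_id and report only entries whose snapshot_time has advanced. */
#define MAX_JOBS 4096

struct job_cache_entry {
        char     job_id[64];
        uint64_t snapshot_time;
};

static struct job_cache_entry job_cache[MAX_JOBS];
static int job_cache_len;

/* Returns 1 if this job_id is new or its snapshot_time changed. */
static int job_stats_changed(const char *job_id, uint64_t snapshot_time)
{
        int i;

        for (i = 0; i < job_cache_len; i++) {
                if (strcmp(job_cache[i].job_id, job_id) != 0)
                        continue;
                if (job_cache[i].snapshot_time == snapshot_time)
                        return 0;               /* unchanged: nothing to dump */
                job_cache[i].snapshot_time = snapshot_time;
                return 1;
        }
        if (job_cache_len < MAX_JOBS) {
                snprintf(job_cache[job_cache_len].job_id,
                         sizeof(job_cache[0].job_id), "%s", job_id);
                job_cache[job_cache_len].snapshot_time = snapshot_time;
                job_cache_len++;
        }
        return 1;                               /* new job_id */
}

(A real implementation would also retire cache entries once a job_id ages out of the file, e.g. after job_cleanup_interval.)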

Each job_id's stats remain for a configurable period of time after the last change in its data, the current default being 600 seconds. The period of time is found in the following file:

$ cat /proc/fs/lustre/mdt/ls1-MDT0000/job_cleanup_interval
600

Unlike the previous "stats" file, the "job_stats" file lists zeroes so there is some hope that all of the possible values are listed in the example. Of course, the list may be different on other versions of Lustre.

baallan commented 5 years ago

re mount point instance numbers: the present plugin for lustre probably would fail (I haven't checked the code) in this use case of the same thing mounted on different points. The current plugin configuration requires the user to specify the mounts to monitor e.g. fscratch,gscratch and then the plugin looks for lustre/llite/gscratch- and lustre/llite/fscratch- expecting the match to be unique in each case. The plugin does not at all handle the case of mounts being dynamic. Up to now we have treated the instance number as irrelevant.

baallan commented 5 years ago

re job stats, isn't there a lustre pipeline to moving those stats out? Wouldn't you rather have the raw stats scanned at an interesting frequency instead of this job_stats summary file?

morrone commented 5 years ago

re job stats, isn't there a lustre pipeline to moving those stats out?

Yes?

Wouldn't you rather have the raw stats scanned at an interesting frequency instead of this job_stats summary file?

No, I don't think so. This is not a summary of the entire job, this is a breakout of the raw stats seen by that particular server component that were performed by RPCs tagged with a particular jobid. It is much, much easier to answer many common questions about who is doing exactly what to any particular server component using these stats.

Assembling the same information from raw stats on the client side would require orders of magnitude more data collection and computation, with probably little additional value. It also requires our client side data collection to be perfect, whereas we are far more likely to get comprehensive monitoring set up on the server side.

For instance, a common question would be: "who is beating up the third MDS in Lustre cluster X?". With job_stats collection, we can very quickly discover which user and/or job on which client cluster is involved. Without jobstats we first need to combine the metrics from many thousands of clients, and then follow up with a second effort to figure out who was the cause of the problem on that node.

The "who" isn't always as easy as it might seem. For instance, users have found ways to hammer servers from login nodes. Unless the the problem is still happening and we can catch the user in the act, we might still be in the dark as to the root of the problem. With jobstats, we could potentially have the user's uid or username as part of the lustre job_id string, so there we can easily associate exactly which user cause which operations on a particular server component.

morrone commented 5 years ago

re job stats, isn't there a lustre pipeline to moving those stats out?

Yes?

Or perhaps I should have said "no, that is what we are currently in the process of setting up, with LDMS as a planned key component".

morrone commented 5 years ago

The plugin does not at all handle the case of mounts being dynamic.

Can we change that? Mounts are, by their nature, somewhat dynamic. New filesystems are installed, filesystems go down for maintenance and are unmounted, etc. Changes are not frequent compared to LDMS's possible sampling frequency, but they are not at all uncommon at human scale. Requiring a reconfiguration that is error prone and likely to be forgotten by humans can lead to lost data. It would probably be much easier to just record data for whatever is found on that node.

baallan commented 5 years ago

re dynamic mounts, this is possible, but has implications for downstream because the data schema changes for the sampler. In the present case, the admin configures ldms to look for the maximum possible mount points (e.g. fscratch, gscratch, hscratch). If a mount is missing, the plugin reports 0s until the mount reappears. This lets us have a fixed schema with columns approximately like lustre_bytes_written#gscratch, lustre_bytes_written#hscratch. In the dynamic case we would have to add a 'device_name' column and eliminate the per-mount-point column naming. Switching to the device-independent schema isn't a big deal in the sampler code, but someone at the storage and search layer of data processing has to be aware and filter accordingly.

baallan commented 5 years ago

re jobstats, that's a compelling argument if indeed the lustre community is not providing the piping needed to get the local data out to their dashboards.

The dirty little secret of ldms samplers is that if they're 'just reading files', they're pretty easy to write. As with the llnl edac sampler, for a more or less entirely new sampler that would meet all your specifications I'd be happy to jointly develop (I can provide the ldms knowledge/code if you can contribute the lustre specific bits). I cannot yet promise to do so because the project leads may prefer for a new plugin development to be implemented by open grid computing.

baallan commented 5 years ago

what about lnet router statistics? We have a plugin that parses lnet_stats, but there are other sources of numbers. or is job_stats 90% of the battle in understanding what lustre servers are up to?

morrone commented 5 years ago

re jobstats, that's a compelling argument if indeed the lustre community is not providing the piping needed to get the local data out to their dashboards.

Right, Lustre has no official dashboard or centralized control/monitoring. Getting the metrics off of the individual lustre nodes that use lustre in any way is almost entirely left to tools outside of Lustre itself.

In the dynamic case we would have to add a 'device_name' column and eliminate the per-mount-point column naming.

Yes, that is what I was thinking. I kind of envisioned that is how we would have the tables in the final database anyway (although I am no database expert at this point).

One way to think of it: Assume we have 3 lustre file systems, A, B, and C, and three client clusters, X, Y, and Z. Cluster X mounts A and B, Y mounts B, and cluster Z mounts B and C. (This isn't actually very far from our production situation.)

In the current LDMS method we either need to have separate schemas for each cluster and then probably merge the resulting data together in the final central database (Cassandra in our case), or, if we make one LDMS schema that is shared between all of the clusters, then each cluster is more or less permanently reporting zeros for filesystems it may never actually have mounted.

what about lnet router statistics? We have a plugin that parses lnet_stats, but there are other sources of numbers. or is job_stats 90% of the battle in understanding what lustre servers are up to?

"lnet" is essentially the networking layer, so lnet metrics are somewhat orthogonal in concept to the higher level Lustre components like client, mdt, ost, etc. /proc/sys/lnet/stats is available and may be of use any nodes that run lustre, not just on nodes that have lnet routing enabled. So lnet stats from lnet routers do not tell us anything directly about what is happening on the server. It is sort of like watching the more general counters an ethernet switch. You know that data is moving, but you don't necessarily know what it is or where it is going.

In this first pass at setting up Lustre data collection, I am not worrying about lnet. It will probably be nice to have at some point in the future, but I think that it is a lower priority to us at the moment.

So yes, the combination of mdt.*.md_stats, mdt.*.job_stats, obdfilter.*.stats, and obdfilter.*.job_stats probably gets us 90% of what we care about for the servers.

morrone commented 5 years ago

As with the llnl edac sampler, for a more or less entirely new sampler that would meet all your specifications I'd be happy to jointly develop (I can provide the ldms knowledge/code if you can contribute the lustre specific bits). I cannot yet promise to do so because the project leads may prefer for a new plugin development to be implemented by open grid computing.

Thanks, I appreciate your help! When do you think you will know whether you will be working on a new plugin? Folks around here are eager for a Lustre monitoring solution, so I might try my hand at some LDMS plugin code if you think you won't be able to get to it in the near term.

morrone commented 5 years ago

In the dynamic case we would have to add a 'device_name' column and eliminate the per-mount-point column naming.

Would this approach work with LDMS v3, or would we perhaps require something like the vector support from LDMS v4 to make this work? It is not immediately clear to me whether the aggregator can get back a variable number of "rows" in v3, but it sounds like v4's vectors might make that possible?

baallan commented 5 years ago

Re v3 vs v4: vectors (which also exist in v3) don't really help in the situation where devices come and go dynamically. If the sampler implementer makes an ldms 'set instance' for each device monitored, then the aggregator will pull all these sets and (in csv terms) each device will get its own row. For a given time stamp, there will be N rows if there are N devices, and the device-label column will be a key.

The current lustre samplers make a wide row with column names suffixed with the device label (xmit_bytes#gscratch, xmit_bytes#fscratch), but this is an implementation choice from many years ago now.

baallan commented 5 years ago

Re scheduling/tasking, I will try to get some resolution on the work plan from Brandt/Gentile early next week. If you want to charge ahead and show what is possible and contribute the detective work on exactly which bits come from which files in the new layout of 2.10 (or each later version), it should be relatively simple to clone and adjust the existing lustre2_client to a lustre_210_client (assuming you know C). Switching from a wide data set to a multi-row scheme for dynamic devices is "ldms bits" that the ldms team can easily help with. Developing the parser for converting the job stats file to a useful C struct full of data is an easily carved out contribution.

Deciding how to do dynamic device add/remove discovery will get a little involved, but in the multi-row scheme it can be finessed by taking the list of possible device names and then simply not creating sets for devices which are currently missing. A periodic recheck of missing expected devices would be simple (no extra thread needed).
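
For what it's worth, the periodic recheck could be as simple as rescanning the llite directory on each sample. A minimal sketch (the path and the callback are illustrative; a real sampler would create or retire an LDMS set per discovered instance):

#include <dirent.h>
#include <string.h>

/* Illustrative discovery pass: report each "<fsname>-<instance>"
 * directory currently present under the llite tree. */
static const char *llite_dir = "/proc/fs/lustre/llite";

static int foreach_llite_instance(void (*cb)(const char *name))
{
        DIR *d = opendir(llite_dir);
        struct dirent *ent;

        if (!d)
                return -1;              /* no lustre mounts, or path moved */

        while ((ent = readdir(d)) != NULL) {
                if (ent->d_name[0] == '.')
                        continue;
                if (!strchr(ent->d_name, '-'))
                        continue;       /* expect "<fsname>-<instance>" */
                cb(ent->d_name);
        }
        closedir(d);
        return 0;
}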

baallan commented 5 years ago

@valleydlr @oceandlr Chris at LLNL has been helping us accumulate the technical requirements/desirements for a lustre210_client sampler in the github issue tracker. We now need some work planning (who/when/what). It seems to be a highish priority for llnl and I can say the same data would certainly be very useful at SNL. It seems the lustre core tools include no monitoring, so it's a great chance to make ldms look good.

morrone commented 5 years ago

Maybe you can help me understand the semantics surrounding the LDMSD_PLUGIN_SAMPLER API. It looks like config(), sample(), get_set(), and term() are the main call-ins to a sampler plugin (relevant to metric sets). Typically the schema and metric set(s) are created in config(), the data (metrics) in a metric set are updated in sample(), and the metric set itself is returned by get_set(). The metric set is destroyed in term().

So my main question is this: is there any safe time to change the metric set makeup (adding new metrics, deleting metrics) other than at config() time? There are no function calls that would seem to tell the plugin when the caller is done using the set that the plugin returned in get_set(). I don't see any locking at first glance, so it is not clear to me that it is safe to add or remove metrics from a set while it is in use by the main program.

Is it reasonable to assume that the set is no longer in use externally while in sample()? The ldms_transaction_[begin|end]() and ldms_set_is_consistent() functions seem to imply that it is not safe to assume that. They imply that get_set() can be called (or the returned set pointer can continue to be used) while sample() is being called.

How would we go about adding or removing metrics to overall metric set to allow dynamic device addition and removal?

tom95858 commented 5 years ago

Hi Chris,

Overall, it sounds like your understanding of the plugin API is largely correct. The get_set() function, however, is not used to retrieve the set or its contents. In fact, I don't think you'll find any callers of this function; it is a legacy left over from the very early days.

Sets are not typically accessed on the sampler node; they are made available for lookup by remote nodes (aggregators) that fetch and store the data. ldms_ls is often used to examine it as a user, but again this happens remotely over the ldms transport.

Metric sets should be updated in a transaction bounded by ldms_transaction_begin() and ldms_transaction_end(). While inside a transaction, the contents are considered "inconsistent" because they are being changed. If the set is fetched and the flags indicate it is inconsistent, the contents are disregarded and the data is fetched again.

In addition, generation numbers are updated whenever set data or meta data are changed so the client "knows" if the contents have changed. This is done principally because the meta-data is only fetched when it needs to be in order to minimize the network overhead required to remotely update a metric set. The meta-data are typically 5x or more the size of the metric values themselves. If the meta data are changed, the reader knows that it needs to refresh the meta-data, so these changes are handled "automatically" by the core LDMS logic.

If the schema itself is going to change, the set should be destroyed and recreated. This is because the client has to allocate and register memory suitable to receive the data via RDMA_READ from the producer. If the size of the set changes, then this needs to be redone. This step occurs in ldms_xprt_lookup(). Note that the client will be notified whenever sets are destroyed/created, so the client "knows" when sets come and go.

Other samplers that do this sort of thing will create a separate set for each mount point. The set's instance name and/or a field in the set data will indicate the mount point. Then as mount points come and go, so do the metric sets.
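
As a rough sketch of the transaction-bounded update described above (the header name, exact signatures, and the ldms_metric_set_u64() call shown here are assumptions that may vary between LDMS releases; treat this as pseudocode for the pattern, not the project's implementation):

#include "ldms.h"       /* header name is an assumption */
#include <stdint.h>

/* Update one metric of a per-mount set inside a transaction, so remote
 * readers either see the set as consistent or re-fetch it. */
static void sample_one_mount(ldms_set_t set, int read_bytes_idx,
                             uint64_t read_bytes)
{
        ldms_transaction_begin(set);    /* set is flagged inconsistent */
        ldms_metric_set_u64(set, read_bytes_idx, read_bytes);
        ldms_transaction_end(set);      /* set is consistent again     */
}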

Tom

tom95858 commented 5 years ago

Hi Chris,

An example of a sampler that may do something similar to what you're after is ovis/ldms/src/sampler/cray_system_sampler/dvs_sampler.c

Tom

baallan commented 5 years ago

@tom95858 would the upcoming 4.x release be a good time to get rid of the deprecated get_set api? It seems to have been last used, if ever, in 1.x. I'd be happy to do all the needed deletions pronto.

morrone commented 5 years ago

My question was perhaps a little more basic. I am trying to understand how anything outside of the plugin itself ever gets access to the schema and metric set that the plugin creates. Aside from the unused get_set() API, there doesn't seem to be anything explicit that makes the plugin's information available to the rest of the parent program that loaded the plugin. So I suppose that availability must happen as a side effect of some library calls that the plugin makes.

But I think that I have found my answer:

 * \brief Create a Metric set
 *
 * Create a metric set on the local host. The metric set is added to
 * the data base of metric sets exported by this host.

So a side effect of ldms_set_new() is that the set is also added to a local data base of sets.

The dvs_sampler.c file does look like a good example, thank you!

tom95858 commented 5 years ago

Hi Chris,

You are correct. However, there is a newer/preferred API that allows you to set access rights. It is called ldms_set_new_with_auth(). This creates the set and makes it visible to local clients (i.e. clients running in the daemon). Local clients can "find" a set by calling ldms_set_by_name().

In order to make the set visible externally (i.e. to ldms_xprt_lookup()), you must call ldms_set_publish().
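
A minimal sketch of that create-then-publish flow (the argument list shown for ldms_set_new_with_auth(), i.e. instance name, schema, uid/gid and permissions, is an assumption and may not match a given LDMS release exactly):

#include "ldms.h"       /* header name is an assumption */
#include <unistd.h>

/* Create a set with restricted access and make it visible both locally
 * (ldms_set_by_name()) and to remote lookups (ldms_xprt_lookup()). */
static ldms_set_t create_mount_set(const char *instance_name,
                                   ldms_schema_t schema)
{
        ldms_set_t set;

        set = ldms_set_new_with_auth(instance_name, schema,
                                     geteuid(), getegid(), 0600);
        if (!set)
                return NULL;

        ldms_set_publish(set);          /* now visible to remote lookup */
        return set;
}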

Tom

tom95858 commented 5 years ago

@baallan the plugin interface is being refactored and will be in master shortly.

baallan commented 5 years ago

@oceandlr @valleydlr SNL capviz expects to be going to the new lustre with April TOSS updates, so we will be lustre-dataless there until new lustre samplers are available.

morrone commented 5 years ago

Hi folks, it looks like I am hitting a problem described in issue #7.

With Lustre job_stats, I am creating metrics sets dynamically on demand, and also removing them when no longer needed. I would like to be able to create metrics sets again that have names that were already used, but ovis v3 doesn't seem to support that. The ldms.h file says for ldms_set_delete(): "The set will be deleted when all set references are released." I am not sure that this is really the case, because a new metric set with the same name can't be created for hours after the previous incarnation of the metric set was deleted.

Will ovis v4 make it possible to really delete metric sets? I see that there are new ldms_set_publish()/ldms_set_unpublish() functions and in issue #7 there was some talk of adding those functions and then allowing ldms_set_delete() to really delete the metric set.

baallan commented 5 years ago

When a set with the same name is recreated, is there some reason the new set would not end up having exactly the same metrics as the previous incarnation? Alternatively, could you not encode relevant identifiers in the set data and just use a series of numeric names for the similar sets? Then if a set is 'retired', you unpublish it and stick it in a cache for reuse; this would take care of the problem of a name that disappears forever.

With v3 we ran/run into a situation for multiple samplers where (on a compute node) all of /proc can simply disappear [due to linux bugs] and then reappear. It was an excellent thing, overall, that neither the local collector nor the aggregator deleted the set.

morrone commented 5 years ago

When a set with the same name is recreated, is there some reason the new set would not end up having exactly the same metrics as the previous incarnation?

I am not sure that I understand the question. Since the metric set will employ the same schema it will have the same metrics available. The values stored in some of those metrics will be different because the values are sampled at a different point in time.

Alternatively, could you not encode relevant identifiers in the set data and just use a series of numeric names for the similar sets? Then if a set is 'retired', you unpublish it and stick it in a cache for reuse; this would take care of the problem of a name that disappears forever.

I could implement a cache (sacrificing useful names for the metric sets as you suggest). But that is a not-insignificant amount of additional complexity that is not required by the task at hand (except as a workaround for a delete() function that doesn't delete).

If ldms_set_delete() will never delete by design choice, then shouldn't such a cache be implemented by the infrastructure rather than making each plugin (with dynamic data) reimplement the same caching behavior?

With v3 we ran/run into a situation for multiple samplers where (on a compute node) all of /proc can simply disappear [due to linux bugs] and then reappear. It was an excellent thing, overall, that neither the local collector nor the aggregator deleted the set.

For the lustre use case, I'm not sure that I see any issue with the set disappearing if/when the source of data disappears. In fact, we definitely want that to happen in some normal situations (an ost is "failed over" to another server node). Distinguishing between normal disappearance and the less likely buggy accidental disappearance of /proc (due to linux bugs) would be unnecessary complexity in our use case.

We still strongly desire an ldms_set_delete() that really, fully deletes the set.

oceandlr commented 5 years ago

When you say a "a new set with the same name can't be created" do you mean you do not see the new set in the ldms_ls output? or in the store?

morrone commented 5 years ago

We chatted in person this morning, but just to close the loop: ldms_set_new() returns an error when one tries to use a name that matches one that was deleted in the past.

morrone commented 5 years ago

@baallan , can you share what information SNL uses from the current lustre plugins? I wrote new ost and mdt plugins, and I'm about to start on a new client one. I plan to collect far less data than the previous plugin, just focusing on the values that are most likely of value. If I know which values SNL depends on, I might be able to incorporate them.

baallan commented 5 years ago

@morrone When things go wrong we would look at all of:

alloc_inode brw_read brw_write close create dirty_pages_hits dirty_pages_misses flock fsync getattr getxattr inode_permission ioctl link listxattr mkdir mknod mmap open osc_read osc_write read_bytes readdir removexattr rename rmdir seek setattr setxattr statfs status symlink truncate unlink write_bytes

On a routine basis we will look at (for stories of user/cache behavior) in lustre2_client:

close create dirty_pages_hits dirty_pages_misses fsync mkdir mknod open read_bytes rmdir statfs unlink write_bytes

If there were a counter akin to nfs numcalls for lustre MDS requests, we'd take that too.

If there were a sampler option "minimal" or some such, we would put the non-routine items in the non-minimal metrics for optionally enabled collection.

We don't presently run ldmsd on the lustre MDS and OS* servers, so I have no commentary on them.

For the lnet_stats plugin (or equivalent) we will look at everything. Our general approach is that anything which is a hardware error counter is always included.

Ldms includes features to suppress output to storage of metrics one doesn't want. We tend to collect everything in sight by default unless it's expensive (wallclock) to get the counter, or it's an enormous set of counters of low value except when profiling (e.g. per-core metrics on KNL nodes). Then the site can define storage policy.

morrone commented 5 years ago

Great! With the exception of "status", "osc_read", and "osc_write", those are all basic llite stats that I am planning to include. "status" is an artificial construct of the current plugin's design. osc_{read,write} don't exist in llite's stats as of lustre 2.8.

Do you only use the llite data? The mdc/osc data is what I have in mind to completely drop in my plugin. With collection from the servers, it seems like mdc/osc data will probably be rarely used.

I don't have any plan to touch lnet. Likely the old plugin still works.

morrone commented 5 years ago

We are going with the plugins at https://github.com/LLNL/ldms-plugins-llnl.