Closed krehm closed 6 years ago
We'll take a look at this. I think this code path is part of an older configuration interface that is being deprecated. The new interface doesn't have limits on reply sizes, and by extension error messages.
In looking at the master branch at gitlab@gitlab.opengridcomputing.com:ovis/ovis.git the problem appears to still exist. Perhaps the newer interface is in one of the other branches? (I'd like to see how it works.)
The fix should be merged to the master branch this week. Here is the merge request link. https://gitlab.opengridcomputing.com/ovis/ovis/merge_requests/545
I changed the signature of such functions that receive fixed-size string and receive char * err_buff and size_t err_buff_size. If the error message is larger than the err_buff_size, more memory will be allocated.
Change of plan. The fix that will be merged is https://gitlab.opengridcomputing.com/ovis/ovis/merge_requests/557
snprintf is used instead of sprintf to prevent buffer overflow.
The interface change patch isn't ready yet. It should be ready for review in a week or two. The interface protocol will use header that specifies the message length. The message header definition will be similar to the following.
typedef struct ldmsd_req_hdr_s {
uint32_t marker; /* Always has the value 0xff */
uint32_t flags; /* EOM==1 means this is the last record for this message */
uint32_t msg_no; /* Unique for each request */
uint32_t cmd_id; /* The unique command id */
uint32_t rec_len; /* Record length in bytes including this header */
} *ldmsd_req_hdr_t;
Due to the way that Cray packages products, the pathname to dynamically loaded samplers and stores can be quite long. If a sampler is not found, this results in an error message that is larger than the current 128-byte size of err_str[LEN_ERRSTR].
"-1dlerror /opt/cray/ldms/0.1-1.0000.292e049.0.0.el7/lib64/ovis-ldms/libcray_procdiskstats.so: cannot open shared object file: No such file or directory"
The result is a ldmsd daemon abort(). I doubled the size of LEN_ERRSTR to 256 in our local copy of OVIS, and this seems to solve the problem.
Program received signal SIGABRT, Aborted. 0x00007f43131851d7 in raise () from /lib64/libc.so.6 (gdb) where
0 0x00007f43131851d7 in raise () from /lib64/libc.so.6
1 0x00007f43131868c8 in abort () from /lib64/libc.so.6
2 0x00007f43131c4f07 in __libc_message () from /lib64/libc.so.6
3 0x00007f431325f047 in __fortify_fail () from /lib64/libc.so.6
4 0x00007f431325f010 in __stack_chk_fail () from /lib64/libc.so.6
5 0x000000000040a166 in process_load_plugin (
...