intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0
72 stars 49 forks source link

Slots list introduction #132

Closed ktanska closed 1 year ago

ktanska commented 1 year ago

Introduction of new sysfs list - list of slots, which are available for all controllers supported now (VMD, NPEM, SCSI). Use list to execute slot commands.

mtkaczyk commented 1 year ago

@tasleson could you please do some testing with enclosures?

Thanks, Mariusz

tasleson commented 1 year ago

@tasleson could you please do some testing with enclosures?

Yes, I'll test the SES enclosure paths

Quick testing so far showing the following:

Compile warnings

ledctl.c: In function ‘find_slot’:
ledctl.c:683:13: warning: the comparison will always evaluate as ‘true’ for the address of ‘device’ will never be NULL [-Waddress]
  683 |         if (slot_req->device && slot_req->device[0] != '\0')
      |             ^~~~~~~~
ledctl.c:126:14: note: ‘device’ declared here
  126 |         char device[PATH_MAX];
      |              ^~~~~~
ledctl.c:685:18: warning: the comparison will always evaluate as ‘true’ for the address of ‘slot’ will never be NULL [-Waddress]
  685 |         else if (slot_req->slot && slot_req->slot[0] != '\0')
      |                  ^~~~~~~~
ledctl.c:131:14: note: ‘slot’ declared here
  131 |         char slot[PATH_MAX];
      |              ^~~~

SES is running into issues

$ sudo ./ledctl --list-controllers
ledctl: SCSI: Invalid slot identifier index P�x
ledctl: SCSI: Invalid slot identifier index ��z
/sys/devices/pci0000:00/0000:00:1f.2 (AHCI)
/sys/devices/pci0000:00/0000:00:01.0/0000:09:00.0 (SCSI)
/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0 (SCSI)
/sys/devices/pci0000:00/0000:00:11.4 (AHCI)
tasleson commented 1 year ago

@mtkaczyk This PR breaks SES considerably, the changes to correct will be non-trivial. Let me know if you want me to supply changes for this PR or submit a new PR after this change is integrated into the repository.

ktanska commented 1 year ago

I'm working on implementation, and unfortunately I haven't got hardware to check SES functionality. I would appreciate if you can find root cause of list-controllers issue and add fix to this PR. For VMD controllers list there aren't errors like these. I will work on the first issue, I haven't receive this error. Probably I use another gcc version or build flags.

tasleson commented 1 year ago

I'm working on implementation, and unfortunately I haven't got hardware to check SES functionality. I would appreciate if you can find root cause of list-controllers issue and add fix to this PR. For VMD controllers list there aren't errors like these. I will work on the first issue, I haven't receive this error. Probably I use another gcc version or build flags.

I used gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) with no changes to build flags.

Please elaborate on what problems/enhancements this PR addresses. I'm also trying to understand the benefit of replicating the following information for every slot:

    /**
     * Controller type which is being represented by slot.
     */
    enum cntrl_type cntrl_type;

    /**
     * Pointer to the set slot function.
     */
    status_t (*set_slot_fn)(void *slot, enum ibpi_pattern state);

    /**
     * Pointer to the get led state function.
     */
    enum ibpi_pattern (*get_state_fn)(void *slot);

For my test configuration, this results in 96 copies of the same values.

ktanska commented 1 year ago

We came up with the conclusion that instead of filtering paths and devices, it will be simpler to use list and more efficient to pass list element every time. That allowed us to remove multiple similar branches across code for every single controller and command and indeed that works for vmd and npem. Moving slot commands pointers and controller_type to slot_property was intended to have a clarity, which struct is responsible for the triggered slot operation at given stage. I planned for the slot_request to only have parameters given by the user. This data is validated and used to find the correct slot_property from the list. Then only proper slot_property element is used to continue work on given command. I'm open to change way of keeping this repeated information, but I would like to not take them back into the slot_request. I will think about new solution, after fixing reported errors.

tasleson commented 1 year ago

@mtkaczyk @ktanska The current code

static ledctl_status_code_t list_slots(struct slot_request *slot_req)
{
    ledctl_status_code_t status = LEDCTL_STATUS_SUCCESS;

    switch (slot_req->cntrl) {
    case CNTRL_TYPE_VMD:
    {
        struct pci_slot *slot;

        list_for_each(sysfs_get_pci_slots(), slot)
            status = get_state_for_slot(slot->sysfs_path, slot_req);
        return status;
    }
    case CNTRL_TYPE_NPEM:
    {
        struct cntrl_device *ctrl_dev;

        list_for_each(sysfs_get_cntrl_devices(), ctrl_dev) {
            if (ctrl_dev->cntrl_type != CNTRL_TYPE_NPEM)
                continue;
            status = get_state_for_slot(ctrl_dev->sysfs_path, slot_req);
        }
        return status;
    }
    case CNTRL_TYPE_SCSI:
    {
        struct enclosure_device *encl = NULL;
        char slot_id[PATH_MAX];
        list_for_each(sysfs_get_enclosure_devices(), encl) {
            for (int i = 0; i < encl->slots_count; i++) {
                snprintf(slot_id, PATH_MAX, "%s/%d", encl->dev_path, encl->slots[i].index);
                status = get_state_for_slot(slot_id, slot_req);
            }
        }
        return status;
    }
    default:
        return LEDCTL_STATUS_NOT_SUPPORTED;
    }
}

Does not translate to this:

static void _scan_slots(void)
{
    struct pci_slot *pci_slot;
    struct cntrl_device *cntrl_device;

    list_for_each(sysfs_get_cntrl_devices(), cntrl_device) {
        // now NPEM and SCSI are supported
        if (cntrl_device->cntrl_type == CNTRL_TYPE_NPEM ||
            cntrl_device->cntrl_type == CNTRL_TYPE_SCSI)
            _slots_add(cntrl_device, cntrl_device->cntrl_type);
    }
    list_for_each(sysfs_get_pci_slots(), pci_slot) {
        _slots_add(pci_slot, CNTRL_TYPE_VMD);
    }
}

You're passing a cntrl_device pointer to a function enclosure_slot_property_init that is casting it to a struct enclosure_device *encl . Using a void * as a function parameter is preventing the compiler from helping prevent such errors. Furthermore, even if you pass the correct pointer to enclosure_slot_property_init you're missing the slot index for the slot you're initializing. This is what's currently encoding the index into the device path.

snprintf(slot_id, PATH_MAX, "%s/%d", encl->dev_path, encl->slots[i].index);

This PR is expecting to pass encl->dev_path everywhere assuming it has this, but it doesn't.

Also this PR does not account for multiple slots for an enclosure. I'm not sure how you want to address this and make it common across the different hardware.

What's the advantage of passing void * when you simply cast to a structure pointer to get access to a path variable? Why not continue to pass the path value instead?

mtkaczyk commented 1 year ago

The conception was to make a list of unique slots, as represent in --list-slots, if for SES this is encl->dev_path + encl->slots[i].index, the following entries will be generated and added to list. The list should represent slots in system whatever they are. I don't see a conceptional issue here yet. Do I miss something?

There are multiple goals to achieve by this PR, not related to SES (which is let's say "original" solution with no conflicts with other controller types). I would like list the motivators to follow this way:

  1. cmdline<->application isolation- at some point user input is compared with internal list element, and later list element is used instead.
  2. Bug fix - for npem and vmd there could be conflicts (#126 ) the chance was made to resolve it
  3. Avoid revered logic, for list_slot we are generating full list of slots, we want to do it same for get and set. Just get the generated list, find an element and that is it. No need to take path and start to search for slot by comparing lists- this make implementation worse and more bug friendly (and indeed we have a bug with VMD and NPEM now).
  4. Less code complexity, you can see that in PR summary, we removed more lines than added. It is good trend.
  5. Better error handling- we simply assume that whatever list provided is certain and additional validation on that is not needed. As a result we are able to isolate cmdline errors like improper slot really easy and quick.

I totally understand your point and agree and as a result application will occupy more memory. I'm also open to optimize the duplicated entries from struct slot_property. We can also discuss if we need to generate full list every time (because we are asking for particular controller everytime).

I think that in longer perspective it is simpler and better maintainable solution and that have big value for me.

tasleson commented 1 year ago

What kind of testing was done before this PR was submitted?

This looks incorrect and should fail if a user calls ledctl and doesn't use the slot functionality.

int npem_write(struct block_device *device, enum ibpi_pattern ibpi)
{
    struct cntrl_device *npem_cntrl = device->cntrl;

    if (ibpi == device->ibpi_prev)
        return STATUS_SUCCESS;

    if (ibpi < IBPI_PATTERN_NORMAL || ibpi > IBPI_PATTERN_LOCATE_OFF)
        return STATUS_INVALID_STATE;

    return npem_set_slot(npem_cntrl->sysfs_path, ibpi);
}
status_t npem_set_slot(void *slot, enum ibpi_pattern state)
{
    struct pci_dev *pdev = NULL;
    struct pci_access *pacc = get_pci_access();
    status_t status = STATUS_SUCCESS;
    const struct ibpi2value *ibpi2val;
    struct cntrl_device *npem_cntrl = (struct cntrl_device *)slot;

      ...

npem_write is calling npem_set_slot which is passing a sysfs path and npem_set_slot is casting it to a struct cntrl_device

Again void * usage is preventing compiler from helping here, and I'll reiterate what is the advantage of passing the struct cntrl_device * when the code simply references npem_cntrl->sysfs_path?

mtkaczyk commented 1 year ago

What kind of testing was done before this PR was submitted?

We are trying to change our model to opensource so only manual testing was done for now. First full review must be done and that is why I asked you to take a look especially on SES side (because we cannot validate it) Thanks for catching this!

Again void * usage is preventing compiler from helping here

Yeah, that is the risk we've taken but such nits like here are easy to determine.

I'll reiterate what is the advantage of passing the struct cntrl_device * when the code simply references npem_cntrl->sysfs_path?

There is no advantage because it is not a problem. In first case you need to refer->sysfs_path, in second case you have it as is. The difference is that you don't need to cast it from void. I see that. For NPEM and VMD all we need is a path but for SES we need struct enclosure and we need to determine ses_slot. That makes casting from void unavoidable for me. If we resolve initial wrong casting issues the rest should go smoothly. We can left is as now and search on lists for enclosure again, but it is overhead now because we generate full slot list first.

We can add some signature at the beginning of every void casted structure to determine if we are casting it correctly but I don't see a big risk here because decision is made based on cntrl_type and that is set by controller internal _init function.

tasleson commented 1 year ago

@ktanska @mtkaczyk I have some ideas for this PR which I think will improve the implementation a bit and address some of the concerns I have. I'll try to get something together by end of day tomorrow.

*Update: I have the initial changes done. SES is back to working and I removed the `void ` usage and removed the redundant data in each slot in the refactored slots API. I want to do a few more changes and do some more code review of the paths I cannot test to ensure I corrected them without introducing new ones.**

Where the change is currently sitting at, but I think I want to add a couple helper functions for ledctl to use instead of having it call the slot callback functions directly.

$ git diff HEAD~1 --stat
 src/enclosure.c | 74 ++++++++++++++++++++------------------------------------------------------
 src/enclosure.h |  6 +++---
 src/ledctl.c    | 10 +++++-----
 src/npem.c      | 44 ++++++++++++++++++++++++--------------------
 src/npem.h      |  6 +++---
 src/pci_slot.c  | 27 +++++++++++++--------------
 src/pci_slot.h  |  6 +++---
 src/slot.c      | 26 +++-----------------------
 src/slot.h      | 68 +++++++++++++++++++++++++++++++++++---------------------------------
 src/sysfs.c     | 35 +++++++++++++++++++++--------------
 10 files changed, 130 insertions(+), 172 deletions(-)
mtkaczyk commented 1 year ago

@tasleson

@ktanska @mtkaczyk I have some ideas for this PR which I think will improve the implementation a bit and address some of the concerns I have. I'll try to get something together by end of day tomorrow.

Cool! Monday is holiday in Poland so we will take a look on Tuesday. It gives you more time, no need to hurry, I'm impressed by you engagement, really appreciate that! That is great example of opensource power :)

Thanks, Mariusz

ktanska commented 1 year ago

Hi, thank you for your engagement. I really appreciate your help. For now, I will abstain from pushing changes to avoid merge conflicts. I will come back on Tuesday and continue working on this solution, and will be able to test your changes for vmd and npem. Thanks, Kinga

ktanska commented 1 year ago

Work is being continued here: #133

ktanska commented 1 year ago

Implementation was continued on #133 which was merged already. Closing this pr.