opensvc / multipath-tools


Failed to send SCSI registration to any LUN after mapping a large number of external LUNs. #45

Closed boposki closed 1 year ago

boposki commented 1 year ago

Procedure:

Step 1: Map 1024 LUNs (16 paths) from an external storage array to the host, then run rescan-scsi-bus.sh to create the 1024 disks.

Step 2: Manually issue a registration command to one of the LUNs. It fails with a timeout error:

mpathpersist -o -I -S 0x000000003320095c /dev/dm-117

The same registration sent with sg_persist succeeds:

sg_persist -o -I -S 0x000000003320095c /dev/dm-117

According to the error log, mpathpersist times out while sending the "save prkey" message to multipathd. This happens when reservation_key is configured:

defaults {
    path_checker            tur
    no_path_retry           18
    path_grouping_policy    group_by_prio
    prio                    const
    deferred_remove         yes
    uid_attribute           "ID_SERIAL"
    reassign_maps           no
    failback                immediate
    log_checker_err         once
    reservation_key         "file"  # this is the relevant setting
}

Root Cause: The reply packet is not received within the fixed 4-second timeout, because multipathd spends more than 4 seconds executing PARSE, which runs into vector lock contention with checkerloop.

#define DEFAULT_REPLY_TIMEOUT   4000

static int do_update_pr(char *alias, char *arg)
{
    ......
    condlog(2, "%s: pr message=%s", alias, str);
    if (send_packet(fd, str) != 0) {
        condlog(2, "%s: message=%s send error=%d", alias, str, errno);
        mpath_disconnect(fd);
        return -1;
    }
    ret = recv_packet(fd, &reply, DEFAULT_REPLY_TIMEOUT);
    if (ret < 0) {
        condlog(2, "%s: message=%s recv error=%d", alias, str, errno);
        ret = -1;
    }
    ......
}
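
To illustrate the failure mode, here is a minimal sketch of a receive with a fixed timeout. It is not the multipath-tools implementation: recv_reply_with_timeout is a hypothetical helper, and using poll() on a plain file descriptor is an assumption made only for illustration. If the peer has not answered when the timer expires, the caller gets an error even though the peer may still complete the request later.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of multipath-tools: wait up to timeout_ms
 * for a reply on fd, then read it. Returns the number of bytes read,
 * or -1 with errno set to ETIMEDOUT if nothing arrived in time.
 */
static ssize_t recv_reply_with_timeout(int fd, char *buf, size_t len,
                                       int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    /* Retry if a signal interrupts the wait (this restarts the full timeout). */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;          /* poll() itself failed, errno is already set */
    if (rc == 0) {
        errno = ETIMEDOUT;  /* the peer was still busy when the timer expired */
        return -1;
    }
    return read(fd, buf, len);
}
```

With a 4000 ms limit, any request that the daemon needs longer than 4 seconds to answer would fail in this way on the client side, regardless of whether the daemon eventually finishes it.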

Suggested Solution: Make the client wait for the uxsock_timeout value rather than DEFAULT_REPLY_TIMEOUT. That keeps the client consistent with the server and makes more sense: allowing for transmission delay, the client's wait timeout should be longer than the server's execution timeout. After that change, uxsock_timeout in /etc/multipath.conf can be raised above the default value, for example to 10 seconds.
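
The suggestion can be sketched as a small selection function. This is only an illustration under stated assumptions: client_reply_timeout and margin_ms are hypothetical names, uxsock_timeout_ms stands for the value read from multipath.conf and is assumed to be in milliseconds like DEFAULT_REPLY_TIMEOUT, and a value of 0 or less is treated here as "not configured".

```c
#define DEFAULT_REPLY_TIMEOUT 4000 /* milliseconds, as in the excerpt above */

/*
 * Hypothetical helper: choose how long the client waits for a reply.
 * uxsock_timeout_ms: configured server-side timeout (assumed milliseconds);
 *                    values <= 0 fall back to DEFAULT_REPLY_TIMEOUT.
 * margin_ms:         assumed allowance for transmission delay, so that the
 *                    client waits at least as long as the server may take.
 */
static unsigned int client_reply_timeout(int uxsock_timeout_ms,
                                         unsigned int margin_ms)
{
    unsigned int base = uxsock_timeout_ms > 0
        ? (unsigned int)uxsock_timeout_ms
        : DEFAULT_REPLY_TIMEOUT;

    return base + margin_ms;
}
```

For example, with uxsock_timeout set to 10000 and a 500 ms margin, the client would wait 10500 ms instead of the fixed 4000 ms.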

mwilck commented 1 year ago

I don't understand.

multipathd spent more than 4 seconds to execute PARSE

What does this mean? What do you mean by PARSE, and how is it possible that it took 4 seconds? Can you fix this by simply increasing the timeout?

Btw which multipath-tools version were you using?