TLC: SPOF, current workaround and feature requests

phobos-storage / phobos

This repository holds the source code for Phobos, a Parallel Heterogeneous Object Store.

GNU Lesser General Public License v2.1

3 stars 2 forks source link

TLC: SPOF, current workaround and feature requests #7

Open thiell opened 1 month ago

thiell commented 1 month ago

Opening this for discussion and to provide some feedback. The introduction of the TLC is changing the way we manage phobos and adds complexity we did not originally expect.

Before:

each data movers could access our SpectraLogic tape library using a separate dev changer device via ADI as needed (one can be configured per drive)
a full drive failure with ADI did not impact the whole system (it only affected the phobosd instances accessing it)
a specific phobosd instance was not able to automatically switch to another lib_device (that limitation is still present with the TLC, see below)

Now, with the TLC being mandatory at this time, we are facing new challenges:

only a single TLC instance can be accessed from all data movers at a specific time, because of caching involved at the TLC level
it looks like the TLC configuration in phobos.conf only supports a single changer device, like phobosd before, but it is even more problematic now.

Our goal is to not have a single point of failure when accessing the library. We currently cannot have multiple TLCs accessed at the same time by multiple phobosd because of the local caching performed by the TLC. Our current workaround to a TLC SPOF is to set up a virtual IP using keepalived.service on a set of multiple data movers each running a TLC service, like this:

/etc/keepalived/keepalived.conf:

! Configuration File for keepalived

vrrp_instance TLC_VI_1 {
    state BACKUP
    interface eno8303
    virtual_router_id 51
    priority 255
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass ...
    }
    virtual_ipaddress {
        10.2.0.199/24
    }
    notify_master ...
    notify_backup ...
    notify_fault ...
}

With elm-ent-tlc set to 10.2.0.199, we then use the following in phobos.conf to configure the tlc service:

[tlc]
hostname = elm-ent-tlc
port = 20123
lib_device = /dev/sch0

However:

(1) because of the caching that each TLC daemon is doing, we cannot easily migrate the TLC IP to another data mover live. But at least it should work in case of the crash of the server running the TLC: the virtual IP immediately moves to another TLC which can answer requests. Having a way to disable all caching done by the TLC would be welcome here.
(2) correct me if I am wrong but a specific TLC instance seems to only supports a single lib_device, so if we loose access to the changer device used by the TLC, the TLC does not work anymore and all phobosd will be stuck. As a workaround to this, we configure each TLC service with a different lib_device and have a local daemon testing the tlc, but it's cumbersome. It would be nice if the TLC could manage multiple changer devices and switch to a working one automatically.

Hope that makes sense, let me know what you think!

thiell commented 1 month ago

Ok, I might have misunderstood how the TLC works... Our workaround to use a virtual IP using keepalived doesn't seem to work. hostname in the [tlc] configuration section seems to be used to specify the listening address, not the IP that phobosd will use to connect to the TLC. Can phobosd connect to a TLC running on a remote server at all? What's the recommendation here if we have multiple data movers running phobosd? Is that to run one tlc per phobosd on localhost, and let them all access the same lib_device? Thanks!

patlucas commented 1 month ago

Thanks Stéphane for opening this discussion.

We fully understand your concern about the fact that the TLC is introducing a SPOF in Phobos architecture. The TLC is the result of a trade off between :

on the bad side, the fact it is introducing a SPOF,
on the good side, it solves Library problems against concurrent parallel commands and it speeds up scsi commands by caching coherent Library state. (One of the Library model that we use here at CEA was unable to receive parallel concurrent commands. That is the main reason of the TLC development.)

You are right on the fact that we currently run the TLC instance against only one lib_device, defined in the "[tlc]" section of the phobos conf through the "lib_device=" parameter. We register your feature ask to have a TLC able to manage not only one lib_device but a list of several ones in order to be able to move to a next one when it faces an error on the current one. We will discuss soon inside the dev team this feature ask to schedule it between Phobos 3.0 or Phobos 4.0 . It was your point "(2)" .

Concerning your point "(1)", I guess I have correctly understood your multi-tlc architecture using virtual IPs. There is currently no way to disable all caching done by one TLC but there is a way to force a TLC to reload its internal state through a phobos admin command. So in your failover mechanism, you can try to add this command to refresh the cache of the resilient TLC instance before it receives new connections.

To try to answer your second message, the option "hostname=" and "port=" of the "[tlc]" section of the phobos configuration file are both used by the TLC server and the admin commands or phobosd daemons that want to connect to it. The TLC uses it to bind its listening socket. The admin commands or phobosd daemons use it to connect to the TLC socket. If you have multiple servers running one phobosd, our recommendation is that all phobosd connect the same single TLC instance.

But we currently have no mechanism to reconnect our client/server sockets if we face a connection problem. So, when you switch from one TLC to an other through the same virtual IP, I guess that the client will not try to reconnect to the new TLC. Making our socket communication layer be resilient against server restart is already a feature that we plan to develop in the future but it is not currently available.

martinetd commented 1 month ago

Hi Patrice!

I've got a bit of time available to work for Stéphane, would you mind if I send a patch to allow splitting the hostname= and port= settings?

Specifically, hostname= and setting= would keep their current meanings, still used in PHOCFG{ADMIN,LIB_SCSI}_tlc_hostname and PHO_CFG_TLC_hostname, but if listen_hostname and listen_port are set these would get priority only in src/tlc/tlc.c (so code-wise, add a new [tlc] listen_hostname/port with a NULL/-1 default, and if NULL/-1 was returned do a second PHO_CFG_GET with the current name -- it's just a few lines, just checking it makes sense with you before sending the patch) (EDIT: since I had basically written the patch describing this I pushed this as WIP: https://review.gerrithub.io/c/cea-hpc/phobos/+/1194809 -- I'll test & eventually update over the weekend, but any alternative/design-level comment it welcome and nothing is set in stone)

I'll also have a quick look at how hard to implement both (2) [for a given tlc, try multiple changer paths on error] and the last point you raised [for a given phobos client, try to ask another TLC if something went wrong]; but I'm not sure I'll be able to contribute productively so it might end at just looking at the code.

patlucas commented 1 month ago

Hi Dominique,

Because the Phobos architecture basically contains only one TLC, it was coherent to have only one parameter in the conf to both address the listening and connecting address of the TLC. But it's true that by adding the virtual IP layer between the client and the server, Stephane introduces the need to differentiate the listening address ("real one") of the TLC from the connecting address ("virtual one") of the clients (phobosd or admin commands). In this very specific case, the "listen_hostname" and "listen_port" that you suggest to add seem to be a good solution. Thank's a lot for this contribution. We will iterate other your patch on gerrithub.

martinetd commented 1 month ago

Thanks Patrice for the reviews and integrating the patch in your CI!

Long term we might want to look for a better way to collaborate:

There's old patches on gerrithub that never made it to master, I've just closed a four year old gerrit patch that was implemented differently, and I had forgotten about my 3 months old autotools ax_python script update so just redid the patch recently (now abandonned) ; I assume you cannot import patches automatically but having a script that syncs them manually after review might help?
conversely some patches spend a couple of weeks on your gerrit (comparing author/commit date), so if someone outside of CEA find a bug and work on fixing it we can never know if it's already been fixed in review; I'm not sure what can be done about this...

There's a few other open commits it'd be great if someone had time to check, but nothing in a hurry: https://review.gerrithub.io/q/project:cea-hpc/phobos+is:open - should I cc you (or Guillaume or anyone else on the team) when sending new patches or do you have notifications set up already?

But anyway, that's off topic, if there's anything I can do to help with that let's open a new issue to discuss but I'm not sure much can be done.

Back on track: the listen_hostname/port works, but binding on 0.0.0.0 without any restriction means someone could connect over the data network unless firewall rules are also added; it's possible for the tlc to bind to a specific interface with setsockopt (SO_BINDTODEVICE); I've tentatively added an extra listen_interface setting (tested to work): https://review.gerrithub.io/c/cea-hpc/phobos/+/1194917 / https://review.gerrithub.io/c/cea-hpc/phobos/+/1194918 ;

I've also started looking at the "if the changer gives an error retry on a different changer" code, and the scsi abstraction you've made in src/tlc/scsi should work quite well for this! So I think I could whip this up if you're interested, but it looks like you were planning for parallelism and not just retries, so my design might not be appropriate and I'd rather get feedback first. Basically, for parallelism there'd be quite a lot of work (probably creating one thread per mover as there's no async ioctl, then adding a work queue they'd feed on, then sending processed requests back somehow (and putting back failed work on the queue...) But if we only care about retry it seems fairly straightforward:

make [tlc]'s lib_device config item a comma separated list of paths like the other lists we have
right now tlc_library_open takes a device path, but it's always immediately preceded by a config get, so it'd make more sense to move the config get there...
at which point it can use strok_r to parse the list as we do elsewhere (or we could add a helper for that)
for each name try to open and keep a list of fd we managed to get, error if no fd was obtained (thinking linked list so we can reorder it easily)
change scsi calls to not take a lib->fd, but the list of fd; it'd always use the first item of the list in this single thread implementation
the scsi code already has a retry loop for all (most?) calls, where we classify errors are fatal retry short etc etc... will need to check, but most fatal calls would be converted to "retry next changer"; then the retry callback would just move the head of the list to point to the next drive so it could keep retrying. The tricky part will be to detect when we've tried all the drives to abort, perhaps have a dummy 'end of list' node to detect, that'd be kept in position while we're retrying (so if we encounter the dummy node we're done), but ensured to be last at the start (or end) of the loop

If you prefer to have parallelism right away I can give it a try along the lines of the short description I gave but I'm not sure I'll be able to finish quickly; or if you'd rather it's implemented in house for phobos 3 or 4 later and don't want me to touch this part of the code please tell me as well.

patlucas commented 1 month ago

About the way we can collaborate, we wait a local CI cluster and forge opened to outside contribution for next year. You may directly collaborate on it. Awaiting this solution, the simplest way is maybe to close the gerrithub repo and to use only github for issues and pull-request. We can review and iterate other the phobos-storage/phobos github pull-requests. When they are ready, we can test and integrate them on our local CI cluster and then push them back to github master .

For phobos, gerrithub is maybe an unnecessary additionnal tool, that is linked to the former github/cea-hpc/phobos repo.

martinetd commented 1 month ago

About the way we can collaborate, we wait a local CI cluster and forge opened to outside contribution for next year. You may directly collaborate on it.

That will be great!

Awaiting this solution, the simplest way is maybe to close the gerrithub repo and to use only github for issues and pull-request. We can review and iterate other the phobos-storage/phobos github pull-requests. When they are ready, we can test and integrate them on our local CI cluster and then push them back to github master

Sure, I can re-open my changes as github PRs.

I'll look if I can "close" the gerrithub repo somehow (if there's no option we can modify the config to forbid new changes to be opened)

patlucas commented 1 month ago

About adding the resilient feature to the TLC, there is effectively a "conflict" between trying to use several dev/changer and making the TLC able to run several commands in parallel. We originally plan to add the "parallel" feature to phobos 3.0 . We need to discuss we the dev team if we switch parallelism and resilience, or if we developp the both ...

I was initially thinking to let the TLC deals with the list of dev/changer and not the SCSI sub layer... The key point is to detect when a device is declared failed (definitlty failed ...remove from the list of device to use ...)

martinetd commented 1 month ago

I was initially thinking to let the TLC deals with the list of dev/changer and not the SCSI sub layer... The key point is to detect when a device is declared failed (definitlty failed ...remove from the list of device to use ...)

Since there is no async ioctl we'll need a thread per device; at which point each thread can decide for themselves if the device is good or not: for example regularly issue a no-op request, does opcode = TEST_UNIT_READY work for example? or if not just READ_ELEMENT_STATUS on something that ought to exist) and only accept work if status or recent commands were ok (and if we had a fatal error somewhere, re-check with status to decide if it was a changer device failure or not?)

With such a mode, I think if you do parallelism you'll "automatically" get resilience as well. If it's planned soon I shouldn't work on "just retries". I can have a stab at parallelism right away, I can probably get something to work-ish but I'm not sure I'll have much time for test (especially since I only have a VTL and I'm not sure quadstor implements multiple changers...) -- but once again I don't want to impose myself, if it's planned "soon" I can leave it to someone who wants to work on it.

patlucas commented 1 month ago

You can work on it ! :-)) Any help is welcome !

Concerning threads, do you imagine "permanent" thread ? We design phobosd around the idea of having one thread per device to achieve parallelism but out feedback on this design is not fully positive. We are reluctant to use again this design. We were maybe now more open to a design were we use thread only by spawning them per task ... It's open !

martinetd commented 1 month ago

Hmm, for regular read/writes I agree we now have better async APIs and it's better not to spawn a thread for every drive, but for ioctls there really is no way around thread as far as I'm aware. At this point having a pool of thread for requests and spawning more if more requests come is possible, but for changers I'm not sure it's worth it? Would a configuration with more than e.g. 4 changers make sense?

courrierg commented 1 month ago

Hello,

Like Patrice, I would prefer to avoid the "one thread per device" architecture. The TLC is simpler than phobosd but after the parallelization of phobosd, I don't want to repeat this mistake. :) One complication that I see is the management of the state of the library across threads. I wrote a prototype some time ago to implement an async task API for the parallelization of the TLC. If this is needed quickly, I can probably try to integrate this in the TLC. Most of the design for the parallelization has been done. This may not be too long to implement.

I agree with you on the fact that async ioctls don't work (or at least IIRC only some ioctl can support async operations but not the SG_IO one for sure). So there is no way around threads.

As for the number of changers, it depends on the library. It seems that IBM libraries handle parallelization through multiple changers (which is why we had issues when using only one changer). You can setup up to one per drive if you want. Then you can send one request per changer concurrently. So a big library with 40+ drives can have 40+ changers. Or at least, this is what we have understood after running some tests.

martinetd commented 1 month ago

One complication that I see is the management of the state of the library across threads.

Yes, internal TLC state will be the main problem -- I was thinking of making threads only for scsi calls that don't use the internal state at all, and keeping a single thread to process client requests and scsi replies once they're in, so this wouldn't require any new lock.

So a big library with 40+ drives can have 40+ changers

That's a bit surprising (I don't see how it'd make sense to have more changers than robot arms at least, but I could picture a very big library with 40 arms); but even then 40 idle "scsi threads" shouldn't be much of a problem in my opinion. What we should avoid is having one thread per TCP client (or worse pending request); it'd be cleaner to have a polling model where requests are processed as they come, and scsi replies unqueued as they become available. This avoids both the need for lock and it also avoids spawning an unbounded number of threads.

I haven't seen the phobos parallelization process but I know all too well how messy that can get, if you have a prototype I'll be happy to let you finish and help on review/tests instead -- it doesn't have to follow the model I described, I can take some time to check data accesses if that was your plan.

courrierg commented 1 month ago

Yes, my example with 40 changers is probably a bit too much. But some libraries (I don't know about IBM ones specifically) can also move several tapes with one arm. So in practice you can load more tapes than arms available in parallel. But this would still be a small number I agree. One changer per drive is probably only useful for resiliency.

I also agree on the one thread per request/connection. For phobosd, we only have one thread that handles all the requests and it didn't show any scaling issues. This thread is using epoll internally to manage client connections/messages. This is also what the TLC does even though it will have a much smaller number of connections than phobosd in practice.

courrierg commented 4 weeks ago

A little update. I have integrated my prototype in the TLC. The request management is now async but only one request at a time for now. You can easily build on this to add the management of a list of library paths. I think that having a list instead of a simple file descriptor as you mentioned in struct lib_descriptor is probably the simplest way of implementing this feature. And will allow the use of concurrent requests after.

For now, my code isn't public since it needs to go through reviews and testing. I don't know if you intend to work on this soon? Otherwise, there is also the locate at put feature mentioned in the issue #10 that can be developed if you are interested. Let us know what would help you to go forward.

martinetd commented 4 weeks ago

Thank you, great news!

From what I can tell Stanford doesn't seem to have performance problems (especially now we're redirecting requests on specific movers per tag there is very little tape movement; there are a couple of dozens of coordinatool patches on gerrit if you have time to review...), so having concurrent requests isn't a priority, but failover would be nice for safety.

I'm sure it can wait a few weeks though, let's have it go through internal review. I can also help with reviews if you push it to a branch on github or somewhere I can see & comment, I think github allows comments even if there's no PR, but it's a bit of a shame as I might comment on something you got internal comments on duplicating work so it's probably best if I just look after it's merged & public.

For #10 we're still interested in some QoS, I'm taking a short break but will comment on that issue again soonish to discuss fair share - we also had some weird behaviour with best fit picking I need to take time to reproduce & investigate... Well, there are things to do, don't worry :)