RFC: fair_share_max per tag

phobos-storage / phobos

This repository holds the source code for Phobos, a Parallel Heterogeneous Object Store.

GNU Lesser General Public License v2.1

3 stars 2 forks source link

RFC: fair_share_max per tag #10

Open thiell opened 6 months ago

thiell commented 6 months ago

I believe we would greatly benefit from having limits of (local) drives that can be use but per specific tag.

Similarly to:

# Maximum number of LTO9 drives for read, write and formats respectively
fair_share_LTO9_max = 4,4,4

...but by tag :)

Note: for fair_share by tag, format would not be relevant, only drives for read and write.

Our use case is as follow: we have tapes with different/disjoint tags (for example: n0, n1, n2, n3) , and we are archiving files to all of them at the same time. We would like to have something like

# fair_share_max[<tag>] = <max_drives_read>,<max_drives_write>
fair_share_max[n0] = 1,1
fair_share_max[n1] = 1,1
fair_share_max[n2] = 1,1
fair_share_max[n3] = 1,1

So that for any tape mounted containing tag "n0", 3 other drives could remain available for the other tags. What do you think?

courrierg commented 6 months ago

Hello @thiell,

This sounds like an interesting feature. It also looks like something quite complex to implement. The fair share was implemented with 3 groups (read, write, format). What you are asking for is to extent this algorithm to an arbitrary number of groups (i.e. tags). This means that the current implementation of the fair share would need a lot of rework.

One of the things that may be a bit complex is the relationship between the fair share algorithm and the I/O schedulers. For the current fair share algorithm, we have a one-to-one mapping between the groups of requests and the schedulers. With your use case, this is a bit more complex. The issue is that the current I/O schedulers don't really care which drive gets the request. The fair share algorithm makes the repartitions between the schedulers and then the schedulers are free to use their drives as they want.

Which means that your use case would imply to have a new fair share in each I/O scheduler. Or something like this.

Could you describe the use case for this? Do you want:

to have a "hard coded" association between drives and tags (i.e. stored in the DB)? In other words, ensure that some tapes are only ever used by a given set of drives. This could be because some users paid for a given set of drives for example.
to have a more dynamic association (not hard coded but only in the configuration). This means that which drive gets associated to a tag will vary from reboot to reboot. It may also vary from tape to tape meaning that we could dynamically switch a tag from one drive to another during the uptime of phobosd.
to ensure some sort of QoS between users/groups? In that case, this QoS mechanism could be implemented at another level in phobosd. We could add this before the requests reach the I/O schedulers and the fair share algorithm for example. Which is not ideal since less requests in the I/O schedulers mean less opportunity for optimization. But this would perhaps be simpler to implement. Another solution would be to add those QoS mecanisms in the coordinatool. This is something that we plan on adding at some point (originally this year but our schedule may not allow it).

The other drawback of having a QoS before the current fair share is that we may miss some optimization opportunities in the grouped read algorithm. If a read is not sent to the scheduler because the user has reached his quota, we may have to do an additional reload which is arguably worse in terms of performance. So the QoS could actually have a non negligible impact on the overall performance.

Since this feature seems quite complex as you described it, it would be interesting to know the actual use case to see if another simpler solution exists.

One last question, would you still use the current fair share algorithm with this new one? Or just the new one?

thiell commented 6 months ago

Hi @courrierg!

Thanks for your answer. I think all 3 points that you described could be valid in practice! :)

A site might want to assign drives to be used only with specific tapes, like you said, because some users paid for drives and tapes. However, in my specific case, we do not need this to be strictly enforced. One could argue that it could be implemented already by using multiple Phobos DBs.
Your second point describes a situation close to the first one, but more dynamic by assigning tapes to drives using tags. It could definitely be a valid and useful use case. For our immediate use case however, we do not need to assign specific drives to tags, we're more looking at QoS.
Your last point is interesting and it makes sense to me to look at the coordinatool to implement something like that.

Now, I've done more testing and realized something that is now our main issue: there is no phobos_locate() done by the coordinatool for archive requests. This is very problematic for us with 4 data movers and 16 drives, because the combination of tags (up to 16, but more likely 8 in practice) that we have is higher than the number of drives per data movers (4). When we archive using different tags, the "tagged" hsm archive requests are randomly sent to a data mover by the coordinatool, which may or may not have a tape loaded matching the specified tag list, triggering a lot of tape movements.

I'm working with @martinetd to see how best we could add a locate feature by tags in the coordinatool. What we would like to achieve is this: if a tape is already mounted/in use in a drive by phobos, and its tags match the tags of an archive request, I would like the corresponding data mover (that has the drive and tape) to be used for this archive request.

I imagine this might not be how all phobos sites want to operate (if you have fewer tags, it might be best to use the maximum number of tapes in parallel, even with the same tags). But in my case, I want to minimize tape movement and regroup the writes with the same tag on the already mounted tape as much as possible. Perhaps we could do (1) locate-by-tags-on-archive + (2) fair share per tag at the coordinatool level.

And yes, I think the current fair share algorithm by tape model should be kept (basically it's already implementing some association between tapes and drives, the "tag" here being the model, eg. "LTO9"). In any case, I would still use that.

martinetd commented 6 months ago

Yes I think the coordinatool could easily get the list of mounted tapes/tag to check if a mover already has a valid candidate mounted; but if you want a fair share we'll need to do this at phobos level too as I think that if you send a request to a copytool it'll probably try to use any free drive that is locally accessible, so it'll probably mount more tapes if there are idle drives even if a candidate is already mounted (the copytool sees all the drives on the mover? I need to finish setting up the full Lustre/HSM chain to confirm this..)

So we need both:

coordinatool needs to query phobos on archive too, not just restore, to get any candidate that's already mounted (it doesn't have to be 1, could round-robin between n candidates and pick a free drive if there is less than n candidates mounted)
eventually need a way to tell the phobos copytool to wait for a specific drive, and not try to remount another tape that could be used for something else shortly afterwards

As first approximation though if you have enough nodes there are quick & ugly ways of doing this, e.g. dedicating a node for a tag for archives; that'll probably greatly limit the number of mounts/umounts you see.

thiell commented 6 months ago

To provide more context, we use Robinhood v3 that runs an archive policy and use tags extracted from the action_params of the target_fileclass, so our tags depend on the path of the file on the filesystem (we have standardized this part for our needs: risk classification, projects and minio erasure coding stripe index). Example below. I am not sure how exactly I could dedicate a data mover node by tag, unless I make use of archive_id perhaps, and run multiple coordinatools (one per archive_id), each targeting a specific data mover. 🤯 I've also been thinking running smaller hsm archive policies at the robinhood level, targeting fewer tags at the same time, to avoid tape movement, but that also has some other issues. That's why I think we definitely need a configurable locate by tag for archive.

FileClass mr {
    definition { tree == "/elm/*/mr" }
}
...
FileClass p-test {
    definition { tree == "/elm/*/*/projects/test" }
}
...
FileClass minio_n0 {
    definition { tree == "/elm/*/*/*/*/minio/n0" }
}
...
FileClass mr_test_minio_n0 {
    definition { mr inter p-test inter minio_n0 }
    lhsm_archive_action_params {
        risk = mr;                                           <<< tag passed to phobos when archiving
        project = p-test;                                    <<< tag passed to phobos when archiving
        minio_n = n0;                                        <<< tag passed to phobos when archiving
    }
}

This is our Robinhood v3 HSM archive policy rule, with up to 16 combinations of tags:

lhsm_archive_rules {
    ignore_fileclass = system;
    # Do not archive MinIO system/temporary files
    ignore_fileclass = miniosys;

    # Common archive rule for minio files
    rule archive_minio {
        #
        # Policy rule that targets specific classes - they must be disjoint!
        # Matching action params will be automatically resolved and passed
        # to the action command as Phobos tags.
        # IMPORTANT: make sure the tags are pre-defined in Phobos!
        # 
        # mr/p-test
        target_fileclass = mr_test_minio_n0;
        target_fileclass = mr_test_minio_n1;
        target_fileclass = mr_test_minio_n2;
        target_fileclass = mr_test_minio_n3;
        # mr/p-test2
        target_fileclass = mr_test2_minio_n0;
        target_fileclass = mr_test2_minio_n1;
        target_fileclass = mr_test2_minio_n2;
        target_fileclass = mr_test2_minio_n3;
        # mr/hnc
        target_fileclass = mr_hnc_minio_n0;
        target_fileclass = mr_hnc_minio_n1;
        target_fileclass = mr_hnc_minio_n2;
        target_fileclass = mr_hnc_minio_n3;
        # mr/campus
        target_fileclass = mr_campus_minio_n0;
        target_fileclass = mr_campus_minio_n1;
        target_fileclass = mr_campus_minio_n2;
        target_fileclass = mr_campus_minio_n3;

        # Archive to Phobos with tags
        action = cmd("lfs hsm_archive --archive {archive_id} --data 'tag={risk},tag={project},tag={minio_n}' {fullpath}");

        # Common policy rule condition
        condition { size > 0 and last_mod >= 12h }
    }
}

martinetd commented 6 months ago

unless I make use of archive_id perhaps, and run multiple coordinatools (one per archive_id), each targeting a specific data mover

coordinatool accepts all archive ids by default, so this actually isn't that complicated (would need to start the final phobos copytools with one archive id per class and have robinhood target that), but that'd also make the mover restricted for restores and give less flexibility for the future (afaiu, once a file is stored with an archive id, the restores must use it)

I was thinking just adding a map describing if keyword x is found in the --data string then use mover y for achival; that's very easy to implement and doesn't require talking to phobos at all.

But once again, asking phobos where something using tag x is already mounted probably isn't that difficult; I think if we just implement it that way we'll end up in a similar situation. I'll check a bit more during this week.

courrierg commented 6 months ago

Yes I think the coordinatool could easily get the list of mounted tapes/tag to check if a mover already has a valid candidate mounted;

This is what phobos_locate does for gets. We planned on adding this for the puts as well.

but if you want a fair share we'll need to do this at phobos level too as I think that if you send a request to a copytool it'll probably try to use any free drive that is locally accessible, so it'll probably mount more tapes if there are idle drives even if a candidate is already mounted (the copytool sees all the drives on the mover?

Yes, this is true. The only I/O scheduler for writes is FIFO which tries to use all the drives available. If you want other scheduling strategies, we need a new I/O scheduler for writes. We also planned on having a better scheduler for writes. We should definitely discuss what your needs are.

So we need both:

* coordinatool needs to query phobos on archive too, not just restore, to get any candidate that's already mounted (it doesn't have to be 1, could round-robin between n candidates and pick a free drive if there is less than n candidates mounted)

We can easily extend phobos_locate for puts and have it return a list of hosts instead of just one. Then the coordinatool can pick the best one.

* eventually need a way to tell the phobos copytool to wait for a specific drive, and not try to remount another tape that could be used for something else shortly afterwards

This would be a new I/O scheduler.

The need for a tag aware locate at put and a new I/O scheduler for writes is something that is on our roadmap. This is not the first item of the roadmap though. :)

martinetd commented 5 months ago

Some loud thinking on this topic.

First, current requirements for Stéphane if I understood them correctly:

Files on lustre are created by minio and sharded (tag=n[0-3] in the config); this is for safety so tags must be on different tape
Above that, directories are split by project and risk, which are two orthogonal extra tags that also require their own tapes (I'm not sure what risk means in their phobos configuration but perhaps a level with raid1? and projects means files created by different entities where data cannot be mixed so that also require different tapes)

All in all, that's a lot of different populations, it's not surprising that sending requests at random generate a lot of tape mounts.

The current approach right now is:

the new coordinatool patches allow sending all archive requests for a n[0-3] tag to a given tape copytool
robinhood policy runs are scheduled sufficiently apart so that projects don't overlap

With this tape movements have been reduced nicely, but the policy scheduling is completely manual and will not scale. As soon as multiple policies overlap files for different projects are archived and phobos generates a lot of tape movement, which was the reason for this issue (original idea of limiting the number of drives for a project so if there's is a batch of requests for a given project all drives won't be remounted to that project). However, that idea means that if there is only a single project being archived most drives will stay idle, which isn't ideal either...

After discussing with Stéphane we think that can be better addressed by a higher level scheduling on the coordinatool: if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Given the other requirement for n[0-3], that means a two-level scheduling:

archives get queued in a different list per full hint (tag=riskA,tag=projectX,tag=n[0-3])
the queues themselves are in an ordered list
when a copytool requests work, it picks work from the first n[0-3] matching queue in that list
- the first time work is picked up on that queue, a timestamp is recorded
- after batch_duration, that queue is put back at the end of the list, and work can continue on the next list
- if new work comes in from lustre with a new tag, that also gets queued down at the end of the list, ensuring older work gets processed first as long as the work fits within batches (at first we discussed "deadlines", but there is no guarantee that there is enough bandwidth to process all requests in time before their deadline, and starting to mix things up will just make overall transfers slower -- the admin must ensure hsm.active_request_timeout is big enough to go through a few cycles of batches so requests get a few chances.

Doing something like that means that even if policies overlap, the mount operations will only happen when a batch times out, so phobos can spend more time writing. What do you think?

Meanwhile, for restore we're also not doing enough: we're currently doing a phobos locate when a requests comes in from lustre (if the tape is already mounted, send this request in priority there) and when we're scheduling a request to an agent (if it's mounted elsewhere don't send the request there but queue it in priority for that other agent), but that has two problems:

even if it's a priority list there's no guarantee the tape will still be mounted by the time it's sent
if a tape is mounted after the request has already been queued, the coordinatool won't know about it and these requests will stay scattered

To solve these problems I think the coordinatool shouldn't focus on movers, but sort requests by tape, so my current idea is pretty similar to the archive scheduler: more queues! one per tape!

queue requests per tape when they come in, the tape queues are ordered like for archives. Ideally the requests within a queue are sorted by position on the tape? Does that still make sense with current technos (and do we have the position on LTFS in phobos?)
when we start restoring from a tape assign it to a mover, mark the start timestamp and keep picking from the same n tapes (this must be configured: if we send only requests for a single tape then the mover won't be able to do other restores. Ideally we could ensure tags are different for QoS on restores, so a single project won't hog all drives, and given how minio works we also probably want to say that if tape for riskA,projectX,n0 is being mounted we want to priorize other riskA,projectX,n[1-3] tapes? That is probably overthinking? Will ignore that for now.)
After batch duration, move tape back; ideally the batch duration would be such that we never need to do that.

Ultimately there isn't much change needed in phobos here; for archives we have all we need directly from lustre so it's not going to be phobos dependant. For restores we'll need to get more details about each achive during the enriching phase, I'll check how much info we can get.

I'll probably be working with Stéphane on some of that these couple of months, feedback if welcome before I get to it :)

courrierg commented 5 months ago

After discussing with Stéphane we think that can be better addressed by a higher level scheduling on the coordinatool

One important thing to note is that Phobos' FIFO scheduler is not strictly FIFO. If a request cannot be handled right now, it will try the next one, then the next... So if you send two puts on tag p1 then one on tag p2, it is possible that Phobos will still do p1, p2 then p1 which would trigger a useless load/unload. This can easily be fixed with a flag in the configuration for example. But I think if you focus only on the coordinatool for the scheduling, you might have some issues with the current behavior of Phobos (which can be fixed of course).

if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Do you mean: if I have 3 puts on tag p1 and 2 on p2, the coordinatool sends the 3 puts then waits for them to finish then send the 2 for p2 then wait for them to finish... ? Or do you simply mean that you order the puts by project (and other tags)?

Given the other requirement for n[0-3], that means a two-level scheduling:
* archives get queued in a different list per full hint (`tag=riskA,tag=projectX,tag=n[0-3]`)

So each set of tags would mean a different list right? Not just a list per project for example. Meaning riskA,projectX,n0 riskB,projectX,n0, riskA,projectY,n0 and risk1,projectX,n1 would all be on different lists?

* the queues themselves are in an ordered list

* when a copytool requests work, it picks work from the first n[0-3] matching queue in that list

What do you mean by this? What is "the first n[0-3] matching queue"? Matching with what? The copytool?

  * the first time work is picked up on that queue, a timestamp is recorded
  * after batch_duration, that queue is put back at the end of the list, and work can continue on the next list
  * if new work comes in from lustre with a new tag, that also gets queued down at the end of the list, ensuring older work gets processed first as long as the work fits within batches (at first we discussed "deadlines", but there is no guarantee that there is enough bandwidth to process all requests in time before their deadline, and starting to mix things up will just make overall transfers slower -- the admin must ensure hsm.active_request_timeout is big enough to go through a few cycles of batches so requests get a few chances.

I think I understand this part and it makes sense to me.

courrierg commented 5 months ago

If you send a batch of puts to a copytool, you expect this batch to be stored on tapes that match the tags available. Since in Stéphane's configuration there are 4 drives per copytool, it means that your batch will be split roughly in 4 and written on 4 tapes using 4 drives (or less if there is mirroring involved). Depending on the use case, this may not be desirable. Spreading data across tapes for a single project increases the likelihood of having to load more tapes for restores. So in practice you may want to spread the load of the different projects on several drives instead to have a better data locality and reduce the work at restore. So if you have no mirroring for example, you could write files of 4 different projects on one copytool at the same time. If phobosd is able to do this, you just need a tag aware locate for archives in the coordinatool. You can simply send all the writes you want (using the answer from locate) to the copytools without worrying about unnecessary load/unload. We already do this for reads. Doing it for writes is (maybe) not that hard.

Meanwhile, for restore we're also not doing enough: we're currently doing a phobos locate when a requests comes in from lustre (if the tape is already mounted, send this request in priority there) and when we're scheduling a request to an agent (if it's mounted elsewhere don't send the request there but queue it in priority for that other agent), but that has two problems:
* even if it's a priority list there's no guarantee the tape will still be mounted by the time it's sent

* if a tape is mounted after the request has already been queued, the coordinatool won't know about it and these requests will stay scattered

A couple of things. This is not accurate. phobos_locate will actually chose a node to send the request to. The idea is this:

the coordinatool knows how many restore each client has and can roughly estimate which copytool has the smallest amount of work currently;
right before the locate, the coordinatool can find this "least busy" copytool and pass it as an argument to phobos_locate;
if the tape is already mounted somewhere, phobos_locate returns the name of the client and the coordinatool will send it to the right client;
otherwise, any client could be used to do the I/O but phobos_locate will choose the one given as an argument and "reserve" the tape for it (i.e. the least busy for example);
when the coordinatool finally sends the request to the copytool, since the tape is reserved, you can be sure that no one has used it so it will be able to load it and read from it;
now if another file restore on the same tape arrives after the reservation but before the first request is handled by the copytool (i.e. not loaded yet), since the tape is already reserved, the second locate will chose the same copytool and the request will end up in the same queue and be sent to the right copytool.

The only "bad" thing that can happen (unless I've missed something which is not unlikely), is that the copytool unloads a tape before the coordinatool is able to send a new restore on the same tape. But this issue can still happen with the solution you describe. If you build a queue of restores for tape A on the coordinatool, then decided to send them to the copytool and just after that you receive a new restore on tape A, you might miss the opportunity to not unload this tape. At this point, we are trying to predict the future. Might as well ask chat GPT. :) The grouped_read algorithm will manage the biggest queues first which means that the time window to send new restore requests grows with the number of restores on the same tape. So there is probably enough time before the tape is unloaded. We can also implement a timeout to wait before unloading a tape if needed/useful.

To solve these problems I think the coordinatool shouldn't focus on movers, but sort requests by tape, so my current idea is pretty similar to the archive scheduler: more queues! one per tape!
* queue requests per tape when they come in, the tape queues are ordered like for archives. Ideally the requests within a queue are sorted by position on the tape? Does that still make sense with current technos (and do we have the position on LTFS in phobos?)

The order of the request within a queue should definitely be the responsibility of Phobos. You can use RAO on IBM drives on LTFS but the interface is not user friendly (you need to write the raw SCSI request in a file and give the path to that file in a special xattr then the result is in another file...). Getting the position of the tape on LTFS is a tough question though. :) I still don't know if it possible. In theory it's not always possible since files could be fragmented but Phobos makes sure this does not happen. So we might be able to find the positions. I know there are a few LTFS xattrs that can return some information. We could also query the drive's head position after or before a put but I don't know how accurate this would be. Also, sorting by tape offset is not optimal on current tapes. So this is not a good idea to sort them even if the information was there.

* when we start restoring from a tape assign it to a mover, mark the start timestamp and keep picking from the same n tapes (this must be configured: if we send only requests for a single tape then the mover won't be able to do other restores. Ideally we could ensure tags are different for QoS on restores, so a single project won't hog all drives, and given how minio works we also probably want to say that if tape for riskA,projectX,n0 is being mounted we want to priorize other riskA,projectX,n[1-3] tapes? That is probably overthinking? Will ignore that for now.)

* After batch duration, move tape back; ideally the batch duration would be such that we never need to do that.

If you need implement this in grouped_read, this doable. The current algorithm processes all the requests on one tape before picking the next one. It also inserts new requests in the current queue. So if you keep receiving requests for this tape, it will not be unloaded until you have received all of them.

courrierg commented 5 months ago

A final thought. Doing the scheduling logic outside of Phobos is risky. The relationship between an object and its tapes is managed by Phobos and depends on the layout. If you have mirroring for example, each object will have n > 1 tapes associated to it. So you need to handle this case in the coordinatool. Now you can also have splits in Phobos. Meaning that a big object can be split in two parts (or more in theory). All of this logic is managed by phobos_locate. Moving this externally is probably not a good idea as this will expose Phobos' internals outside of its API. At some point, Phobos will also support HSM features. An object could be on disk or on tapes. It could be moved while the coordinatool is doing its scheduling. Managing tags inside the coordinatool for archives is probably fine though. But in the long run, I think this is a risky path to take at least for reads.

I would advise against doing too much optimization for reads in the coordinatool. I think currently, Phobos does what you need. Maybe we need to add the "batch_timeout" feature. For writes, we can implement a scheduler that manages requests per tags to avoid unnecessary loads. We can probably also do the same reservation logic for tags as we did for tapes on restores. We also plan at some point to have the possibility to group puts on a tape. So you could say "all the files of this directory should be stored on the same tape".

martinetd commented 5 months ago

I shouldn't have sent archives / restores in the same post, this makes replies hard to follow :P Trying to reply to archives points in this reply, will reply to restores next.

One important thing to note is that Phobos' FIFO scheduler is not strictly FIFO. If a request cannot be handled right now, it will try the next one, then the next... So if you send two puts on tag p1 then one on tag p2, it is possible that Phobos will still do p1, p2 then p1 which would trigger a useless load/unload. This can easily be fixed with a flag in the configuration for example. But I think if you focus only on the coordinatool for the scheduling, you might have some issues with the current behavior of Phobos (which can be fixed of course).

I think that is fine in practice as long as batches are big enough and that not all tapes are reused, e.g. it's not p1, p2, p1 but p1 x100 then p2 x100 when there is no more p1 requests available. So at the edge it's possible a p1 tape will be umounted for a p2 tape while some writes to p1 still wait, but hopefully we can get phobos to wait for the last few p1 writes and send them to the mounted drives? (hm, I guess that's not implemented so we have no such guarantee right now, coordinatool could optimize this by waiting until only n (drives available for archive on that mover) requests are in progress on the mover before sending requests for the next batch but I think that's micro-optimizing too much and in the raid5 case won't work anyway (don't know how many drives are used); so it's something phobos will have to be able to handle if that becomes a problem: it should know that there are still requests that could be fulfilled by currently mounted tapes so should prioritize sending to the currently mounted tape imo)

if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Do you mean: if I have 3 puts on tag p1 and 2 on p2, the coordinatool sends the 3 puts then waits for them to finish then send the 2 for p2 then wait for them to finish... ? Or do you simply mean that you order the puts by project (and other tags)?

I didn't think of waiting to finish; we can do this if that becomes a problem in the future but for now just ordering is fine. The problem isn't that at any given time there are 3 (p1)/2 (p2) requests, it's that if you keep getting a stream of requests from robinhood policies currently running you'll keep getting p1/p2 requests while the current p1 requests are processed. So we should keep p2 requests on hold until we've processed say 1h of archives for p1, then switch to p2 and keep any new p1 requests on hold for a while. Of course if there is no work for p1 left because we archived fast enough then there's no problem and p2 requests can be sent immediately, but archive requests coming from policy should come faster than we can archive immediately, so the list should never become empty here -- if it does then it means the tape movers are idle and we can afford the tape mount time.

So to rephrase the scenario is more like:

we get 100 requests on p1, 100 requests on p2
start scheduling p1, send 40 requests to copytools
get another 100 new requests for each (queue: 160 on p1, 200 on p2)
keep sending only p1 request for batch duration (e.g. 1h) or policy runs finish
after 30 minutes there are no more requests on p1, we got 3000 p2 requests pending so send them

A limit we can hit here is the max active requests lustre setting, this doesn't work as well if lustre cannot send all the pending requests to the coordinatool, but in this case there will still be a batching of the maximum number of requests at a time (if that is e.g. 10k and we switch because of it, we'll have at least 10k requests to process for other tag before coming back to p1), so for archives I think it's ok.

Another problem is the start of the batches; in practice I think you'll get the 100 p1 requests before you get any p2 requests, but if the policy runs start exactly at the same time you have few enough requests that the coordinatool might send them all (because there are enough movers available); in this case we might have something like send 3 p1 archives, 6 p2 archives, 9 p1 archive and finally by then we'd have enough requests pooled and a big batch of p2.. That will cause a few extra mounts, but I don't think that'll happen often in practice.

archives get queued in a different list per full hint (tag=riskA,tag=projectX,tag=n[0-3])

So each set of tags would mean a different list right? Not just a list per project for example. Meaning riskA,projectX,n0 riskB,projectX,n0, riskA,projectY,n0 and risk1,projectX,n1 would all be on different lists?

Yes, each would be different queues. My understanding is that they'll end up on different tapes (e.g. riskA would be normal writes and riskB would be mirror copy, so tapes won't be shared) the coordinatool cannot know how to split the different tags, we could teach it all the projects but that doesn't scale; I think it's best to regroup by fall tag:

even if they end up on the same tape ultimately there's no big loss, archivals aren't time sensitive
ideally I'd even go further and try to regroup as much as possible (e.g. per directory on lustre), as restores will likely be grouped by directory as well, but for now tags are enough; we can add more sorting criteria later once the code is in place and if that is really useful.

the queues themselves are in an ordered list

when a copytool requests work, it picks work from the first n[0-3] matching queue in that list

What do you mean by this? What is "the first n[0-3] matching queue"? Matching with what? The copytool?

I am talking about the new archive_on_hosts tag=n0 mover1 setting in the coordinatool -- mover1 should only take requests for tag=n0, so if the first 4 lists are p1,n1 / p3,n4 / p2,n0 / p1,n0 then when we schedule work for mover1 it should skip the first two n1/n4 lists and take work from p2,n0 first. I worded it as a search for each requests but in order to avoid splitting a project needlessly I think a mover will take the whole list, e.g. once mover1 has started taking work from p2,n0 then other movers will not be able to take work from it. Right now stanford only has one mover per n[0-3] anyway, but even if there are more I think we'll want to prioritize keeping all p1 data on the same few tapes rather than spread them on multiple tapes (through multiple movers -- depending on the phobos algorithm that might also cause tape stealing from one mover to the other as well if tape criteria make a mover think that tape is better...); but that is a detail.

If you send a batch of puts to a copytool, you expect this batch to be stored on tapes that match the tags available. Since in Stéphane's configuration there are 4 drives per copytool, it means that your batch will be split roughly in 4 and written on 4 tapes using 4 drives (or less if there is mirroring involved). Depending on the use case, this may not be desirable. Spreading data across tapes for a single project increases the likelihood of having to load more tapes for restores. So in practice you may want to spread the load of the different projects on several drives instead to have a better data locality and reduce the work at restore. So if you have no mirroring for example, you could write files of 4 different projects on one copytool at the same time. If phobosd is able to do this, you just need a tag aware locate for archives in the coordinatool. You can simply send all the writes you want (using the answer from locate) to the copytools without worrying about unnecessary load/unload. We already do this for reads. Doing it for writes is (maybe) not that hard.

Hmm that is a good point, is it better to force a max number of drives per tags so we get better data locality or focus on write throughput... Right now phobos does not have enough logic on writes, so there is no choice anyway, but if phobos becomes able to schedule writes a bit more smartly we should add a locate-like call during archival with (tag,size) as well:

we cannot guess parity (single/raid1/raid5) for a given tag, so cannot know how many queues a mover can handle; we want to avoid sending more work than can be writed in parallel to avoid remounts; so the locate call will need to return how many movers are consumed as well.
the queues index will just change from the lustre hint (tag=risk,tag=proj,tag=n) to some selected tape ID returned by the locate (for raid1/raid5 I assume the tapes are grouped? so that can be a leader tape ID or something)
that will allow reserving tapes (didn't know we do that for read, thanks!), so phobos won't umount a tape immediately if there is a short period with the request stuck in coordinatool but not sent yet to the mover

So I don't think it's incompatible, just a first step in the grand scheme: first some basic scheduling in coordinatool, then we can improve it.

martinetd commented 5 months ago

For restores:

A couple of things. This is not accurate. phobos_locate will actually chose a node to send the request to.

Ah, so this isn't so much a locate as a real scheduling call, that is good. In this case I agree phobos knows best, so the coordinatool implementation should be made simpler, but we still need some grouping imo; just with much shorter batch size than writes. The problem is that with restores if you have more requests on more tapes than currently processable, you'll still have to pick a mover multiple times, so we might end up with a queue like mover1: [file@tape1, file@tape2, file@tape3, file@tape1] -- even if the mover has been selected by locate we won't know about the file@tape1 hint and we'll cause that tape to umount/remount. So this is actually exactly like the "phase 2" of archivals, phobos locate should return not a mover but a (mover, tape ID, number of drives required) like for archival. The number of drives required here is a bit of a guess (for raid5 if a tape has problems it might get messy), but in general it should hold and allow scheduling just enough requests that can be processed in a generic way In case of HSM, number of drives required can be 0 if the file is on disk, at this point that request doesn't need to wait for tapes so can be sent at any time and that can get different scheduling again.. I don't think that'll be a problem, but I don't want to handle too much too fast, iterative steps that are simpler to implement is better.

Ultimately the problem is that phobos doesn't have a full view of the requests, so when I'm saying "implement in coordinatool" I really just mean that this should be done at coordinatool level where we know all the pending requests, but this can be hidden in a single phobos_locate() call; we just need to get more infos from it. If we can get phobos to drive the coordinatool scheduling more smartly then we don't need to leak too much implementation details.

Anyway, reads aren't a priority right now (all data is on lustre disk), I think we can discuss this again later after writes are done.

position on tape

right so LTFS hides that info as I expected, we can probably remember the order in which the files were written to the tape as an approximation but that is probably not worth worrying about right now anyway; also storing this for later. (Would be interested to compare how fast you can read e.g. 10 sequential files vs. same 10 files out of order on the tape first, e.g. just mount it and time cat > /dev/null in written order vs random order for a few randoms.. I don't have real hardware so the VTL isn't representative here)

If you need implement this in grouped_read, this doable. The current algorithm processes all the requests on one tape before picking the next one. It also inserts new requests in the current queue. So if you keep receiving requests for this tape, it will not be unloaded until you have received all of them.

I'll also check that before doing any work on reads; thanks.

courrierg commented 5 months ago

I think that is fine in practice as long as batches are big enough and that not all tapes are reused, e.g. it's not p1, p2, p1 but p1 x100 then p2 x100 when there is no more p1 requests available. So at the edge it's possible a p1 tape will be umounted for a p2 tape while some writes to p1 still wait, but hopefully we can get phobos to wait for the last few p1 writes and send them to the mounted drives? (hm, I guess that's not implemented so we have no such guarantee right now, coordinatool could optimize this by waiting until only n (drives available for archive on that mover) requests are in progress on the mover before sending requests for the next batch but I think that's micro-optimizing too much and in the raid5 case won't work anyway (don't know how many drives are used); so it's something phobos will have to be able to handle if that becomes a problem: it should know that there are still requests that could be fulfilled by currently mounted tapes so should prioritize sending to the currently mounted tape imo)

Reading your answer makes me realize that what I said is not entirely accurate. If you have tape A in drive 1 and tape B in drive 2, writes with tags for tape A will try to go to drive 1 (same for tape B). What can happen is when drive 2 finishes its current I/O, a new request will be scheduled on it. FIFO does not guarantee that the request will be for tags of tape B. It could be trying to schedule a write for tags of tape A, see that tape A is busy, see that drive 2 is not. Unload tape B from drive 2 then reload a new tape with the same tags as tape A. But the issue I mentioned still exists. We probably need to enforce a real FIFO at some point. Or have a better I/O scheduler for writes.

Another problem is the start of the batches; in practice I think you'll get the 100 p1 requests before you get any p2 requests, but if the policy runs start exactly at the same time you have few enough requests that the coordinatool might send them all (because there are enough movers available); in this case we might have something like send 3 p1 archives, 6 p2 archives, 9 p1 archive and finally by then we'd have enough requests pooled and a big batch of p2.. That will cause a few extra mounts, but I don't think that'll happen often in practice.

I definitely think we should implement something smarter at Phobos' level. These issues should not exist. You should be able to say "I know I have 4 drives on this copytool, I should be able to send 4 different sets of tags to this copytool". Or something like that. But given that Phobos doesn't do it yet, your solution seem to be a good workaround.

* ideally I'd even go further and try to regroup as much as possible (e.g. per directory on lustre), as restores will likely be grouped by directory as well

There is a feature that we want to implement in Phobos to support this. We call it "grouped put" for now. The idea is to be able to do exactly this. Group related objects on the smallest amount of tapes possible. How this will be implemented is still not clear yet. And how will the coordinatool integrate with this either.

I am talking about the new archive_on_hosts tag=n0 mover1 setting in the coordinatool -- mover1 should only take requests for tag=n0, so if the first 4 lists are p1,n1 / p3,n4 / p2,n0 / p1,n0 then when we schedule work for mover1 it should skip the first two n1/n4 lists and take work from p2,n0 first. I worded it as a search for each requests but in order to avoid splitting a project needlessly I think a mover will take the whole list, e.g. once mover1 has started taking work from p2,n0 then other movers will not be able to take work from it. Right now stanford only has one mover per n[0-3] anyway, but even if there are more I think we'll want to prioritize keeping all p1 data on the same few tapes rather than spread them on multiple tapes (through multiple movers -- depending on the phobos algorithm that might also cause tape stealing from one mover to the other as well if tape criteria make a mover think that tape is better...); but that is a detail.

This seems like a good strategy. For the tape stealing, Phobos will not swap tapes with another daemon as long as the tape is locked by the daemon (i.e. as long as the tape is loaded in a drive). What can happen though is that when copytool 1 decides to unload tape A, copytool 2 can choose this tape and reload it. The fact that copytool 2 chooses tape A over another one doesn't have a performance cost. If copytools favor tapes with the most size available, this won't happen in practice. Although this strategy has other issues. But it's as long to load tape A as any other tape for copytool 2. What would be better is that the coordinatool never sends requests that could be done on tape A to copytool 2. But if we have several movers for the same tags as you mentioned, this will not be possible. So this issue will likely always be there no matter how smart Phobos and the coordinatool are. And in practice, phobosd should only unload a tape when it is full. So the tape won't be chosen again by another copytool.

* we cannot guess parity (single/raid1/raid5) for a given tag, so cannot know how many queues a mover can handle; we want to avoid sending more work than can be writed in parallel to avoid remounts; so the locate call will need to return how many movers are consumed as well.

locate can return this kind of information but I feel like this is the wrong way to go. In the long run, the coordinatool should not have to worry about remounts. I understand that you want a solution that works now. But I don't think it is worth developping a locate at put which would return the number of drives involved. Ideally, what the coordinatool needs is an indication of the amount of work each copytool has. Then decide where to send the next request based on this. This will of course depend on the number of drives per copytool.

* the queues index will just change from the lustre hint (tag=risk,tag=proj,tag=n) to some selected tape ID returned by the locate (for raid1/raid5 I assume the tapes are grouped? so that can be a leader tape ID or something)

Not yet, but this is planned. For now, layouts can choose any tapes they want for an I/O. This will change (soon ?).

courrierg commented 5 months ago

Ah, so this isn't so much a locate as a real scheduling call, that is good.

Good point, we should probably rename it when we implement the put version. :)

In this case I agree phobos knows best, so the coordinatool implementation should be made simpler, but we still need some grouping imo; just with much shorter batch size than writes. The problem is that with restores if you have more requests on more tapes than currently processable, you'll still have to pick a mover multiple times, so we might end up with a queue like mover1: [file@tape1, file@tape2, file@tape3, file@tape1] -- even if the mover has been selected by locate we won't know about the file@tape1 hint and we'll cause that tape to umount/remount.

I'm not sure I understand what you are trying to say here. If you send all the restores you have to the copytool returned by locate, you should not have any useless mount/unmount. Unless of course if a restore arrives right after a tape is unloaded but this issue will always happen even at the coordinatool level. Now if the coordinatool tries to hold onto some requests to avoid overloading the copytool, then yes you might have issues with unnecessary mount/umount because this is not how the locate was designed. Now if that's really what you want to do, why not but the locate will need rework. In practice, I think the current locate + grouped read should prevent most unnecessary load/unload.

Ultimately the problem is that phobos doesn't have a full view of the requests, so when I'm saying "implement in coordinatool" I really just mean that this should be done at coordinatool level where we know all the pending requests, but this can be hidden in a single phobos_locate() call; we just need to get more infos from it. If we can get phobos to drive the coordinatool scheduling more smartly then we don't need to leak too much implementation details.

Yes, I agree. My current understanding is that you need what is available for reads also for writes (i.e. locate on tags + grouped write per tags).

right so LTFS hides that info as I expected, we can probably remember the order in which the files were written to the tape as an approximation but that is probably not worth worrying about right now anyway; also storing this for later. (Would be interested to compare how fast you can read e.g. 10 sequential files vs. same 10 files out of order on the tape first, e.g. just mount it and time cat > /dev/null in written order vs random order for a few randoms.. I don't have real hardware so the VTL isn't representative here)

My understanding of the literature is that sequential reads is not optimal and can even be near the worst case because of the way data is stored physically on the tape. I would guess that random and sequential are not that far apart. We also plan on working on this at some point but this is not a high priority for us right now.

martinetd commented 5 months ago

locate can return this kind of information but I feel like this is the wrong way to go. In the long run, the coordinatool should not have to worry about remounts. I understand that you want a solution that works now. But I don't think it is worth developping a locate at put which would return the number of drives involved. Ideally, what the coordinatool needs is an indication of the amount of work each copytool has. Then decide where to send the next request based on this. This will of course depend on the number of drives per copytool.

I think we're in agreement here (there is no need to keep separating the two), if coordinatool properly calls into phobos it's equivalent.

Long term, the coordinatool should call into phobos for scheduling; we don't want to configure everything (policies for each tag, number of drive on each mover, etc) -- what I've done with archive_on_hosts is just a first step because it was quick to do, and the first step I've suggested here (just group by tag) is basically a no-op scheduler, but I agree we want to call into phobos everytime and let it do its job. If phobos is called early it has access to the same informations as coordinatool wrt pending requests, so it can make the appropriate decisions of sending this put to these tapes on this mover early in coordinatool, and coordinatool/movers "just" have to abide by this.

Ultimately you're correct that coordinatool should not have to worry about the number of drives per mover/tapes per archive, I suggested that as the most straight-forward mapping for coordinatools' internal scheduling in the way I picture it, but just selecting the client directly at phobos level might be enough if the "phobos_schedule()" call can also say "don't schedule this request yet" and is also called again regularly on these tapes... The problem I see is that phobos has no way to call back into coordinatool to say "okay, now you can schedule this batch" next" at this point, so part of the scheduling currently has to be internal to coordinatool, but perhaps a few more hooks can work as well to avoid the double work.... Something akin to passing the request to an opaque phobos queue, and phobos deciding when to take it out of it into a mover when a mover becomes free? e.g.

request comes in, first phobos_schedule_new(req) call either send to mover directly if it is waiting or the phobos queue
a mover asks for work, call phobos_schedule_next(mover)

Note this is different from enrich and having mover queues like we do for restores: having a mover queue is not flexible enough if you have multiple tags, as you could overflow a mover with requests for a single tag while having other tags in the backlog. phobos would need to be called everytime, and would have to ensure at least two requests for each tag are consistently scheduled to avoid down time between done and next recv.

Also note that if we do this kind of scheduling we don't even need to separate archive vs restore, this archive/restore QoS can and should also be done at phobos level so that when a mover asks for more work after doing an archive phobos could decide if it is more appropriate to handle a recv request or keep doing archives more smoothly.

If that sounds ok for you, I'll try to see if I can make the "one queue per lustre hint" v0 work match this API so we can plug in phobos directly later; internally this can be done exactly as I was describing previously, just hidden neatly behind scheduling calls and separate list nodes in structs a bit.

the queues index will just change from the lustre hint (tag=risk,tag=proj,tag=n) to some selected tape ID returned by the locate (for raid1/raid5 I assume the tapes are grouped? so that can be a leader tape ID or something)

Not yet, but this is planned. For now, layouts can choose any tapes they want for an I/O. This will change (soon ?).

That's very good to know, I think it's definitely better to group at tape level to ensure restores are grouped correctly (wouldn't want to have to mount 3 different tapes to restore files that were sent in a raid1 batch); Stanford doesn't use raid1/5 so this isn't an immediate problem here, and I'll leave this up to you before someone starts using these in production.

(Also not replying to the restore part yet, but I've read it; I want to focus on archives short term & in this issue. I'll open a new issue for restore to continue discussion once archives have taken shape)

martinetd commented 4 months ago

Sorry for the delay, I was thinking about what kind of hooks we need/want and how to manage the scheduling. It's not as obvious as I thought, but can probably work out... I think?

long post warning, mostly to organize my own thoughts; feel free to skip straight to the end.

current behaviour

Full recap of current queuing/scheduling

Here's what we have currently for queuing

request comes in (hsm_action_enqueue() from lustre or hsm_action_enqueue_json() from redis on recovery)
hsm_action_enqueue_common() calls hsm_action_node_enrich(), which will add the request either directly to a client queue or the global queue a. if mappings are set, apply mapping to archive request b. otherwise phobos_enrich(): calls phobos_locate() with oid... This tickles down to layout_raid1_locate() to pick the best host to use (unrelated note: raid5 has no locate and layout_locate will segfault if called?)

Scheduling is done on these conditions:

when a client disconnects, we wait for a given grace period and if it didn't reconnect by then free it. At that point, shove all its requests to the global queue and schedule all waiting clients
when a batch of requests came from lustre, after the above enrich schedule all waiting clients
when a given client request work (recv), try to schedule it immediately
when a client finished some work (done), if it was in recv state try to schedule something

The scheduling itself does:

in order restore, remove, archive & within the limits of actions we can send (max count per category / max bytes requested)
for all actions pending in client's queue, then all actions in global queue
schedule_can_send() checks phobos_locate() again and checks the best host to use is still the same or unset
- if it changed the request is moved at the back of that new host's client queue or global queue if not found
enqueue & repeat

data structures

as far as scheduling goes, we have:

state (available everywhere) has:

global queues (queues.waiting_{restore,remove,archive})
a list of waiting_clients (pending recv reply)
- (also all clients in stats.clients + stats.disconnected_clients)

each client also has:

client queues (queues.waiting_{restore,remove,archive})
list of actions currently being processed on it (active_requests)

each hsm request (hsm_action_node) has some details used in case of cancel and embedded list node, but nothing really related to scheduling.

current hooks

as listed above we only have hsm_action_node_enrich(hsm_action_node) at enqueue time and immediately when before a request is sent, and schedule_can_send(client, hsm_action_node) after re-enriching/before sending.

forward

problems

The current approach has a few limitations (we're not addressing everything here):

the scheduling within a client is 100% linear, so if e.g. we're processing two tapes on a client we can't advance requests as they are processed for each tape easily (in practice we could skip requests on busy/unrelated tapes through schedule_can_send() systematically, but if e.g. phobos wants to wait it'll have to refuse all currently pending requests every time work is scheduled on this client)
this is basically the same as above, but enriching has nowhere to store custom data that it might use to pull suitable requests more efficiently, and is limited to using the existing clients queues.
there's no hook on done, so the hooks don't know when a request is done (this is required to e.g. balance requests between two tapes, new requests have to take into account the work that finished. For phobos that info can be obtained through the phobos locks and can_send hook, so this is probably not needed in our specific case)
phobos isn't called when a client died -- this is probably fine in practice as I'm not sure it could dismount tapes from a dead mover anyway, so these would have to wait?
- if we reschedule these requests somewhere else they'll just be stuck if no raid? (@thiell -- minio built-in raid5 can very likely handle failed/corrupt files, but how does it handle a read stuck in restore?)
- when a client is freed we probably should enrich the requests again instead of shoving to global queue; that might be enough.
we can't react to external events e.g. another client finish work on a tape we needed for raid5 somewhere else; this is probably ok and will be better addressed by making raid5 groups of tapes fixed in which case this need should never arise

plan

Given the above I don't think it's a good idea to try to implement some 100% custom scheduling, e.g. adding opaque data to structs and making hooks manage it is much more complexity than it's worth.

1. simple batching

For batching we don't need to modify the queues as we'll keep it simple:

clients struct get added a batch string that'll be the tag + deadline
when enriching archive requests, only requests that match the tag get queued into clients queue.
if deadline expires, unset tag, shove all client's requests at the end of the global queue and pick a new tag from global queue moving all these requests to the client
if the client's queue becomes empty on done, pick a new tag & schedule global requests as above
if client's queue is empty when enriching, allow setting new tag from enriched value.
do we want to forbid two clients having the same reserved tag? I think it's fine either way; if we forbid tag reuse we can get rid of the map to host code, if we keep the map to host code we'll basically get this for free in Stanford's context, so ultimately it depends on whether we want to remove the archive_on_host setting. Having a 0-conf setting (well, deadline and acceptable number of duplicates generic settings instead of explicit mapping) is probably worth the check.

That should be straightforward enough to implement quickly and won't be too intrusive in the code -- in particular this is a bit less efficient than the "one queue per lustre hint" I had suggested in my previous comment, but it's less code and I don't think that'll be a problem in practice given we'll only need to go through the whole lists if no other work is happening or once per deadline, so it should be acceptable even if the lists grow very big.

2. phobos scheduling for archives

We talked about a phobos_locate() for put; I still think it makes sense in theory.. But ultimately I don't see the benefit. Do we get something from phobos being able to schedule puts at this point?

If we get path on lustre as part of the enrichment and start more complex "grouping by directories" it might be worth it, as phobos will be able to get back the tape that was used for that directory in an earlier batch that just coordinatool won't be able to do, but right now I don't see any benefit. Even if we do grouping by directory, I think the grouping by tag can still be left to coordinatool to avoid db round-trips in a first step.

So timeline would be something like:

in phobos db, add a hash of file directory in db or something so we can get requests or tapes per directory?
in coordinatool enrich each requests with its lustre path first (ignore hardlinks? alpha sort and always pick the fist one? Actually don't need to get the whole path, just the parent directory fid is enough?)
add a phobos_locate for put that takes a given tag+directory+file size and returns the best tape that already has data for it + pick a mover for it. If no tape already had data in this directory just keep value empty.
in coordinatool, after checking tag and before enqueuing to client queue try to call that and shove back to global queue if one of the movers with that tag didn't get returned (tag gets priority)
alternatively if phobos pre-assigns a tape for put immediately, I still think my idea of one queue per assigned tape (or tape group when that happens for raid5, or whatever grouping id we want phobos to return) makes sense. In that case the tag filtering stops being useful and we can let the "can send" hook decide if a mover can take on work from another queue or not (I really don't like that we're calling this for every pending request for every client for every scheduling call; having this called once per request group is much more reasonable). Having written tape here I don't think tape is accurate, what we want here is just an assurance phobos will group all these requests later, so it can really be an arbitrary id -- in particular we don't need to worry about space left in tape candidate for put, it doesn't matter if it doesn't fit and phobos needs to mount a new tape, the following requests must just go to that newly mounted tape.

3. improvement of phobos scheduling for restores

[off topic] I had misunderstood https://github.com/phobos-storage/phobos/issues/10#issuecomment-2166280345 's explanation of phobos_locate() to be what it currently does, but it currently doesn't reserve a tape/mover if the tape isn't mounted anywhere (layout_raid1_locate() is way too complicated...); so I think my comment still stands - we're not considering drive-by mounts, e.g. this scenario:

plenty of restores come, nothing was mounted so initial enrich finds no mover
first restore is scheduled, tape mounted
second restore is considered, it also wasn't mounted so it's also accepted to mover
meanwhile third+ restore could have gone on same tape as first restore but get processed later

With the current code we'd basically need the same as the "when we pick a new tag go through all global queue requests" of the simple batching for archives. Not having thought through this at all I think that might be good enough, and that sounds better to me than allocating a mover early on during initial request enqueue - we should keep 'free' requests in the global queue for as long as possible so they can get scheduled to the first available mover ASAP.

Like puts, having restores grouped by tape group would allow more efficient can send checks, so returning that info instead of a host name for the first enrich is probably better. The later enrich can and probably should stay to check tape wasn't mounted elsewhere meanwhile for some reason, it's less work to check all the time than make a mistake.

[/off topic]

Anyway this point is far off the road, I'm just dumping my thought for myself later, let's rediscuss restore improvements later.

So tl;dr:

unless any good reason comes forward I'll start impemented batch archive per tags as above without multi-queue, as simple as possible.
- let's take some more time to discuss 2
- you said "We also plan at some point to have the possibility to group puts on a tape" - is there anything concrete yet? would something like what I described just using the parent directory work?
- pre-reserving tapes on put (or whatever group id) is probably orthogonal enough, but it'd likely be less work to do both together before doing coordinatool scheduling around it, so if that makes sense we should hash this out a bit
- sorry to say that after having written about restores here, but let's keep ignoring restores a while longer. One step at a time. Feel free to comment, but I'd rather not spend too much time on details about it right now, my opinion will likely change again after having done the archive work.