phobos-storage / phobos

This repository holds the source code for Phobos, a Parallel Heterogeneous Object Store.
GNU Lesser General Public License v2.1

RFC: fair_share_max per tag #10

Open thiell opened 6 months ago

thiell commented 6 months ago

I believe we would greatly benefit from having a limit on the number of (local) drives that can be used, but per specific tag.

Similarly to:

# Maximum number of LTO9 drives for read, write and formats respectively
fair_share_LTO9_max = 4,4,4

...but by tag :)

Note: for fair_share by tag, format would not be relevant, only drives for read and write.

Our use case is as follows: we have tapes with different/disjoint tags (for example: n0, n1, n2, n3), and we are archiving files to all of them at the same time. We would like to have something like:

# fair_share_max[<tag>] = <max_drives_read>,<max_drives_write>
fair_share_max[n0] = 1,1
fair_share_max[n1] = 1,1
fair_share_max[n2] = 1,1
fair_share_max[n3] = 1,1

So that, for any mounted tape containing tag "n0", 3 other drives would remain available for the other tags. What do you think?

courrierg commented 6 months ago

Hello @thiell,

This sounds like an interesting feature. It also looks like something quite complex to implement. The fair share was implemented with 3 groups (read, write, format). What you are asking for is to extend this algorithm to an arbitrary number of groups (i.e. tags). This means that the current implementation of the fair share would need a lot of rework.

One of the things that may be a bit complex is the relationship between the fair share algorithm and the I/O schedulers. For the current fair share algorithm, we have a one-to-one mapping between the groups of requests and the schedulers. With your use case, this is a bit more complex. The issue is that the current I/O schedulers don't really care which drive gets the request. The fair share algorithm splits the drives between the schedulers, and then the schedulers are free to use their drives as they want.

This means that your use case would require a new fair share inside each I/O scheduler, or something along those lines.

Could you describe the use case for this? Do you want:

The other drawback of having a QoS before the current fair share is that we may miss some optimization opportunities in the grouped read algorithm. If a read is not sent to the scheduler because the user has reached their quota, we may have to do an additional reload, which is arguably worse in terms of performance. So the QoS could actually have a non-negligible impact on the overall performance.

Since this feature seems quite complex as you described it, it would be interesting to know the actual use case to see if another simpler solution exists.

One last question, would you still use the current fair share algorithm with this new one? Or just the new one?

thiell commented 6 months ago

Hi @courrierg!

Thanks for your answer. I think all 3 points that you described could be valid in practice! :)

Now, I've done more testing and realized something that is now our main issue: there is no phobos_locate() done by the coordinatool for archive requests. This is very problematic for us with 4 data movers and 16 drives, because the number of tag combinations we use (up to 16, but more likely 8 in practice) is higher than the number of drives per data mover (4). When we archive using different tags, the "tagged" hsm archive requests are randomly sent to a data mover by the coordinatool, which may or may not have a tape loaded matching the specified tag list, triggering a lot of tape movements.

I'm working with @martinetd to see how best we could add a locate feature by tags in the coordinatool. What we would like to achieve is this: if a tape is already mounted/in use in a drive by phobos, and its tags match the tags of an archive request, I would like the corresponding data mover (that has the drive and tape) to be used for this archive request.

I imagine this might not be how all phobos sites want to operate (if you have fewer tags, it might be best to use the maximum number of tapes in parallel, even with the same tags). But in my case, I want to minimize tape movement and regroup the writes with the same tag on the already mounted tape as much as possible. Perhaps we could do (1) locate-by-tags-on-archive + (2) fair share per tag at the coordinatool level.

And yes, I think the current fair share algorithm by tape model should be kept (basically it's already implementing some association between tapes and drives, the "tag" here being the model, e.g. "LTO9"). In any case, I would still use that.

martinetd commented 6 months ago

Yes, I think the coordinatool could easily get the list of mounted tapes/tags to check whether a mover already has a valid candidate mounted. But if you want a fair share we'll need to do this at the phobos level too: if you send a request to a copytool it'll probably try to use any free drive that is locally accessible, so it'll probably mount more tapes if there are idle drives even when a candidate is already mounted (the copytool sees all the drives on the mover? I need to finish setting up the full Lustre/HSM chain to confirm this...)

So we need both:

* coordinatool needs to query phobos on archive too, not just restore, to get any candidate that's already mounted (it doesn't have to be 1, could round-robin between n candidates and pick a free drive if there is less than n candidates mounted)
* eventually need a way to tell the phobos copytool to wait for a specific drive, and not try to remount another tape that could be used for something else shortly afterwards

As a first approximation though, if you have enough nodes there are quick & ugly ways of doing this, e.g. dedicating a node to a tag for archives; that'll probably greatly limit the number of mounts/unmounts you see.

thiell commented 6 months ago

To provide more context, we use Robinhood v3, which runs an archive policy and uses tags extracted from the action_params of the target_fileclass, so our tags depend on the path of the file on the filesystem (we have standardized this part for our needs: risk classification, projects and minio erasure coding stripe index). Example below. I am not sure exactly how I could dedicate a data mover node to a tag, unless I make use of archive_id perhaps, and run multiple coordinatools (one per archive_id), each targeting a specific data mover. 🤯 I've also been thinking of running smaller hsm archive policies at the robinhood level, targeting fewer tags at the same time, to avoid tape movement, but that also has some other issues. That's why I think we definitely need a configurable locate by tag for archive.

FileClass mr {
    definition { tree == "/elm/*/mr" }
}
...
FileClass p-test {
    definition { tree == "/elm/*/*/projects/test" }
}
...
FileClass minio_n0 {
    definition { tree == "/elm/*/*/*/*/minio/n0" }
}
...
FileClass mr_test_minio_n0 {
    definition { mr inter p-test inter minio_n0 }
    lhsm_archive_action_params {
        risk = mr;                                           <<< tag passed to phobos when archiving
        project = p-test;                                    <<< tag passed to phobos when archiving
        minio_n = n0;                                        <<< tag passed to phobos when archiving
    }
}

This is our Robinhood v3 HSM archive policy rule, with up to 16 combinations of tags:

lhsm_archive_rules {
    ignore_fileclass = system;
    # Do not archive MinIO system/temporary files
    ignore_fileclass = miniosys;

    # Common archive rule for minio files
    rule archive_minio {
        #
        # Policy rule that targets specific classes - they must be disjoint!
        # Matching action params will be automatically resolved and passed
        # to the action command as Phobos tags.
        # IMPORTANT: make sure the tags are pre-defined in Phobos!
        # 
        # mr/p-test
        target_fileclass = mr_test_minio_n0;
        target_fileclass = mr_test_minio_n1;
        target_fileclass = mr_test_minio_n2;
        target_fileclass = mr_test_minio_n3;
        # mr/p-test2
        target_fileclass = mr_test2_minio_n0;
        target_fileclass = mr_test2_minio_n1;
        target_fileclass = mr_test2_minio_n2;
        target_fileclass = mr_test2_minio_n3;
        # mr/hnc
        target_fileclass = mr_hnc_minio_n0;
        target_fileclass = mr_hnc_minio_n1;
        target_fileclass = mr_hnc_minio_n2;
        target_fileclass = mr_hnc_minio_n3;
        # mr/campus
        target_fileclass = mr_campus_minio_n0;
        target_fileclass = mr_campus_minio_n1;
        target_fileclass = mr_campus_minio_n2;
        target_fileclass = mr_campus_minio_n3;

        # Archive to Phobos with tags
        action = cmd("lfs hsm_archive --archive {archive_id} --data 'tag={risk},tag={project},tag={minio_n}' {fullpath}");

        # Common policy rule condition
        condition { size > 0 and last_mod >= 12h }
    }
}
martinetd commented 6 months ago

unless I make use of archive_id perhaps, and run multiple coordinatools (one per archive_id), each targeting a specific data mover

coordinatool accepts all archive ids by default, so this actually isn't that complicated (would need to start the final phobos copytools with one archive id per class and have robinhood target that), but that'd also make the mover restricted for restores and give less flexibility for the future (afaiu, once a file is stored with an archive id, the restores must use it)

I was thinking of just adding a map describing "if keyword x is found in the --data string then use mover y" for archival; that's very easy to implement and doesn't require talking to phobos at all.
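Rough sketch of what I mean, purely illustrative (this map format and pick_mover() don't exist anywhere, the names are made up):

# Hypothetical "--data keyword -> mover" map for the coordinatool
ARCHIVE_HINT_TO_MOVER = {
    "tag=n0": "mover1",
    "tag=n1": "mover2",
    "tag=n2": "mover3",
    "tag=n3": "mover4",
}

def pick_mover(hsm_data):
    """Return the mover dedicated to the first keyword found in the
    'lfs hsm_archive --data' string, or None to fall back to normal scheduling."""
    for keyword, mover in ARCHIVE_HINT_TO_MOVER.items():
        if keyword in hsm_data:
            return mover
    return None

# pick_mover("tag=mr,tag=p-test,tag=n0") -> "mover1"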

But once again, asking phobos where something using tag x is already mounted probably isn't that difficult; I think if we just implement it that way we'll end up in a similar situation. I'll check a bit more during this week.

courrierg commented 6 months ago

Yes I think the coordinatool could easily get the list of mounted tapes/tag to check if a mover already has a valid candidate mounted;

This is what phobos_locate does for gets. We planned on adding this for the puts as well.

but if you want a fair share we'll need to do this at phobos level too as I think that if you send a request to a copytool it'll probably try to use any free drive that is locally accessible, so it'll probably mount more tapes if there are idle drives even if a candidate is already mounted (the copytool sees all the drives on the mover?

Yes, this is true. The only I/O scheduler for writes is FIFO which tries to use all the drives available. If you want other scheduling strategies, we need a new I/O scheduler for writes. We also planned on having a better scheduler for writes. We should definitely discuss what your needs are.

So we need both:

* coordinatool needs to query phobos on archive too, not just restore, to get any candidate that's already mounted (it doesn't have to be 1, could round-robin between n candidates and pick a free drive if there is less than n candidates mounted)

We can easily extend phobos_locate for puts and have it return a list of hosts instead of just one. Then the coordinatool can pick the best one.

* eventually need a way to tell the phobos copytool to wait for a specific drive, and not try to remount another tape that could be used for something else shortly afterwards

This would be a new I/O scheduler.

The need for a tag aware locate at put and a new I/O scheduler for writes is something that is on our roadmap. This is not the first item of the roadmap though. :)

martinetd commented 5 months ago

Some thinking out loud on this topic.

First, current requirements for Stéphane if I understood them correctly:

All in all, that's a lot of different populations; it's not surprising that sending requests at random generates a lot of tape mounts.

The current approach right now is:

With this, tape movements have been reduced nicely, but the policy scheduling is completely manual and will not scale. As soon as multiple policies overlap, files for different projects are archived and phobos generates a lot of tape movement, which was the reason for this issue (the original idea of limiting the number of drives per project, so that if there is a batch of requests for a given project all drives won't be remounted for that project). However, that idea means that if there is only a single project being archived most drives will stay idle, which isn't ideal either...

After discussing with Stéphane we think that can be better addressed by a higher level scheduling on the coordinatool: if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Given the other requirement for n[0-3], that means a two-level scheduling:

* archives get queued in a different list per full hint (`tag=riskA,tag=projectX,tag=n[0-3]`)
* the queues themselves are in an ordered list
* when a copytool requests work, it picks work from the first n[0-3] matching queue in that list
  * the first time work is picked up on that queue, a timestamp is recorded
  * after batch_duration, that queue is put back at the end of the list, and work can continue on the next list
  * if new work comes in from lustre with a new tag, that also gets queued down at the end of the list, ensuring older work gets processed first as long as the work fits within batches (at first we discussed "deadlines", but there is no guarantee that there is enough bandwidth to process all requests in time before their deadline, and starting to mix things up will just make overall transfers slower -- the admin must ensure hsm.active_request_timeout is big enough to go through a few cycles of batches so requests get a few chances)

Doing something like that means that even if policies overlap, the mount operations will only happen when a batch times out, so phobos can spend more time writing. What do you think?
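To make the rotation above concrete, here is a minimal sketch (hypothetical code, not the coordinatool's; the structures and the batch_duration handling are invented):

import time
from collections import OrderedDict, deque

BATCH_DURATION = 3600  # seconds per batch, illustrative value

queues = OrderedDict()  # full hint ("tag=riskA,tag=projectX,tag=n0") -> deque of requests
batch_start = {}        # hint -> timestamp of the first request picked from that queue

def enqueue(hint, request):
    # new hints land at the end of the ordered list, so older work is served first
    queues.setdefault(hint, deque()).append(request)

def next_request():
    """Serve the first queue in the list; once a queue has been served for
    longer than BATCH_DURATION, rotate it to the end and move on."""
    for hint in list(queues):
        if not queues[hint]:
            continue
        started = batch_start.setdefault(hint, time.time())
        if time.time() - started > BATCH_DURATION:
            queues.move_to_end(hint)  # batch expired, back to the end of the list
            del batch_start[hint]
            continue
        return hint, queues[hint].popleft()
    return None

A copytool asking for work would just call next_request(), so phobos only ever sees long runs of identical hints instead of interleaved projects.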

Meanwhile, for restore we're also not doing enough: we're currently doing a phobos locate when a request comes in from lustre (if the tape is already mounted, send this request in priority there) and when we're scheduling a request to an agent (if it's mounted elsewhere don't send the request there but queue it in priority for that other agent), but that has two problems:

* even if it's a priority list there's no guarantee the tape will still be mounted by the time it's sent
* if a tape is mounted after the request has already been queued, the coordinatool won't know about it and these requests will stay scattered

To solve these problems I think the coordinatool shouldn't focus on movers, but sort requests by tape, so my current idea is pretty similar to the archive scheduler: more queues! one per tape!

* queue requests per tape when they come in, the tape queues are ordered like for archives. Ideally the requests within a queue are sorted by position on the tape? Does that still make sense with current technos (and do we have the position on LTFS in phobos?)
* when we start restoring from a tape assign it to a mover, mark the start timestamp and keep picking from the same n tapes (this must be configured: if we send only requests for a single tape then the mover won't be able to do other restores. Ideally we could ensure tags are different for QoS on restores, so a single project won't hog all drives, and given how minio works we also probably want to say that if tape for riskA,projectX,n0 is being mounted we want to priorize other riskA,projectX,n[1-3] tapes? That is probably overthinking? Will ignore that for now.)
* After batch duration, move tape back; ideally the batch duration would be such that we never need to do that.
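Sketch of the per-tape side, with the same disclaimer (hypothetical code, nothing real; the knobs are invented):

import time
from collections import OrderedDict, deque

RESTORE_BATCH_DURATION = 600  # much shorter batches than for archives
TAPES_PER_MOVER = 2           # the "n" tapes a mover keeps pulling from

tape_queues = OrderedDict()   # tape id -> deque of restore requests, ordered like archives
assignment = {}               # tape id -> (mover, timestamp of first pick)

def enqueue_restore(tape, request):
    tape_queues.setdefault(tape, deque()).append(request)

def restore_work_for(mover):
    now = time.time()
    # release tapes whose batch expired; ideally the duration is large enough
    # that this never triggers
    for tape, (owner, start) in list(assignment.items()):
        if owner == mover and now - start > RESTORE_BATCH_DURATION:
            tape_queues.move_to_end(tape)
            del assignment[tape]
    mine = [t for t, (owner, _) in assignment.items() if owner == mover]
    # claim tape queues from the front of the list while there is room
    for tape in list(tape_queues):
        if len(mine) >= TAPES_PER_MOVER:
            break
        if tape not in assignment and tape_queues[tape]:
            assignment[tape] = (mover, now)
            mine.append(tape)
    for tape in mine:
        if tape_queues[tape]:
            return tape, tape_queues[tape].popleft()
    return None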

Ultimately there isn't much change needed in phobos here; for archives we have all we need directly from lustre so it's not going to be phobos dependent. For restores we'll need to get more details about each archive during the enriching phase; I'll check how much info we can get.

I'll probably be working with Stéphane on some of that over the next couple of months; feedback is welcome before I get to it :)

courrierg commented 5 months ago

After discussing with Stéphane we think that can be better addressed by a higher level scheduling on the coordinatool

One important thing to note is that Phobos' FIFO scheduler is not strictly FIFO. If a request cannot be handled right now, it will try the next one, then the next... So if you send two puts on tag p1 then one on tag p2, it is possible that Phobos will still do p1, p2 then p1 which would trigger a useless load/unload. This can easily be fixed with a flag in the configuration for example. But I think if you focus only on the coordinatool for the scheduling, you might have some issues with the current behavior of Phobos (which can be fixed of course).
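Roughly, the current behavior is equivalent to this (simplified illustration only, not the actual scheduler code):

def next_schedulable(pending, can_handle_now):
    # "Relaxed" FIFO: walk the queue in order but skip any request whose
    # resources (e.g. a tape matching its tags) are busy right now.
    for req in list(pending):
        if can_handle_now(req):
            pending.remove(req)
            return req
    return None

# With pending = [put_p1_a, put_p1_b, put_p2] and all p1 tapes currently busy,
# put_p2 is scheduled before put_p1_b, which can cost an extra load/unload.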

if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Do you mean: if I have 3 puts on tag p1 and 2 on p2, the coordinatool sends the 3 puts, waits for them to finish, then sends the 2 for p2 and waits for them to finish...? Or do you simply mean that you order the puts by project (and other tags)?

Given the other requirement for n[0-3], that means a two-level scheduling:

* archives get queued in a different list per full hint (`tag=riskA,tag=projectX,tag=n[0-3]`)

So each set of tags would mean a different list right? Not just a list per project for example. Meaning riskA,projectX,n0 riskB,projectX,n0, riskA,projectY,n0 and risk1,projectX,n1 would all be on different lists?

* the queues themselves are in an ordered list

* when a copytool requests work, it picks work from the first n[0-3] matching queue in that list

What do you mean by this? What is "the first n[0-3] matching queue"? Matching with what? The copytool?

  * the first time work is picked up on that queue, a timestamp is recorded
  * after batch_duration, that queue is put back at the end of the list, and work can continue on the next list
  * if new work comes in from lustre with a new tag, that also gets queued down at the end of the list, ensuring older work gets processed first as long as the work fits within batches (at first we discussed "deadlines", but there is no guarantee that there is enough bandwidth to process all requests in time before their deadline, and starting to mix things up will just make overall transfers slower -- the admin must ensure hsm.active_request_timeout is big enough to go through a few cycles of batches so requests get a few chances.

I think I understand this part and it makes sense to me.

courrierg commented 5 months ago

If you send a batch of puts to a copytool, you expect this batch to be stored on tapes that match the tags available. Since in Stéphane's configuration there are 4 drives per copytool, it means that your batch will be split roughly in 4 and written on 4 tapes using 4 drives (or less if there is mirroring involved). Depending on the use case, this may not be desirable. Spreading data across tapes for a single project increases the likelihood of having to load more tapes for restores. So in practice you may want to spread the load of the different projects on several drives instead to have a better data locality and reduce the work at restore. So if you have no mirroring for example, you could write files of 4 different projects on one copytool at the same time. If phobosd is able to do this, you just need a tag aware locate for archives in the coordinatool. You can simply send all the writes you want (using the answer from locate) to the copytools without worrying about unnecessary load/unload. We already do this for reads. Doing it for writes is (maybe) not that hard.

Meanwhile, for restore we're also not doing enough: we're currently doing a phobos locate when a requests comes in from lustre (if the tape is already mounted, send this request in priority there) and when we're scheduling a request to an agent (if it's mounted elsewhere don't send the request there but queue it in priority for that other agent), but that has two problems:

* even if it's a priority list there's no guarantee the tape will still be mounted by the time it's sent

* if a tape is mounted after the request has already been queued, the coordinatool won't know about it and these requests will stay scattered

A couple of things. This is not accurate. phobos_locate will actually choose a node to send the request to. The idea is this:

The only "bad" thing that can happen (unless I've missed something which is not unlikely), is that the copytool unloads a tape before the coordinatool is able to send a new restore on the same tape. But this issue can still happen with the solution you describe. If you build a queue of restores for tape A on the coordinatool, then decided to send them to the copytool and just after that you receive a new restore on tape A, you might miss the opportunity to not unload this tape. At this point, we are trying to predict the future. Might as well ask chat GPT. :) The grouped_read algorithm will manage the biggest queues first which means that the time window to send new restore requests grows with the number of restores on the same tape. So there is probably enough time before the tape is unloaded. We can also implement a timeout to wait before unloading a tape if needed/useful.

To solve these problems I think the coordinatool shouldn't focus on movers, but sort requests by tape, so my current idea is pretty similar to the archive scheduler: more queues! one per tape!

* queue requests per tape when they come in, the tape queues are ordered like for archives. Ideally the requests within a queue are sorted by position on the tape? Does that still make sense with current technos (and do we have the position on LTFS in phobos?)

The order of the requests within a queue should definitely be the responsibility of Phobos. You can use RAO on IBM drives on LTFS but the interface is not user friendly (you need to write the raw SCSI request in a file and give the path to that file in a special xattr, then the result is in another file...). Getting the position of a file on the tape through LTFS is a tough question though. :) I still don't know if it is possible. In theory it's not always possible since files could be fragmented, but Phobos makes sure this does not happen. So we might be able to find the positions. I know there are a few LTFS xattrs that can return some information. We could also query the drive's head position before or after a put but I don't know how accurate this would be. Also, sorting by tape offset is not optimal on current tapes. So it is not a good idea to sort them even if the information were there.

* when we start restoring from a tape assign it to a mover, mark the start timestamp and keep picking from the same n tapes (this must be configured: if we send only requests for a single tape then the mover won't be able to do other restores. Ideally we could ensure tags are different for QoS on restores, so a single project won't hog all drives, and given how minio works we also probably want to say that if tape for riskA,projectX,n0 is being mounted we want to priorize other riskA,projectX,n[1-3] tapes? That is probably overthinking? Will ignore that for now.)

* After batch duration, move tape back; ideally the batch duration would be such that we never need to do that.

If you need to implement this in grouped_read, it is doable. The current algorithm processes all the requests on one tape before picking the next one. It also inserts new requests in the current queue. So if you keep receiving requests for this tape, it will not be unloaded until you have received all of them.

courrierg commented 5 months ago

A final thought. Doing the scheduling logic outside of Phobos is risky. The relationship between an object and its tapes is managed by Phobos and depends on the layout. If you have mirroring for example, each object will have n > 1 tapes associated to it. So you need to handle this case in the coordinatool. Now you can also have splits in Phobos. Meaning that a big object can be split in two parts (or more in theory). All of this logic is managed by phobos_locate. Moving this externally is probably not a good idea as this will expose Phobos' internals outside of its API. At some point, Phobos will also support HSM features. An object could be on disk or on tapes. It could be moved while the coordinatool is doing its scheduling. Managing tags inside the coordinatool for archives is probably fine though. But in the long run, I think this is a risky path to take at least for reads.

I would advise against doing too much optimization for reads in the coordinatool. I think currently, Phobos does what you need. Maybe we need to add the "batch_timeout" feature. For writes, we can implement a scheduler that manages requests per tags to avoid unnecessary loads. We can probably also do the same reservation logic for tags as we did for tapes on restores. We also plan at some point to have the possibility to group puts on a tape. So you could say "all the files of this directory should be stored on the same tape".

martinetd commented 5 months ago

I shouldn't have sent archives / restores in the same post, this makes replies hard to follow :P Trying to reply to archives points in this reply, will reply to restores next.

One important thing to note is that Phobos' FIFO scheduler is not strictly FIFO. If a request cannot be handled right now, it will try the next one, then the next... So if you send two puts on tag p1 then one on tag p2, it is possible that Phobos will still do p1, p2 then p1 which would trigger a useless load/unload. This can easily be fixed with a flag in the configuration for example. But I think if you focus only on the coordinatool for the scheduling, you might have some issues with the current behavior of Phobos (which can be fixed of course).

I think that is fine in practice as long as batches are big enough and that not all tapes are reused, e.g. it's not p1, p2, p1 but p1 x100 then p2 x100 when there is no more p1 requests available. So at the edge it's possible a p1 tape will be umounted for a p2 tape while some writes to p1 still wait, but hopefully we can get phobos to wait for the last few p1 writes and send them to the mounted drives? (hm, I guess that's not implemented so we have no such guarantee right now, coordinatool could optimize this by waiting until only n (drives available for archive on that mover) requests are in progress on the mover before sending requests for the next batch but I think that's micro-optimizing too much and in the raid5 case won't work anyway (don't know how many drives are used); so it's something phobos will have to be able to handle if that becomes a problem: it should know that there are still requests that could be fulfilled by currently mounted tapes so should prioritize sending to the currently mounted tape imo)

if the coordinatool sent an archive request for project X in the middle of a batch of Ys then phobos will have no choice but to process it anyway, but if we can force the coordinatool to do archives in batch then we can use all the (archival) resources on a copytool without issue. OTOH we want to make sure the requests get processed before they time out, so some limit to each batches is in order.

Do you mean: if I have 3 puts on tag p1 and 2 on p2, the coordinatool sends the 3 puts then waits for them to finish then send the 2 for p2 then wait for them to finish... ? Or do you simply mean that you order the puts by project (and other tags)?

I didn't think of waiting to finish; we can do this if that becomes a problem in the future but for now just ordering is fine. The problem isn't that at any given time there are 3 (p1)/2 (p2) requests, it's that if you keep getting a stream of requests from robinhood policies currently running you'll keep getting p1/p2 requests while the current p1 requests are processed. So we should keep p2 requests on hold until we've processed say 1h of archives for p1, then switch to p2 and keep any new p1 requests on hold for a while. Of course if there is no work for p1 left because we archived fast enough then there's no problem and p2 requests can be sent immediately, but archive requests coming from policy should come faster than we can archive immediately, so the list should never become empty here -- if it does then it means the tape movers are idle and we can afford the tape mount time.

So to rephrase, the scenario is more like:

A limit we can hit here is the lustre max active requests setting: this doesn't work as well if lustre cannot send all the pending requests to the coordinatool, but in this case there will still be a batching of the maximum number of requests at a time (if that is e.g. 10k and we switch because of it, we'll have at least 10k requests to process for the other tag before coming back to p1), so for archives I think it's ok.

Another problem is the start of the batches; in practice I think you'll get the 100 p1 requests before you get any p2 requests, but if the policy runs start exactly at the same time you have few enough requests that the coordinatool might send them all (because there are enough movers available); in this case we might have something like send 3 p1 archives, 6 p2 archives, 9 p1 archive and finally by then we'd have enough requests pooled and a big batch of p2.. That will cause a few extra mounts, but I don't think that'll happen often in practice.

  • archives get queued in a different list per full hint (tag=riskA,tag=projectX,tag=n[0-3])

So each set of tags would mean a different list right? Not just a list per project for example. Meaning riskA,projectX,n0 riskB,projectX,n0, riskA,projectY,n0 and risk1,projectX,n1 would all be on different lists?

Yes, each would be a different queue. My understanding is that they'll end up on different tapes (e.g. riskA would be normal writes and riskB would be a mirror copy, so tapes won't be shared). The coordinatool cannot know how to split the different tags; we could teach it all the projects but that doesn't scale, so I think it's best to regroup by full tag:

  • the queues themselves are in an ordered list

  • when a copytool requests work, it picks work from the first n[0-3] matching queue in that list

What do you mean by this? What is "the first n[0-3] matching queue"? Matching with what? The copytool?

I am talking about the new archive_on_hosts tag=n0 mover1 setting in the coordinatool -- mover1 should only take requests for tag=n0, so if the first 4 lists are p1,n1 / p3,n4 / p2,n0 / p1,n0 then when we schedule work for mover1 it should skip the first two n1/n4 lists and take work from p2,n0 first. I worded it as a search for each requests but in order to avoid splitting a project needlessly I think a mover will take the whole list, e.g. once mover1 has started taking work from p2,n0 then other movers will not be able to take work from it. Right now stanford only has one mover per n[0-3] anyway, but even if there are more I think we'll want to prioritize keeping all p1 data on the same few tapes rather than spread them on multiple tapes (through multiple movers -- depending on the phobos algorithm that might also cause tape stealing from one mover to the other as well if tape criteria make a mover think that tape is better...); but that is a detail.
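In pseudo-code, the matching I picture looks something like this (hypothetical, names invented, archive_on_hosts parsing not shown):

def pick_queue_for_mover(mover, mover_tags, ordered_queues, claims):
    """Return the hint of the first non-empty queue matching one of the mover's
    allowed tags (its archive_on_hosts entries), skipping queues already
    claimed by another mover so a project isn't split needlessly."""
    for hint, queue in ordered_queues:             # e.g. ("tag=p2,tag=n0", [...])
        if not queue:
            continue
        if claims.get(hint) not in (None, mover):  # another mover already owns it
            continue
        if any(tag in hint for tag in mover_tags):
            claims[hint] = mover
            return hint
    return None

# With hints ordered as ["tag=p1,tag=n1", "tag=p3,tag=n4", "tag=p2,tag=n0",
# "tag=p1,tag=n0"] and mover_tags = {"tag=n0"}, mover1 skips the n1/n4 queues
# and claims "tag=p2,tag=n0" first.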

If you send a batch of puts to a copytool, you expect this batch to be stored on tapes that match the tags available. Since in Stéphane's configuration there are 4 drives per copytool, it means that your batch will be split roughly in 4 and written on 4 tapes using 4 drives (or less if there is mirroring involved). Depending on the use case, this may not be desirable. Spreading data across tapes for a single project increases the likelihood of having to load more tapes for restores. So in practice you may want to spread the load of the different projects on several drives instead to have a better data locality and reduce the work at restore. So if you have no mirroring for example, you could write files of 4 different projects on one copytool at the same time. If phobosd is able to do this, you just need a tag aware locate for archives in the coordinatool. You can simply send all the writes you want (using the answer from locate) to the copytools without worrying about unnecessary load/unload. We already do this for reads. Doing it for writes is (maybe) not that hard.

Hmm, that is a good point: is it better to force a max number of drives per tag so we get better data locality, or to focus on write throughput... Right now phobos does not have enough logic on writes, so there is no choice anyway, but if phobos becomes able to schedule writes a bit more smartly we should add a locate-like call during archival with (tag,size) as well:

So I don't think it's incompatible, just a first step in the grand scheme: first some basic scheduling in coordinatool, then we can improve it.

martinetd commented 5 months ago

For restores:

A couple of things. This is not accurate. phobos_locate will actually chose a node to send the request to.

Ah, so this isn't so much a locate as a real scheduling call, that is good. In this case I agree phobos knows best, so the coordinatool implementation should be made simpler, but we still need some grouping imo; just with much shorter batch size than writes. The problem is that with restores if you have more requests on more tapes than currently processable, you'll still have to pick a mover multiple times, so we might end up with a queue like mover1: [file@tape1, file@tape2, file@tape3, file@tape1] -- even if the mover has been selected by locate we won't know about the file@tape1 hint and we'll cause that tape to umount/remount. So this is actually exactly like the "phase 2" of archivals: phobos locate should return not a mover but a (mover, tape ID, number of drives required) tuple, like for archival. The number of drives required here is a bit of a guess (for raid5, if a tape has problems it might get messy), but in general it should hold and allow scheduling just enough requests that can be processed in a generic way. In case of HSM, the number of drives required can be 0 if the file is on disk; at that point the request doesn't need to wait for tapes, so it can be sent at any time and can get different scheduling again. I don't think that'll be a problem, but I don't want to handle too much too fast; iterative steps that are simpler to implement are better.

Ultimately the problem is that phobos doesn't have a full view of the requests, so when I'm saying "implement in coordinatool" I really just mean that this should be done at coordinatool level where we know all the pending requests, but this can be hidden in a single phobos_locate() call; we just need to get more infos from it. If we can get phobos to drive the coordinatool scheduling more smartly then we don't need to leak too much implementation details.

Anyway, reads aren't a priority right now (all data is on lustre disk), I think we can discuss this again later after writes are done.

position on tape

right so LTFS hides that info as I expected, we can probably remember the order in which the files were written to the tape as an approximation but that is probably not worth worrying about right now anyway; also storing this for later. (Would be interested to compare how fast you can read e.g. 10 sequential files vs. same 10 files out of order on the tape first, e.g. just mount it and time cat > /dev/null in written order vs random order for a few randoms.. I don't have real hardware so the VTL isn't representative here)

If you need implement this in grouped_read, this doable. The current algorithm processes all the requests on one tape before picking the next one. It also inserts new requests in the current queue. So if you keep receiving requests for this tape, it will not be unloaded until you have received all of them.

I'll also check that before doing any work on reads; thanks.

courrierg commented 5 months ago

I think that is fine in practice as long as batches are big enough and that not all tapes are reused, e.g. it's not p1, p2, p1 but p1 x100 then p2 x100 when there is no more p1 requests available. So at the edge it's possible a p1 tape will be umounted for a p2 tape while some writes to p1 still wait, but hopefully we can get phobos to wait for the last few p1 writes and send them to the mounted drives? (hm, I guess that's not implemented so we have no such guarantee right now, coordinatool could optimize this by waiting until only n (drives available for archive on that mover) requests are in progress on the mover before sending requests for the next batch but I think that's micro-optimizing too much and in the raid5 case won't work anyway (don't know how many drives are used); so it's something phobos will have to be able to handle if that becomes a problem: it should know that there are still requests that could be fulfilled by currently mounted tapes so should prioritize sending to the currently mounted tape imo)

Reading your answer makes me realize that what I said is not entirely accurate. If you have tape A in drive 1 and tape B in drive 2, writes with tags for tape A will try to go to drive 1 (same for tape B). What can happen is that when drive 2 finishes its current I/O, a new request will be scheduled on it. FIFO does not guarantee that the request will be for tags of tape B. It could try to schedule a write for tags of tape A, see that tape A is busy, see that drive 2 is not, unload tape B from drive 2, then load a new tape with the same tags as tape A. But the issue I mentioned still exists. We probably need to enforce a real FIFO at some point. Or have a better I/O scheduler for writes.

Another problem is the start of the batches; in practice I think you'll get the 100 p1 requests before you get any p2 requests, but if the policy runs start exactly at the same time you have few enough requests that the coordinatool might send them all (because there are enough movers available); in this case we might have something like send 3 p1 archives, 6 p2 archives, 9 p1 archive and finally by then we'd have enough requests pooled and a big batch of p2.. That will cause a few extra mounts, but I don't think that'll happen often in practice.

I definitely think we should implement something smarter at Phobos' level. These issues should not exist. You should be able to say "I know I have 4 drives on this copytool, I should be able to send 4 different sets of tags to this copytool". Or something like that. But given that Phobos doesn't do it yet, your solution seems to be a good workaround.

* ideally I'd even go further and try to regroup as much as possible (e.g. per directory on lustre), as restores will likely be grouped by directory as well

There is a feature that we want to implement in Phobos to support this. We call it "grouped put" for now. The idea is to be able to do exactly this: group related objects on the smallest number of tapes possible. How this will be implemented is not clear yet, and neither is how the coordinatool will integrate with it.

I am talking about the new archive_on_hosts tag=n0 mover1 setting in the coordinatool -- mover1 should only take requests for tag=n0, so if the first 4 lists are p1,n1 / p3,n4 / p2,n0 / p1,n0 then when we schedule work for mover1 it should skip the first two n1/n4 lists and take work from p2,n0 first. I worded it as a search for each requests but in order to avoid splitting a project needlessly I think a mover will take the whole list, e.g. once mover1 has started taking work from p2,n0 then other movers will not be able to take work from it. Right now stanford only has one mover per n[0-3] anyway, but even if there are more I think we'll want to prioritize keeping all p1 data on the same few tapes rather than spread them on multiple tapes (through multiple movers -- depending on the phobos algorithm that might also cause tape stealing from one mover to the other as well if tape criteria make a mover think that tape is better...); but that is a detail.

This seems like a good strategy. For the tape stealing, Phobos will not swap tapes with another daemon as long as the tape is locked by the daemon (i.e. as long as the tape is loaded in a drive). What can happen though is that when copytool 1 decides to unload tape A, copytool 2 can choose this tape and reload it. The fact that copytool 2 chooses tape A over another one doesn't have a performance cost. If copytools favor tapes with the most space available, this won't happen in practice. Although this strategy has other issues. But it's as long to load tape A as any other tape for copytool 2. What would be better is that the coordinatool never sends requests that could be done on tape A to copytool 2. But if we have several movers for the same tags as you mentioned, this will not be possible. So this issue will likely always be there no matter how smart Phobos and the coordinatool are. And in practice, phobosd should only unload a tape when it is full. So the tape won't be chosen again by another copytool.

* we cannot guess parity (single/raid1/raid5) for a given tag, so cannot know how many queues a mover can handle; we want to avoid sending more work than can be written in parallel to avoid remounts; so the locate call will need to return how many movers are consumed as well.

locate can return this kind of information but I feel like this is the wrong way to go. In the long run, the coordinatool should not have to worry about remounts. I understand that you want a solution that works now. But I don't think it is worth developing a locate at put which would return the number of drives involved. Ideally, what the coordinatool needs is an indication of the amount of work each copytool has. Then decide where to send the next request based on this. This will of course depend on the number of drives per copytool.

* the queues index will just change from the lustre hint (tag=risk,tag=proj,tag=n) to some selected tape ID returned by the locate (for raid1/raid5 I assume the tapes are grouped? so that can be a leader tape ID or something)

Not yet, but this is planned. For now, layouts can choose any tapes they want for an I/O. This will change (soon ?).

courrierg commented 5 months ago

Ah, so this isn't so much a locate as a real scheduling call, that is good.

Good point, we should probably rename it when we implement the put version. :)

In this case I agree phobos knows best, so the coordinatool implementation should be made simpler, but we still need some grouping imo; just with much shorter batch size than writes. The problem is that with restores if you have more requests on more tapes than currently processable, you'll still have to pick a mover multiple times, so we might end up with a queue like mover1: [file@tape1, file@tape2, file@tape3, file@tape1] -- even if the mover has been selected by locate we won't know about the file@tape1 hint and we'll cause that tape to umount/remount.

I'm not sure I understand what you are trying to say here. If you send all the restores you have to the copytool returned by locate, you should not have any useless mount/unmount. Unless of course a restore arrives right after a tape is unloaded, but this issue will always happen, even at the coordinatool level. Now if the coordinatool tries to hold onto some requests to avoid overloading the copytool, then yes, you might have issues with unnecessary mount/unmount because this is not how the locate was designed. Now if that's really what you want to do, why not, but the locate will need rework. In practice, I think the current locate + grouped read should prevent most unnecessary load/unload.

Ultimately the problem is that phobos doesn't have a full view of the requests, so when I'm saying "implement in coordinatool" I really just mean that this should be done at coordinatool level where we know all the pending requests, but this can be hidden in a single phobos_locate() call; we just need to get more infos from it. If we can get phobos to drive the coordinatool scheduling more smartly then we don't need to leak too much implementation details.

Yes, I agree. My current understanding is that you need what is available for reads also for writes (i.e. locate on tags + grouped write per tags).

right so LTFS hides that info as I expected, we can probably remember the order in which the files were written to the tape as an approximation but that is probably not worth worrying about right now anyway; also storing this for later. (Would be interested to compare how fast you can read e.g. 10 sequential files vs. same 10 files out of order on the tape first, e.g. just mount it and time cat > /dev/null in written order vs random order for a few randoms.. I don't have real hardware so the VTL isn't representative here)

My understanding of the literature is that reading sequentially is not optimal and can even be near the worst case because of the way data is stored physically on the tape. I would guess that random and sequential are not that far apart. We also plan on working on this at some point but it is not a high priority for us right now.

martinetd commented 5 months ago

locate can return this kind of information but I feel like this is the wrong way to go. In the long run, the coordinatool should not have to worry about remounts. I understand that you want a solution that works now. But I don't think it is worth developping a locate at put which would return the number of drives involved. Ideally, what the coordinatool needs is an indication of the amount of work each copytool has. Then decide where to send the next request based on this. This will of course depend on the number of drives per copytool.

I think we're in agreement here (there is no need to keep separating the two), if coordinatool properly calls into phobos it's equivalent.

Long term, the coordinatool should call into phobos for scheduling; we don't want to configure everything (policies for each tag, number of drives on each mover, etc.) -- what I've done with archive_on_hosts is just a first step because it was quick to do, and the first step I've suggested here (just group by tag) is basically a no-op scheduler, but I agree we want to call into phobos every time and let it do its job. If phobos is called early it has access to the same information as coordinatool wrt pending requests, so it can make the appropriate decision of sending this put to these tapes on this mover early in coordinatool, and coordinatool/movers "just" have to abide by it.

Ultimately you're correct that coordinatool should not have to worry about the number of drives per mover/tapes per archive. I suggested that as the most straightforward mapping for coordinatool's internal scheduling in the way I picture it, but just selecting the client directly at phobos level might be enough if the "phobos_schedule()" call can also say "don't schedule this request yet" and is also called again regularly on these tapes... The problem I see is that phobos has no way to call back into coordinatool to say "okay, now you can schedule this next batch", so at this point part of the scheduling has to be internal to coordinatool, but perhaps a few more hooks can work as well to avoid the double work... Something akin to passing the request to an opaque phobos queue, and phobos deciding when to take it out of the queue and into a mover when a mover becomes free? e.g.
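Something along these lines -- completely hypothetical, none of these calls exist today, names invented just to sketch the shape of the hooks:

class PhobosScheduler:
    def enqueue(self, request):
        """Hand the enriched request over to phobos instead of keeping it in
        coordinatool's own queues."""

    def next_work(self, mover, free_drives):
        """Called whenever a mover asks for work: phobos decides which queued
        request(s) to release to that mover, or returns nothing so the mover
        waits for an already-mounted tape instead of triggering a remount."""

    def request_done(self, request, status):
        """Completion feedback, so phobos can release its tape/drive
        reservation and move on to the next batch."""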

Note this is different from enrich and having mover queues like we do for restores: having a mover queue is not flexible enough if you have multiple tags, as you could overflow a mover with requests for a single tag while having other tags in the backlog. phobos would need to be called every time, and would have to ensure at least two requests for each tag are consistently scheduled to avoid downtime between done and next recv.

Also note that if we do this kind of scheduling we don't even need to separate archive vs restore, this archive/restore QoS can and should also be done at phobos level so that when a mover asks for more work after doing an archive phobos could decide if it is more appropriate to handle a recv request or keep doing archives more smoothly.

If that sounds ok for you, I'll try to see if I can make the "one queue per lustre hint" v0 work match this API so we can plug in phobos directly later; internally this can be done exactly as I was describing previously, just hidden neatly behind scheduling calls and separate list nodes in structs a bit.

  • the queues index will just change from the lustre hint (tag=risk,tag=proj,tag=n) to some selected tape ID returned by the locate (for raid1/raid5 I assume the tapes are grouped? so that can be a leader tape ID or something)

Not yet, but this is planned. For now, layouts can choose any tapes they want for an I/O. This will change (soon ?).

That's very good to know, I think it's definitely better to group at tape level to ensure restores are grouped correctly (wouldn't want to have to mount 3 different tapes to restore files that were sent in a raid1 batch); Stanford doesn't use raid1/5 so this isn't an immediate problem here, and I'll leave this up to you before someone starts using these in production.

(Also not replying to the restore part yet, but I've read it; I want to focus on archives short term & in this issue. I'll open a new issue for restore to continue discussion once archives have taken shape)

martinetd commented 4 months ago

Sorry for the delay, I was thinking about what kind of hooks we need/want and how to manage the scheduling. It's not as obvious as I thought, but can probably work out... I think?

long post warning, mostly to organize my own thoughts; feel free to skip straight to the end.

current behaviour

Full recap of current queuing/scheduling

Here's what we have currently for queuing

  1. request comes in (hsm_action_enqueue() from lustre or hsm_action_enqueue_json() from redis on recovery)
  2. hsm_action_enqueue_common() calls hsm_action_node_enrich(), which will add the request either directly to a client queue or to the global queue:
     a. if mappings are set, apply mapping to archive request
     b. otherwise phobos_enrich(): calls phobos_locate() with oid... This trickles down to layout_raid1_locate() to pick the best host to use (unrelated note: raid5 has no locate and layout_locate will segfault if called?)

Scheduling is done on these conditions:

The scheduling itself does:

data structures

as far as scheduling goes, we have:

state (available everywhere) has:

each client also has:

each hsm request (hsm_action_node) has some details used in case of cancel and an embedded list node, but nothing really related to scheduling.

current hooks

as listed above we only have hsm_action_node_enrich(hsm_action_node), at enqueue time and again immediately before a request is sent, and schedule_can_send(client, hsm_action_node) after re-enriching/before sending.

forward

problems

The current approach has a few limitations (we're not addressing everything here):

plan

Given the above I don't think it's a good idea to try to implement some 100% custom scheduling, e.g. adding opaque data to structs and making hooks manage it is much more complexity than it's worth.

1. simple batching

For batching we don't need to modify the queues as we'll keep it simple:

That should be straightforward enough to implement quickly and won't be too intrusive in the code -- in particular this is a bit less efficient than the "one queue per lustre hint" I had suggested in my previous comment, but it's less code and I don't think that'll be a problem in practice given we'll only need to go through the whole lists if no other work is happening or once per deadline, so it should be acceptable even if the lists grow very big.

2. phobos scheduling for archives

We talked about a phobos_locate() for put; I still think it makes sense in theory.. But ultimately I don't see the benefit. Do we get something from phobos being able to schedule puts at this point?

If we get path on lustre as part of the enrichment and start more complex "grouping by directories" it might be worth it, as phobos will be able to get back the tape that was used for that directory in an earlier batch that just coordinatool won't be able to do, but right now I don't see any benefit. Even if we do grouping by directory, I think the grouping by tag can still be left to coordinatool to avoid db round-trips in a first step.

So timeline would be something like:

3. improvement of phobos scheduling for restores

[off topic] I had misunderstood https://github.com/phobos-storage/phobos/issues/10#issuecomment-2166280345 's explanation of phobos_locate() to be what it currently does, but it currently doesn't reserve a tape/mover if the tape isn't mounted anywhere (layout_raid1_locate() is way too complicated...); so I think my comment still stands - we're not considering drive-by mounts, e.g. this scenario:

With the current code we'd basically need the same as the "when we pick a new tag go through all global queue requests" of the simple batching for archives. Not having thought through this at all I think that might be good enough, and that sounds better to me than allocating a mover early on during initial request enqueue - we should keep 'free' requests in the global queue for as long as possible so they can get scheduled to the first available mover ASAP.

Like puts, having restores grouped by tape group would allow more efficient can-send checks, so returning that info instead of a host name for the first enrich is probably better. The later enrich can and probably should stay, to check the tape wasn't mounted elsewhere in the meantime for some reason; it's less work to check all the time than to make a mistake.

[/off topic]

Anyway this point is far off the road, I'm just dumping my thought for myself later, let's rediscuss restore improvements later.

So tl;dr: