casperdcl opened 2 years ago
Does this really belong to the CML repository? 🤔
Well, even though the implementation would need to be done in TPI, it would also need to be exposed in CML (unless we make it the default behaviour).
If custom images are/become supported, would that be better to handle on the image front?
Images could handle formatting, but I'm not sure whether that would be enough - surely the underlying filesystem needs to support it (i.e. be block-like)?
I don't follow; can't you build an image using a different FS? Yes, the FS needs to support reflinks for DVC to take advantage of them, and setting this up makes sense to me on long-lived systems to reduce disk space consumed, but does it make sense for a cml runner use case, where the main feature is the ephemeral aspect of the training instance?
This belongs to terraform-provider-iterative
Note: as per https://github.com/iterative/cml/issues/561#issuecomment-871019350, we've chosen to use object storage instead of block storage for caching.
If custom images are/become supported, would that be better to handle on the image front?
Yes, if you're willing to build custom images and use the same disk for both operating system and data. We already support this scenario.
[...] surely the underlying filesystem needs to support it (i.e. be block-like)?
Block filesystems will only work on block devices... or on loop devices. 🤔 What other scenarios do you have in mind? Putting a block filesystem on top of object storage? 🙃
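For illustration, a block filesystem can indeed be stood up on a loop device even without a dedicated block volume. A minimal sketch (the backing-file path, size, and mount point are just examples, and the loop device name depends on what `losetup` returns):

```sh
# create a sparse backing file and attach it as a loop device
truncate -s 50G /var/lib/cache.img
sudo losetup --find --show /var/lib/cache.img   # prints the device, e.g. /dev/loop0

# put a reflink-capable filesystem on the loop device and mount it
sudo mkfs.btrfs /dev/loop0
sudo mkdir -p /mnt/cache
sudo mount /dev/loop0 /mnt/cache
```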
I don't follow; can't you build an image using a different FS? Yes, the FS needs to support reflinks for DVC to take advantage of them, and setting this up makes sense to me on long-lived systems to reduce disk space consumed, but does it make sense for a cml runner use case, where the main feature is the ephemeral aspect of the training instance?
For a CML runner, no, this isn't a requirement. But that's not the use case we are talking about here. In DVC, we would like to be able to start a (potentially long-lived) machine and run a lot of DVC experiments on that machine that will all share a common cache and would benefit from being able to reflink to/from that common cache. So yes, having access to standardized images that use btrfs/xfs instead of ext4 would be very nice to have.
Yes, if you're willing to build custom images and use the same disk for both operating system and data. We already support this scenario.
If we (on the DVC side) need to figure out how to make our own default images for each platform (aws/gcs/etc), then we can do that, but this is still something I would expect to be provided by TPI.
And just to be clear, I understand that block volumes can only be attached to a single machine instance at a time. I'm not talking about having multiple machine instances sharing a DVC cache. I'm talking about having multiple jobs running (either sequentially or in parallel) on the single machine instance, and being able to take advantage of having an FS that supports reflinks on that single instance.
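As a rough illustration of that single-instance workflow (not an existing TPI feature; the paths and the `reflink,copy` fallback order are just one possible setup), each job on the instance would point DVC at a shared cache directory on a reflink-capable volume:

```sh
# assuming /data is mounted on an xfs/btrfs volume that supports reflinks
dvc config --global cache.dir /data/dvc-cache
# prefer reflinks, fall back to plain copies on filesystems without support
dvc config --global cache.type "reflink,copy"

# quick sanity check that the filesystem really supports reflinks
cp --reflink=always /data/some-file /data/some-file.clone
```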
I guess we have really enjoyed how independent each of these tools is, and I could see how `dvc exp` and TPI's `task` could work together, but I'm missing what the Iterative vision for this is really trying to accomplish.
IMO this is reaching beyond what a "terraform provider" should try to do. If I were to set up a long-lived instance for a Data Scientist, I would probably use the regular aws/gcp terraform provider, configure it with something like ansible, and they could use a remote connection with vscode.
I guess having these premade images would be nice, and having an additional data disk that you keep past the life of the instance and can mount to another machine later could also be nice, but keeping large EBS disks can get pretty pricey over the long term.
I guess we have really enjoyed how independent each of these tools is, and I could see how `dvc exp` and TPI's `task` could work together, but I'm missing what the Iterative vision for this is really trying to accomplish.
This does not necessarily have to be provided by `iterative_task`.
IMO this is reaching beyond what a "terraform provider" should try to do. If I were to set up a long-lived instance for a Data Scientist, I would probably use the regular aws/gcp terraform provider, configure it with something like ansible, and they could use a remote connection with vscode.
This is also more along the lines of what the DVC team was thinking. So terraform-provider-iterative would be used to define the cloud-agnostic config for provisioning standardized machines across aws/gcp/etc (i.e. using `iterative_machine` strictly for machine provisioning, startup and teardown, but doing everything else separately from terraform).
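A sketch of that split, assuming a playbook and inventory with hypothetical file names that handle everything terraform doesn't:

```sh
# provision/start the machine with terraform (iterative_machine or the plain aws/gcp providers)
terraform init && terraform apply -auto-approve

# configure it outside terraform: format/mount the data disk, install DVC, set up the shared cache
ansible-playbook -i inventory.yml setup-dvc-cache.yml

# ...run experiments over ssh / vscode remote...

# tear the machine down when done
terraform destroy -auto-approve
```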
I think this can be wrapped up as a feature for mounting cache/persistent volumes. Doesn't quite enable what was described by @pmrowla but is a step in that direction.
The usual default is `ext4`, which doesn't support reflinks. It would be great to make it easy to choose (or maybe default to) a different filesystem. Probably needs block-based storage (e.g. AWS EBS & formatting).
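In practice, the "formatting" part would amount to something like the following on first boot (shown only as a sketch; the device name varies by cloud and instance type, and `/dev/xvdf` and `/data` are illustrative):

```sh
# format the attached block volume (e.g. an EBS volume appearing as /dev/xvdf) with reflink support
sudo mkfs.xfs -m reflink=1 /dev/xvdf
sudo mkdir -p /data
sudo mount /dev/xvdf /data

# confirm reflinks are enabled
xfs_info /data | grep reflink
```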