holepunchto / hyperdrive

Hyperdrive is a secure, real time distributed file system
Apache License 2.0

Expose `undownload` API #228

Closed RangerMauve closed 4 years ago

RangerMauve commented 6 years ago

As discussed on fritter and probably elsewhere, it'd be nice if there was an easy way to "undownload" certain files from the local storage.

A great use case for this is removing old files from your cache that you're unlikely to use anymore.

I believe a useful API would look like hyperdrive.undownload(path, version?) where path is the path to the file or folder that should be removed, and version is optionally the point in history to remove the file from.

If version history is supplied, go through the dat history from that point and find the latest time that the file or folder was modified. Then use this.content.undownload to remove it (or the folder contents) from the content feed.

If version history isn't supplied, traverse the entire history and undownload every instance of the file.
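A minimal sketch of the range selection described above, not hyperdrive's actual implementation. It assumes the history has already been read into an array of hypothetical entries shaped like `{ version, path, offset, blocks }`. The names `matchesPath` and `rangesToUndownload` are made up for illustration.

```javascript
// Hypothetical history entry shape (an assumption, not hyperdrive's own format):
// { version: number, path: string, offset: number, blocks: number }

// True if entryPath is the target itself or lies inside the folder `target`.
function matchesPath (entryPath, target) {
  if (target === '/') return true
  const folder = target.endsWith('/') ? target : target + '/'
  return entryPath === target || entryPath.startsWith(folder)
}

// Collect the content-feed block ranges that undownload(path, version?) would remove.
function rangesToUndownload (history, path, version) {
  const toRange = e => ({ start: e.offset, end: e.offset + e.blocks })
  const matching = history.filter(e => matchesPath(e.path, path))

  // No version supplied: every recorded instance of the file or folder contents.
  if (version === undefined) return matching.map(toRange)

  // Version supplied: for each matching path, keep only the latest entry
  // written at or before that version.
  const latest = new Map()
  for (const e of matching) {
    if (e.version > version) continue
    const seen = latest.get(e.path)
    if (!seen || e.version > seen.version) latest.set(e.path, e)
  }
  return [...latest.values()].map(toRange)
}
```

Each returned range is half-open (`start` inclusive, `end` exclusive), so it could be handed to whatever call ends up removing blocks from the content feed.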

This would be useful to have in the DatArchive API in Beaker in the future so that applications could provide users with control over which data should be stored locally.

RangerMauve commented 6 years ago

Err, looking at the existing API it should probably be hyperdrive.checkout().undownload(path) instead. Where you undownload all the content up to the currently checked out version.

RangerMauve commented 6 years ago

Sorry for the spam, but would a PR be welcome for this functionality?

pfrazee commented 6 years ago

Name

We should call it something else because we have hypercore.undownload() which simply means cancel the active download -- which is also an API we could use. Maybe .clearCache() ?

Perms / Risk

Do you think there are any security policy or management concerns to consider? I suppose the concern applies to download() too: are there times when any app having access to these APIs could just be an enormous headache, because of rampant download/upload calls?

Since we don't really have a good answer for download() then maybe we shouldn't let it hold up undownload() either for now.

RangerMauve commented 6 years ago

> hypercore.undownload() which simply means cancel the active download

Whoops, I totally misunderstood what that does. Shoulda looked at the hypercore docs instead of skimming the hyperdrive code. 😅

Regarding the name, I still like undownload because it's obvious that it's the opposite of hyperdrive.download(). Though I agree that clearCache or similar would be better in order to stay aligned with the hypercore naming conventions.

I agree there shouldn't need to be any special concerns for this for now since there aren't any for download(). I could see there being trouble if multiple applications are trying to do something fancy with the same archive, though.

As an aside, I think this will be useful to have for Beaker itself so that you could include a UI for managing cached content at a more fine-grained level.

pfrazee commented 6 years ago

Fair point about the name. I'll think about it. @mafintosh do you have any preference on this?

I agree that a Beaker UI for managing this would be good. Version & history management is really minimal right now, and a lot of that has to do with versions being really early. We know there's going to be more complexity once multiwriter lands, and we know we want tagged versions, so it feels premature to dive into that now.

mafintosh commented 6 years ago

In hypercore we use .clear(range) for clearing downloaded data. Using that name or clearCache sounds good to me if this is actually clearing data.

RangerMauve commented 6 years ago

Kinda off topic, but are multiwriter dats still going to have a linear history from whatever the current checkout is?

pfrazee commented 6 years ago

@RangerMauve ish. Multiple linears that weave together. The version number becomes a version vector.

RangerMauve commented 6 years ago

> Multiple linears that weave together. The version number becomes a version vector.

I gotta say, that's cyberpunk as heck. 😂

So, would a PR for clearCache(path?) be welcome, or would y'all prefer to do it yourselves?

pfrazee commented 6 years ago

@RangerMauve 😎 aw yeah

Okay let me think. The question of how to clear historic versions (or not) seems important. I feel like it'd be useful to clear the cache of a file/folder/all for...

Using archive.checkout(v).clearCache() satisfies the last one but not the first two.

Perhaps we support the versioned cache-clear, PLUS we add two params for specifying whether the clear should delete the given version's cache and/or the previous versions:

archive.clearCache('/', {current: true, historic: true})

These would both default to true. If current == false then the given checkout is preserved. If historic == false then previous versions are preserved. Future versions are never deleted.
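One way to read the flag semantics above, as a sketch only. It assumes the caller already has a list of version numbers at which the path changed, and it treats "the given version" as the latest change at or before the checkout. The function name `versionsToClear` and its inputs are hypothetical and not part of hyperdrive.

```javascript
// versions: version numbers at which the path changed (hypothetical input).
// checkout: the version the archive is currently checked out at.
// opts.current / opts.historic: the two proposed flags, both defaulting to true.
function versionsToClear (versions, checkout, opts = {}) {
  const { current = true, historic = true } = opts

  // Assumption: the "given version" is the latest change at or before the checkout.
  const upToCheckout = versions.filter(v => v <= checkout)
  const given = upToCheckout.length ? Math.max(...upToCheckout) : undefined

  return versions.filter(v => {
    if (v > checkout) return false // future versions are never deleted
    if (v === given) return current // the checked-out version's cache
    return historic // everything before it
  })
}
```

With changes at versions 1, 4, 7 and 9 and a checkout at 7, the defaults select 1, 4 and 7; `{ current: false }` selects 1 and 4; `{ historic: false }` selects only 7.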

That seem decent?

RangerMauve commented 6 years ago

I'm not a huge fan of having four different types of clearCache, but after thinking about it more, I think it'll be more annoying to not have these two flags. Mainly because AFAIK it's hard to get the current (or previous) version number for a given file without traversing history, so you won't be able to easily represent "delete everything from the previous version and before".

pfrazee commented 6 years ago

@RangerMauve Yeah that's exactly what went through my head as I wrote it. One alternative is to have it work like hypercore's clear() where you pass in revision numbers/ranges. I'm not sure how that would work with multiwriter though, so I'm inclined to just roll with this

pfrazee commented 6 years ago

@RangerMauve I'm 👍 to a PR for this, or I can put it on my backlog. The changes will need to be made in https://github.com/beakerbrowser/beaker-core

RangerMauve commented 6 years ago

Cool, I was thinking it'd make sense to add it to hyperdrive first (since it'd be useful for non-beaker applications). I'm not sure when I'll be able to work on it, so let's race to see who has time first. :P

pfrazee commented 6 years ago

@RangerMauve oh that makes sense. In that case, I'd ask @mafintosh what function signature he wants for this, and then we can do our own thing with Beaker's web API.

RangerMauve commented 6 years ago

@pfrazee One concern relating to beaker integration: I think it should either fail or be a noop for dats that are being seeded.

pfrazee commented 6 years ago

@RangerMauve Dats that are being seeded have an "auto download latest" policy set. We can have the clear be a noop for latest but work on history

RangerMauve commented 5 years ago

Ok, one more. If you created the dat, should we allow things with read access to clear local storage? (since that could potentially delete data forever if you aren't using hashbase)

pfrazee commented 5 years ago

@RangerMauve probably not, for now.

The more I think about this, the more I feel local cache control as an API is wrong (including the existing download() API). The intention, I think, is to control

1) what's reliably available, and 2) what's fast to access.

The former intention needs something more sophisticated than local cache control, because (as you mention) over time we'll be integrating remote services which provide availability to data even if the dat is locally-owned. We'll effectively begin doing what https://github.com/KonstantinSchubert/zero does. For the latter, I feel like we need more time to understand apps & dat before we can design good APIs to solve.

RangerMauve commented 5 years ago

@pfrazee I agree with waiting to understand apps more. In the meantime it might be better for Beaker to provide more fine-grained control over what data is downloaded.

Would this functionality still be useful for hyperdrive? I think it would be useful for people attempting more advanced use-cases using the cli and also for making life easier for Beaker when the time comes.

pfrazee commented 5 years ago

@RangerMauve I guess the question I'd ask is, what are we trying to accomplish right now? If the answer is to reduce the cache usage, then maybe we ought to be talking about adding more sophistication to Beaker's internal cache management rather than pushing it into userland

RangerMauve commented 5 years ago

I'm coming here from talking to people on IRC. I think the main concerns I saw were related to the CLI and not even beaker-specific. They wanted to be able to undownload parts of dats that they have saved locally which they no longer needed.

In the beaker space, I don't think I've seen as much buzz, but people wanted to clear data from their device that would be replicated elsewhere. I think that this would be better left to be figured out by applications that manage your files.

pfrazee commented 5 years ago

@RangerMauve okay that makes sense, yes. Beaker would need such an api anyway.

I originally thought this issue was on a Beaker repo so I may have misunderstood

gmaclennan commented 4 years ago

Hi! Any further thoughts about this? It seems like what would be useful is an equivalent of hypercore.clear() with a way of specifying "clear all versions of file at this path" or "clear everything but the latest version of the file at this path". This would be really useful for replicating hyperdrives with frequently updated files, so that peers do not need to keep a complete history of old versions of a file in their local cache.

kevinejohn commented 4 years ago

I'm interested in this feature as well. Any updates on how to clear specific files or directories?

RangerMauve commented 4 years ago

You can clear a file by getting its stat object to get the start and end index for where it is in the content feed. From there you'll want to use some undocumented APIs to get the content feed (this can break any time) and use content.clear(start, end) to clear that file.

Here's an example of me getting the stat for the file, getting its content feed, and checking if a range has been downloaded: https://github.com/RangerMauve/hyperdrive-is-downloaded/blob/master/index.js#L4

This should be easy enough to adapt to invoke .clear instead of .has.
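A rough sketch of that adaptation, under stated assumptions: the stat object exposes `offset` and `blocks` for the file's position in the content feed, and the content feed has a hypercore-style `clear(start, end, callback)`. How the content feed is obtained is left to the caller, since that part relies on undocumented APIs. `clearFile` is a made-up name, not a hyperdrive method.

```javascript
// Clear the blocks of one file from the content feed.
// archive: anything with stat(path, cb), as hyperdrive provides.
// contentFeed: the drive's content feed (obtaining it is up to the caller).
// Assumption: stat.offset and stat.blocks locate the file in the content feed.
function clearFile (archive, contentFeed, path, cb) {
  archive.stat(path, (err, stat) => {
    if (err) return cb(err)

    const start = stat.offset
    const end = stat.offset + stat.blocks // exclusive end index

    // Nothing to clear for empty files.
    if (stat.blocks === 0) return cb(null, { start, end })

    contentFeed.clear(start, end, err => cb(err || null, { start, end }))
  })
}
```

Passing the feed in explicitly keeps the fragile part (reaching into the drive for its content feed) in one place that is easy to update if the internals change.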

RangerMauve commented 4 years ago

I don't have the time at the moment to push this through, but I think this is a super important feature to have and I think it's something that was present in the past.

okdistribute commented 4 years ago

I opened a PR to solve this issue, although I'm calling it clear rather than undownload because that is more accurate to the underlying data structure. undownload is already taken for a different function!