lbryio / lbry-sdk

The LBRY SDK for building decentralized, censorship resistant, monetized, digital content apps.
https://lbry.com
MIT License

Disk space management #1311

Closed kauffj closed 2 years ago

kauffj commented 6 years ago

As a user, particularly a mobile user, I want to be able to allocate a maximum amount of disk space used by the daemon.

The daemon should automatically manage my files and blobs to not exceed using this much space.

1311 is related.

tzarebczan commented 6 years ago

https://github.com/lbryio/lbry/issues/1171 is also related.

eukreign commented 6 years ago

@sayplastic is working on this for his first PR.

anbsky commented 6 years ago

@kauffj, the way I understand this task, the simplest working solution for now would be:

I'm planning on working off the lbryum-refactor branch, as per @eukreign's suggestion.

Please let me know if this works for you or if anything can be improved.

kauffj commented 6 years ago

@eukreign should set specification here.

However, the intention of this setting is not to error if there is not enough space, but instead to keep space usage below this figure and only error if it cannot figure out how to do so.

I suspect the code for handling this will end up in the blob and/or file manager.

eukreign commented 6 years ago

@kauffj I discussed this with @jackrobison at the standup today; there isn't an obvious way to determine which files/blobs can be deleted (too many edge cases to consider). It sounds like the ultimate solution is to generate blobs on the fly; this alone will cut storage in half across the board for everyone (but I'm not sure it's fair to assign something like this as a first trial task for @sayplastic, although it would definitely make for some very extra bonus points if he could do it).

I think a more straightforward solution would be to at least delete blobs when the corresponding file is no longer in Downloads directory (either it was deleted, moved or renamed). This should still be behind a configuration flag because some people just want to seed and don't necessarily care to have a copy of the combined file in their Downloads directory.
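
A minimal sketch of that idea, assuming the daemon keeps some record of which blobs belong to which combined file. `StreamRecord`, `prune_orphaned_blobs` and the `delete_orphaned_blobs` flag are hypothetical names for illustration, not existing SDK code:

```python
import os
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StreamRecord:
    """Hypothetical bookkeeping entry for one downloaded stream."""
    file_path: str          # where the combined file was written
    blob_paths: List[str]   # blob files that make up the stream


def prune_orphaned_blobs(records: Iterable[StreamRecord],
                         delete_orphaned_blobs: bool) -> List[str]:
    """Delete blobs whose combined file is gone; return the deleted paths."""
    deleted: List[str] = []
    if not delete_orphaned_blobs:
        # Flag off: seed-only users keep their blobs regardless.
        return deleted
    for record in records:
        if os.path.exists(record.file_path):
            continue  # file still in place, keep its blobs
        for blob_path in record.blob_paths:
            if os.path.exists(blob_path):
                os.remove(blob_path)
                deleted.append(blob_path)
    return deleted
```

A moved or renamed file looks the same as a deleted one here, which matches the behaviour described above.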

kauffj commented 6 years ago

@eukreign then I would suggest not pursuing this and instead focusing on fixing whatever underlying architecture issues prevent this from being a reality. Please open those and mark this as blocked.

I don't really see much of the point in pursuing alternatives that only offer a small portion of the benefit compared to this, especially when this feature will clearly be required eventually anyway.

kauffj commented 6 years ago

See https://github.com/lbryio/lbry/issues/1311 for example of users struggling with this.

Also came up at UNH hackathon, someone based their entire presentation on improving this.

osilkin98 commented 5 years ago

When it downloads you can just have it keep track of the amount of memory taken up in total and when you're downloading something new you can just write the data to a file buffer in chunks of a certain size (the size can be variable based on a user preference, written file size, etc.) and before writing it to the file, simply check if the chunk along with used memory will exceed a set preference. If so then handle the error accordingly by cleaning the file up and letting the app know that the content exceeds the user's memory setting.
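
The chunked check described above could look roughly like this. It is only a sketch: "memory" is read as disk space, and `write_with_quota` and `QuotaExceeded` are made-up names, not part of the daemon:

```python
import os
from typing import Iterable


class QuotaExceeded(Exception):
    """Raised when a download would push usage past the user's limit."""


def write_with_quota(chunks: Iterable[bytes], dest_path: str,
                     used_bytes: int, max_bytes: int) -> int:
    """Write chunks to dest_path and return the new total of used bytes.

    Before each chunk is written, check whether it would exceed max_bytes.
    If so, remove the partial file and raise QuotaExceeded.
    """
    written = 0
    try:
        with open(dest_path, "wb") as out:
            for chunk in chunks:
                if used_bytes + written + len(chunk) > max_bytes:
                    raise QuotaExceeded(
                        f"{dest_path} does not fit in the {max_bytes}-byte limit")
                out.write(chunk)
                written += len(chunk)
    except QuotaExceeded:
        # The file is closed at this point, so it is safe to clean it up.
        if os.path.exists(dest_path):
            os.remove(dest_path)
        raise
    return used_bytes + written
```

The caller is expected to catch `QuotaExceeded` and report it to the app.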

There should also be two settings implemented with this in order to properly handle running out of memory:

  1. Content Files are Temporary, and get deleted as soon as the user stops viewing them. This specific functionality should be replaced once I am able to implement a method of streaming content rather than downloading it.

  2. Files are cached and old files are removed when the maximum amount of memory is exceeded. This cache can be implemented using a priority queue which has the file with the oldest date as having the highest priority for removal. A Hash table is used for checking whether or not a piece of content the user wants to view is already downloaded or not.
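
A rough sketch of the cache in option 2, using Python's `heapq` as the priority queue and a dict as the hash table. `FileCache` is a hypothetical class that only does the bookkeeping; actually deleting the evicted files is left to the caller:

```python
import heapq
from typing import Dict, List, Tuple


class FileCache:
    """Tracks cached files and evicts the least recently accessed first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # Hash table: name -> (last access time, size in bytes)
        self._entries: Dict[str, Tuple[float, int]] = {}
        # Min-heap of (access time, name); may contain stale entries.
        self._heap: List[Tuple[float, str]] = []

    def __contains__(self, name: str) -> bool:
        return name in self._entries

    def touch(self, name: str, now: float) -> None:
        """Record an access so the file moves to the back of the queue."""
        size = self._entries[name][1]
        self._entries[name] = (now, size)
        heapq.heappush(self._heap, (now, name))

    def add(self, name: str, size: int, now: float) -> List[str]:
        """Register a new file; return the names evicted to make room."""
        if name in self._entries:
            raise ValueError(f"{name} is already cached")
        if size > self.max_bytes:
            raise ValueError(f"{name} is larger than the whole cache")
        evicted: List[str] = []
        while self.used_bytes + size > self.max_bytes:
            stamp, victim = heapq.heappop(self._heap)
            entry = self._entries.get(victim)
            if entry is None or entry[0] != stamp:
                continue  # stale heap entry left behind by touch()
            del self._entries[victim]
            self.used_bytes -= entry[1]
            evicted.append(victim)
        self._entries[name] = (now, size)
        self.used_bytes += size
        heapq.heappush(self._heap, (now, name))
        return evicted
```

Stale heap entries are skipped on pop instead of being updated in place, since `heapq` has no decrease-key operation.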

eukreign commented 5 years ago

@osilkin98

  1. The daemon doesn't know when a file is being watched or if user is done with the file. As for streaming, we are already working on a streaming solution specifically for videos (streaming doesn't solve the issue for all other file types though).
  2. Something along those lines of deleting files that haven't been accessed is what we're considering but this needs a lot more thought to cover various edge cases. We haven't settled on a good general solution for this yet.

If you want to start on this general problem I think the most useful feature to start with is a new daemon command which returns the total space used by files and blobs. Please see the last comment in this issue: https://github.com/lbryio/lbry/issues/1171#issuecomment-437423320
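
As a starting point, such a command could be backed by something like the sketch below. The function names and the shape of the returned dict are assumptions, not an existing daemon API, and the download directory is counted wholesale even though it may contain files unrelated to LBRY:

```python
import os


def directory_size(path: str) -> int:
    """Sum the sizes of all regular files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.islink(full_path):
                continue  # don't count symlink targets
            try:
                total += os.path.getsize(full_path)
            except OSError:
                pass  # file vanished between listing and stat
    return total


def disk_usage_report(blob_dir: str, download_dir: str) -> dict:
    """Hypothetical payload for a 'space used' daemon command."""
    blob_bytes = directory_size(blob_dir)
    file_bytes = directory_size(download_dir)
    return {
        "blob_bytes": blob_bytes,
        "file_bytes": file_bytes,
        "total_bytes": blob_bytes + file_bytes,
    }
```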

osilkin98 commented 5 years ago

@eukreign The daemon doesn't have to be and really shouldn't be responsible for that. I feel it's the user's responsibility to take care of unwanted files. Instead what could be done is the daemon could be given the space available as a command when downloading a new piece of content, which it ensures not to exceed. A callback function could be provided to the daemon to be used for when the amount of space exceeds a certain amount, which would just be defined by the user as a means to track files.

The callback function, really, could just be private to the app itself, but if one were at the command-line, they could provide their own.

eukreign commented 5 years ago

@osilkin98 The daemon and the desktop app are running in different processes and communicating via HTTP, how would you do a callback function in that context?

Also, why is it the responsibility of the desktop app to figure out available disk space instead of the daemon? I believe the desktop app should be primarily concerned with UI and not with calculating available disk space.

osilkin98 commented 5 years ago

@eukreign If the desktop app is just a user interface for the daemon, then the file tracking can be done by the daemon. We can avoid the problem of needing to know if the user is watching the content or not by keeping track of all the files downloaded by the daemon, and simply deleting the oldest content files when more space is needed. If the user is using an application and they're trying to download a piece of content, the assumption that they aren't currently using the oldest file downloaded is very reasonable.

eukreign commented 5 years ago

@osilkin98 it gets more complicated because:

  1. we want popular files to stay so that the DHT/P2P network is healthy and performant
  2. users may want to keep certain files to watch again (without internet connection) and would be upset if we delete any/all old content
  3. users likely want to keep content they themselves published
  4. how do you determine what "old" means in terms of deleting old files?
  5. once you have definition for "old", do you delete all "old" content or just some of it to make sure there is space, and if you only delete some percentage of it, do you delete blobs first or files first until reaching that percent of available limit?
  6. what should be configurable by the user in terms of space management strategies and what default space management strategies are best and most user friendly?

osilkin98 commented 5 years ago

@eukreign You bring up valid points but here's the thing. What I'm talking about only applies if the user specifies that they want files to be treated as temporary.

  1. The files would stay on the user's computer until they get deleted if they have the temporary setting enabled. The user also may not want to act as a host for whatever reason, and it should really be up to them as to whether or not they want to be doing that. Ideally, people who are hosting shouldn't have this setting on, but it's not a big issue if they do.
  2. If a user has the temporary setting enabled, then they're already anticipating that files will be getting deleted, therefore this is irrelevant.
  3. User published content isn't considered a temporary downloaded file.
  4. We use a priority queue that uses the file downloaded/last-accessed date as the key, I explained this above. If the file is accessed either by the user viewing it again, or by the DHT accessing it, then we update its date and its key in the priority queue.
  5. You delete only the amount you need.
  6. Whether or not files are treated as temporary, and the maximum size files can take up. That's it. The space management strategies I already discussed above. You can also prompt the user to compress the files, however that would cause a lot of overhead if the DHT needs to access them, and so compression should only be accessible to people who aren't hosting data; which is completely fine to assume because if someone already needs to conserve space, they likely aren't going to be wanting to host to begin with.

alyssaoc commented 5 years ago

related #586

kauffj commented 5 years ago

I think we should strongly consider this a requirement for the public mobile app.

alyssaoc commented 5 years ago

@eukreign we need to solve this before we release mobile app 1.0 and we need to release 1.0 yesterday

eukreign commented 5 years ago

I don't think we can get this done by yesterday but I can work on this after we merge the asyncio branch.

kauffj commented 5 years ago

The first version of this can be very basic. It is not a concern if the algorithm that chooses which blobs/files to keep is dumb.

lyoshenka commented 5 years ago

this only has one issue in it. i'm closing it

tzarebczan commented 5 years ago

Reopening - will attach related issues.

belikor commented 3 years ago

As commented in https://github.com/lbryio/lbry-desktop/issues/4634

This issue is partially taken care of by my library, lbrytools.

It basically inspects the top level directory of the subdirectories that hold the media files and blobfiles. If it crosses a limit in gigabytes, it will start cleaning up older files. It can delete media files (mp4, mkv, etc.), blobs, or both.

lbrytools.cleanup_space(main_dir="/home/user", size=1000, percent=90, what="media")

  3. users likely want to keep content they themselves published

Use a list of channels to never delete content from.

never_delete = [
    "@lbry",
    "@Odysee",
    "@samtime",
    "@RobBraxmanTech"
]
lbrytools.cleanup_space(main_dir="/home/user", never_delete=never_delete)

Probably another list for claims can be used; that is, these videos won't be deleted, regardless of author. This is currently not implemented in my tools.

  4. how do you determine what "old" means in terms of deleting old files?

Chronological order, by release_time or timestamp if the first is unavailable, as it happens in older streams.
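
The ordering rule can be sketched as follows. The flat dict layout with `release_time` and `timestamp` keys is a simplification for illustration, not the exact claim structure returned by the SDK:

```python
from typing import Any, Dict, List


def claim_age_key(claim: Dict[str, Any]) -> int:
    """Use release_time when present, otherwise fall back to timestamp."""
    release_time = claim.get("release_time")
    if release_time is None:
        return int(claim["timestamp"])
    return int(release_time)


def oldest_first(claims: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return claims sorted so the oldest come first."""
    return sorted(claims, key=claim_age_key)
```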

  5. once you have definition for "old", do you delete all "old" content or just some of it to make sure there is space, and if you only delete some percentage of it, do you delete blobs first or files first until reaching that percent of available limit?

Since the media files can be recreated from the blobs, we should delete the media files first; if it fails to clear enough space, then the blobs should be deleted. To clear the most space, both should be deleted.

lbrytools.cleanup_space(..., what="media")
lbrytools.cleanup_space(..., what="blobs")
lbrytools.cleanup_space(..., what="both")
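
The media-first order can be sketched independently of lbrytools. `free_space` below is a hypothetical helper that takes plain lists of paths, already sorted oldest first, and does not reflect how `cleanup_space` is implemented:

```python
import os
from typing import List


def free_space(media_paths: List[str], blob_paths: List[str],
               bytes_needed: int) -> int:
    """Delete media files first, then blobs, until enough space is freed.

    Returns the number of bytes actually freed.
    """
    freed = 0
    # Media files can be rebuilt from blobs, so they go first.
    for path in media_paths + blob_paths:
        if freed >= bytes_needed:
            break
        try:
            size = os.path.getsize(path)
            os.remove(path)
        except OSError:
            continue  # already gone or not removable, try the next one
        freed += size
    return freed
```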

  6. what should be configurable by the user in terms of space management strategies and what default space management strategies are best and most user friendly?

As much configuration as possible. At the moment I consider location of parent directory or partition, size in gigabytes, and percentage of use (90%). The cleanup will be done if the content goes above the percentage, and it should never cross the disk size, as we assume this is a physical limitation.

lbrytools.measure_usage(main_dir="/opt", size=1000, percent=90)
lbrytools.cleanup_space(main_dir="/opt", size=1000, percent=90, what="both")

By using different values of size and percentage we can test how this function works in many situations.
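
For reference, the size/percent rule works out as in the sketch below. The helper names are made up, and it assumes 1 GB = 1024**3 bytes, which may differ from what lbrytools uses internally:

```python
GIB = 1024 ** 3  # assumption: "gigabytes" means binary gigabytes


def usage_threshold_bytes(size_gb: float, percent: float) -> int:
    """Usage level, in bytes, above which cleanup should start."""
    return int(size_gb * GIB * percent / 100)


def needs_cleanup(used_bytes: int, size_gb: float, percent: float) -> bool:
    """True when current usage is above the configured percentage."""
    return used_bytes > usage_threshold_bytes(size_gb, percent)


def bytes_to_free(used_bytes: int, size_gb: float, percent: float) -> int:
    """How much must be deleted to get back under the threshold."""
    return max(0, used_bytes - usage_threshold_bytes(size_gb, percent))
```

With `size=1000` and `percent=90`, cleanup starts once usage goes above 900 GB.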

It seems to work okay, but more testing is probably needed for the case where many claims are downloaded and the disk suddenly becomes full.

lyoshenka commented 3 years ago

Something like this would be a Component. Take a look at https://github.com/lbryio/lbry-sdk/blob/master/lbry/extras/daemon/components.py for some components we already have. These are started when the daemon starts up (see daemon.py).

The conf setting itself would go into https://github.com/lbryio/lbry-sdk/blob/master/lbry/conf.py. Then the app could expose that somewhere on the Settings page.
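
To illustrate the shape of this, here is a standalone asyncio sketch of a setting plus a periodic task that starts and stops with the daemon. It deliberately does not use the SDK's real Component base class or conf machinery; `DiskSpaceSettings`, `DiskSpaceMonitor` and both setting names are hypothetical:

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class DiskSpaceSettings:
    """Hypothetical settings; 0 means the limit is disabled."""
    max_blob_space_mb: int = 0
    check_interval_seconds: float = 1800.0


class DiskSpaceMonitor:
    """Runs a cleanup callback periodically while the daemon is up."""

    def __init__(self, settings: DiskSpaceSettings,
                 clean: Callable[[int], Awaitable[None]]) -> None:
        self.settings = settings
        self._clean = clean
        self._task: Optional[asyncio.Task] = None

    async def _run(self) -> None:
        while True:
            await self._clean(self.settings.max_blob_space_mb)
            await asyncio.sleep(self.settings.check_interval_seconds)

    async def start(self) -> None:
        # Only schedule the loop when a limit is configured.
        if self.settings.max_blob_space_mb > 0 and self._task is None:
            self._task = asyncio.create_task(self._run())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected when the loop is cancelled
            self._task = None
```

The cleanup callback is injected so the loop stays independent of whichever eviction policy ends up being chosen.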

Some of your other code might be useful as scripts (in the scripts/ dir) but that's outside the scope of this particular issue.

shyba commented 2 years ago

I think this can be closed. The main part of monitoring and cleaning is done for blobs, which is safer since it lives inside the SDK data folder.

IMO, when you click download for real (it isn't the default anymore) and get a real file in the Downloads folder, it is now your file and you need to manage it using your OS features. It would be weird to delete files from the Downloads folder automatically. However, just showing the usage should be good, which makes me think #1171 should be updated to report total file sizes, as we currently do for total blob space.

That said, if we really want the same for downloaded files I think we should update desktop#4634 so it becomes a feature request for the full file case. I think the same applies to feature requests that are extras or need discussion, such as new eviction policies, pinning files, what to do when it is full during a download, etc.

lyoshenka commented 2 years ago

what does "download for real" mean?

i mostly agree with you. more generally, we keep running into this problem of having two copies of every piece of data: the blobs, and the file. what's the general solution to that? should we stop storing blobs at all and just store the file (this is what torrent apps do)? should we stop storing the files and only store the blobs, and you have to actively request to save a decrypted file to some external location that the SDK does not manage? something else?

doing the former means the SDK has a narrower scope. it just downloads and seeds content. doing the latter means the SDK is also a file viewer/player, or at least the app must be.

shyba commented 2 years ago

"download for real" means calling file_save or setting save_files to true, which creates a normal file in the Downloads folder.

should we stop storing the files and only store the blobs, and you have to actively request to save a decrypted file to some external location that the SDK does not manage?

From my understanding, this is the current behavior as save_files defaults to false. We are also able to stream from the blobs and the app plays from that.