oxinabox / DataDeps.jl

reproducible data setup for reproducible science
Other
150 stars 43 forks source link

Garbage collection functionality #150

Open rofinn opened 2 years ago

rofinn commented 2 years ago

It'd be really cool if DataDeps had a gc function which could identify old datadeps that aren't referenced/needed anymore and suggest deleting them. I'm not quite sure how it work, but it wouldn't need to be perfect. It could just make a best guess, ask the user, and they can always redownload things they actually need.

oxinabox commented 2 years ago

Yes, I have some thoughts on this:

Some options:

  1. just use Artifacts.jl instead of DataDeps.jl.
  2. We workout how to hook into what Artifacts use for this.
  3. Do something like generate Artifacts on the fly from DataDep after they are downloaded.
rofinn commented 2 years ago

Do you mean Artifacts.jl in Pkg.jl? FWIW, I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

oxinabox commented 2 years ago

Yeah, is Artfiacts part of Pkg right now? I thought it was a seperate stdlib.

I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

Agreed. But option 2 or 3 might let us still do that.

oxinabox commented 2 years ago

I think maybe the dynamic artifact creation in OhMyArtifacts.jl might work to achieve option 3. but I haven't looked closely. (cc @chengchingwen)

chengchingwen commented 2 years ago

I think it's doable but would take some effort to integrate the two. There are some problems and behavior differences need to be resolved. Things like:

  1. What do we want to bind the life-time of a DataDep to? OhMyArtifacts.jl (also Scratch.jl) bind to the Project(.toml) who call it, so if we want to build on top of that, we would need to figure out who defined that DataDep. Otherwise all the DataDep will live until DataDeps.jl been removed.
  2. It doesn't seem to be possible to migrate the old downloaded DataDep, should we just delete them all?
  3. DataDep name need to be unique, but entry name in Artifacts.jl can bind to different artifact.
  4. Currently each entry in OhMyArtifacts.jl is a single file (instead of a directory), I would need to come up with a better design for holding the directory structure.

OTOH using Scratch.jl might probably be the most effortless and behavior consistent approach (though 1. 2. remains), we can just store each DataDep in a scratchspace folder and bind the lifetime to the caller project with get_scratch!(DataDeps, datadep_name, THAT_PACKAGE) and somehow allow DataDeps.register or DataDep constructor to pass in the caller package. Then when that package get GCed, the scratchspace will also be removed.

chengchingwen commented 2 years ago

@oxinabox I just tag a new release for OhMyArtifacts.jl. It should be able to handle directory now. I can try to make a PR to switch the backend to it, do you have a pointer on what files I would need to modified?

oxinabox commented 1 year ago
  1. We can work this out by using the __location__ from where the datadep macro is called I think then searching for a Project.toml. Though really i guess we want the location where register was called. For hacks we could use the stacktrace() and go find it?

  2. No leave them alone. But we could put something in the release notes

  3. It is a flaw in DataDeps that they need to be globally unique, since it means clashes are possible. Unlike with the smarter content hashing that Artifacts use. One thing we do need to preserve is the ability to access datadeps via a global name (though that is allowed to error when there is a clash I guess), since i am sure some people install a package and then access the datadeps it registers by name.

Maybe scratch spaces, as you say are the better way to do this.