oxinabox / DataDeps.jl

reproducible data setup for reproducible science

Google Drive download helper #136

Closed heinrichreimer closed 1 year ago

heinrichreimer commented 3 years ago

Datasets are often distributed on Google Drive. That's a problem because Google requires a confirmation step before downloading large files (i.e., files it doesn't scan for malware). Transformers.jl already has a custom fetch_method implementation for that case. So I wonder if it might be worth including that helper method in DataDeps.jl, possibly integrating it so that fetch_method doesn't have to be used at all.

oxinabox commented 3 years ago

Yes, downloading things from Google Drive is a thing people do.

Embeddings.jl uses GoogleDrive.jl in a similar way. I think it is broadly similar to the code that is inside Transformers.jl: https://github.com/JuliaText/Embeddings.jl/blob/306c04bead62b32873dedbc2609c74c4ca34306b/src/Paragram.jl#L31

I don't see any reason to have it in this package. It would be more useful to have it in another suitable package (like GoogleDrive.jl, or some new package if you want to start from scratch) that can do this and likely more (e.g. writing). That package can then be made to work with DataDeps.jl.

That could look like AWSS3.jl, which provides the S3Path type. S3Path works with DataDeps without needing to specify fetch_method because it overloads Base.basename and Base.download. These two things are all that is required to work with DataDeps without a fetch method: https://github.com/oxinabox/DataDeps.jl/blob/85f28c1a3e577c892a2fde6a40bab3f1ab6de451/src/fetch_helpers.jl#L51-L60 More broadly: it would be really cool if someone implemented the FilePathsBase API for Google Drive.
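To make that contract concrete, here is a rough sketch. It assumes the fallback fetch does little more than join the target directory with basename(remote_path) and then call Base.download. The names fetch_sketch and LocalSource are hypothetical and exist only for illustration; this is not the actual DataDeps.jl code, and the toy type copies a local file so the example runs without network access.

```julia
# Hypothetical stand-in for a fallback fetch: it needs nothing from the
# remote path except `basename` and `Base.download`.
function fetch_sketch(remote_path, local_dir)
    localpath = joinpath(local_dir, basename(remote_path))
    return Base.download(remote_path, localpath)
end

# Hypothetical toy "remote" type. It wraps a file that is already on disk.
struct LocalSource
    path::String
end

# The two overloads that the fallback fetch above relies on.
Base.basename(s::LocalSource) = basename(s.path)

# "Downloading" is just a copy here; `cp` returns the destination path.
Base.download(s::LocalSource, localpath::AbstractString) =
    cp(s.path, localpath; force=true)
```

Any type that defines these two methods can be passed to fetch_sketch unchanged, which is the property that lets a path type like S3Path skip a custom fetch_method.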

Another reason I wouldn't want it here is that I don't want to take on dependencies, nor do I want to take on the maintenance burden.

heinrichreimer commented 3 years ago

So possibly a GoogleDriveFile("0B9w48e1rj-MOLVdZRzFfTlNsem8") struct could be added to GoogleDrive.jl that overloads Base.basename and Base.download? If that would then work out of the box with DataDeps.jl, I agree that should be the preferred way.
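A minimal sketch of what such a struct might look like, under several assumptions: GoogleDriveFile and download_url are hypothetical names that do not exist in GoogleDrive.jl, the URL pattern is an assumed direct-download form, and the file ID is used as the local file name because the real name is not known without querying Google. The sketch does not handle the confirmation step for large files; a real implementation would need to deal with that, for example by delegating to the existing download logic in GoogleDrive.jl.

```julia
using Downloads  # standard library since Julia 1.6

# Hypothetical type; not part of GoogleDrive.jl or DataDeps.jl.
struct GoogleDriveFile
    id::String
end

# Assumed direct-download URL pattern for a file ID.
download_url(f::GoogleDriveFile) =
    "https://drive.google.com/uc?export=download&id=" * f.id

# Without asking Google for the real file name, fall back to the ID.
Base.basename(f::GoogleDriveFile) = f.id

# Plain HTTP download. This does NOT handle the large-file confirmation page.
function Base.download(f::GoogleDriveFile, localpath::AbstractString)
    Downloads.download(download_url(f), localpath)
    return localpath
end
```

With these two methods defined, a GoogleDriveFile value could in principle be passed as the remote path of a DataDep in place of a URL string, with no fetch_method specified.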

oxinabox commented 3 years ago

yeah that would be great

ggebbie commented 2 years ago

Thank you for discussing the issue with downloading large files from Google Drive. I also think it would be really cool to add this code to GoogleDrive.jl, as I can't even download a 43 MB file without virus scanner interference. I have looked at the suggestions above, but I didn't immediately grasp how to write this code myself.

heinrichreimer commented 1 year ago

Closing due to inactivity.