Use cloudpathlib for the storage API?

lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

http://lithops.cloud

Apache License 2.0

317 stars, 105 forks

Use cloudpathlib for the storage API? #1386

Open TomNicholas opened 2 months ago

TomNicholas commented 2 months ago

@cisaacstern and I were wondering what the rationale was for the lithops project writing its own implementation of the lithops cloud proxy storage API.

There are other libraries that deal with this problem already - why not use one of them?

In particular one I like is cloudpathlib, which provides classes that deliberately follow the same interface as the python pathlib standard library module, but for different cloud storage providers. It seems to me that your lithops.storage.CloudFileProxy class could possibly just be replaced with cloudpathlib.CloudPath?

JosepSampe commented 2 months ago

Hi @TomNicholas, nice catch! Just note that this API has always been treated as a research prototype due to the ease of use provided by Lithops. In Lithops, we've always pursued simplicity in cloud usage. The main API has consistently been the Storage API, but we've realized that users who want to use Lithops can easily transition to an API they already know, that's why we implemented it.

You may have noticed that we also implemented the multiprocessing.Process API. Other projects like Ray or Dask have also implemented similar functionalities, but this doesn't mean we cannot implement our own. There are likely tens of frameworks that do similar things, but with Lithops, it's quite easy to implement and allows users to change just one line of code without learning a new API.

When we began developing the cloud proxy internally in 2019, the framework you mentioned didn't exist. Our intention was simply to provide more APIs familiar to users for adoption. I agree the API you mentioned looks good, but in any case, there is probably more than one API attempting to achieve the same goal. We developed it, but our focus has always been on the main Storage API and, of course, the Compute API.

TomNicholas commented 2 months ago

Thanks for the quick response @JosepSampe !

> When we began developing the cloud proxy internally in 2019, the framework you mentioned didn't exist.

That's totally reasonable, but I'm not clear from your answer whether you would be for or against using cloudpathlib now?

> users who want to use Lithops can easily transition to an API they already know, that's why we implemented it.

> I agree the API you mentioned looks good, but in any case, there is probably more than one API attempting to achieve the same goal.

These are the reasons I like cloudpathlib! I notice that your existing storage API interface is inspired by the os module in the python standard library, so that it is familiar to users. The cloudpathlib CloudPath object follows the interface of the Path object in the pathlib module in the python standard library, so that it is familiar to users!
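To make the comparison concrete, here is a minimal sketch contrasting the two standard-library styles on a local temporary directory. It uses only `os` and `pathlib`; the file name `data/report.txt` and both helper functions are made up for illustration, and no cloud call is made. It only shows the `os`-style versus `Path`-style difference, not cloudpathlib or the Lithops storage API themselves.

```python
import os
import tempfile
from pathlib import Path


def write_and_read_os_style(root: str) -> str:
    """os-module style: paths are plain strings passed to free functions."""
    target = os.path.join(root, "data", "report.txt")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write("hello")
    with open(target) as f:
        return f.read()


def write_and_read_pathlib_style(root: str) -> str:
    """pathlib style: the path is an object that carries its own methods."""
    target = Path(root) / "data" / "report.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("hello")
    return target.read_text()


# Run both variants in separate temporary directories.
with tempfile.TemporaryDirectory() as tmp_os, tempfile.TemporaryDirectory() as tmp_pl:
    os_result = write_and_read_os_style(tmp_os)
    pathlib_result = write_and_read_pathlib_style(tmp_pl)
```

Both functions do the same work; the difference is that the `pathlib` version keeps the path and its operations together in one object, which is the interface style I am arguing for.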

The other reason it's nice is that the two projects seem to care about exactly the same scope: abstracting away details of different cloud providers by providing a common interface, but not trying to extend that interface to work in non-cloud contexts. It seems to me that what lithops aims to do for cloud serverless APIs, cloudpathlib aims to do for cloud storage APIs, so the two projects might fit naturally together.

cisaacstern commented 2 months ago

Thanks to @TomNicholas for voicing the question and @JosepSampe for the fast engagement!

The only thing I think I'd add, as a relative newcomer to Lithops, is that the documentation placing a (roughly) equal emphasis on the Storage API alongside the Compute API has been hard for me to understand.

Perhaps I misunderstand the focus of the project, but from my naive newcomer perspective, it feels to me that in 2024 at least, the truly novel and unique contribution of Lithops is the Compute API, which other packages such as Dask etc do not offer a replica of (even if they try to solve similar user problems, they do it in different ways, with different tradeoffs).

I am not aware of another fully OSS Python project that offers a seamless abstraction over both local multiprocessing and cloud-agnostic serverless parallel data processing. This uniqueness and the elegance of the Compute API implementation is what has led me to recommend we adopt Lithops as the core framework for a contract I am currently working on.

By contrast, to me it has seemed that the Storage API mostly exists as an enabler of the uniqueness of the Compute API... (am I misunderstanding, for example, that it's used to facilitate storage monitoring?)...

Have I misunderstood the relationship of these components?

abourramouss commented 2 months ago

After developing several workflows with Lithops I think it would be a great addition. In my opinion, one of the most painful things with lithops is the Storage layer, not because of the API, but because you need to work with files after downloading them to a remote worker, and that means using the OS storage api.

One thing I've talked about with @danielBCN was the ability to have some sort of abstraction or object that you can instantiate and pass around; this object would be a lazy/transparent representation of some path in object storage.
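A rough sketch of what such an object might look like, purely as an illustration: the class name `LazyObjectPath`, its methods, and the in-memory `FAKE_STORE` dict are all hypothetical stand-ins, not part of Lithops or cloudpathlib. The point is only that building the path transfers nothing, and the data is fetched once on first read.

```python
from typing import Dict, Optional

# Stand-in for an object store: maps "bucket/key" to bytes.
# A real backend would call a cloud SDK here instead (assumption).
FAKE_STORE: Dict[str, bytes] = {"my-bucket/inputs/a.txt": b"payload"}


class LazyObjectPath:
    """Hypothetical lazy handle to an object; nothing is fetched at construction."""

    def __init__(self, bucket: str, key: str, store: Dict[str, bytes]):
        self.bucket = bucket
        self.key = key
        self._store = store
        self._cache: Optional[bytes] = None
        self.fetch_count = 0  # exposed only so the laziness is observable

    def __truediv__(self, part: str) -> "LazyObjectPath":
        # Path-like joining, as in pathlib: parent / "child"
        new_key = f"{self.key.rstrip('/')}/{part}" if self.key else part
        return LazyObjectPath(self.bucket, new_key, self._store)

    def read_bytes(self) -> bytes:
        # Fetch on first access, then serve from the local cache.
        if self._cache is None:
            self._cache = self._store[f"{self.bucket}/{self.key}"]
            self.fetch_count += 1
        return self._cache


root = LazyObjectPath("my-bucket", "inputs", FAKE_STORE)
obj = root / "a.txt"        # no data transferred yet
first = obj.read_bytes()    # triggers the single fetch
second = obj.read_bytes()   # served from the cache
```

A worker could receive `obj` as an argument and only pay for the download if it actually reads the contents.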

At some point I've also tried to create an adapter between cloudpathlib and lithops: https://github.com/abourramouss/cloudpathlib-lithops-adapter

gilv commented 2 months ago

@abourramouss @cisaacstern @TomNicholas I understand there might be different APIs for the Storage layer. We implemented our own in Lithops; you suggest another way. What is the rationale for your suggestion? Do you have some specific use case where the existing API doesn't work? Or is it just a matter of convenience? In any case, even if we implement another Storage API, all changes should be backward compatible and Lithops will need to have 2 Storage APIs... We can't just replace the existing one with a new one and that's it.

aitorarjona commented 2 months ago

Hi, IMO this is quite straightforward. First, it is true that the storage API was implemented when smart_open or cloudpathlib did not exist yet, and now it would be costly to refactor and get equivalent functionality. But if someone needs some cloudpathlib functionality that lithops storage does not implement (or simply prefers it because it's more robust or efficient, for instance), it can be installed in the runtime and used without any issue in the code. Both can coexist and be used when needed for each specific use case.

TomNicholas commented 2 months ago

> What is the rationale for your suggestion?

My rationale is (a) convenience for the user (pathlib is nicer to work with than os) and (b) convenience for lithops developers in the long term (not having to maintain the Storage Layer yourselves). There may be other advantages too.

> In any case, even if we implement another Storage API, all changes should be backward compatible and Lithops will need to have 2 Storage APIs... We can't just replace the existing one with a new one and that's it

Of course there is always a trade-off with any refactoring. But if we felt that this idea would really save effort in the long term, it's not impossible to make a switch: you just have a (long) deprecation cycle, i.e.:

1. add the path-like implementation of the storage layer,
2. use the new storage layer internally,
3. add DeprecationWarnings when importing the original storage layer API,
4. then after a sufficiently long time for the users to make the change (e.g. 6-12 months for such a big change) you actually remove the original API.
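The warning step could look roughly like the sketch below. Note the assumptions: it warns when the old function is called rather than at import time (simpler to show in one snippet), and `deprecated`, `old_join`, and `new_join` are hypothetical names, not anything in Lithops. Only the standard-library `warnings` and `functools` modules are used.

```python
import functools
import warnings


def deprecated(replacement: str):
    """Decorator that emits a DeprecationWarning each time the wrapped callable is used."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not this wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def new_join(*parts: str) -> str:
    """Hypothetical path-like replacement."""
    return "/".join(p.strip("/") for p in parts)


@deprecated("new_join")
def old_join(*parts: str) -> str:
    """Hypothetical original API, kept working during the deprecation window."""
    return new_join(*parts)


# Capture warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_join("bucket", "key")
```

The old entry point keeps working and simply delegates to the new one, so existing user code is not broken until the final removal step.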

Totally up to you if you (or others) think this is worthwhile, but it can be done without keeping two APIs around indefinitely.

EDIT: Another thought: Arguably the best time to make this kind of change is earlier on in a project's life span, when you have fewer users who will be impacted.