Unify data creation, opening, saving, downloading and uploading

jstriebel commented 3 years ago

We should have a single clear way to create, open, save, download and upload the core webknossos objects, such as datasets, skeletons, annotations, … Currently they differ quite a lot and have contradicting uses (e.g. the Dataset constructor opens existing Files, the Skeleton constructor creates new ones).

Some suggestions what we could use (in the order of my personal preference):

Create

using the constructor (currently the case for wk.Skeleton(…))
using a staticmethod on the class, e.g. wk.Skeleton.create()…
using a function such as wk.skeleton(…) (similar to np.array to create a np.ndarray)

Open & Download

I'd argue that opening and downloading should have a similar API, e.g.

staticmethod on the class, e.g. wk.Skeleton.open()
using a function such as wk.open_nml or wk.open_skeleton
using the constructor (if this is also used for creating, we need to dispatch based on the arguments which might lead to confusing signatures)

For downloads: Smooth out what can be specified in the download calls: urls, ids, other orgas, additional sharing tokens, …

Save & Upload

Saving & uploading should both happen on the existing object IMO, e.g. my_skeleton.upload(…) & my_skeleton.save/my_skeleton.write

Besides the names of the functions/methods, also the signatures should be kept similar where it makes sense, especially for opening, downloading & uploading. E.g. download might parse a URL, but also might just accept a single id which is associated with the default organization behind the scenes, or also accept an organization explicitly as an argument.

philippotto commented 3 years ago

I agree with everything you said and also prefer the first option of each use case :+1:

normanrz commented 3 years ago

The Dataset class, which is the most used entrypoint of the library uses Dataset(..) for open (and create if not available) and Dataset.create(..) for create. Maybe we should make the others like this instead of changing this API?

jstriebel commented 2 years ago

@normanrz and I agreed on using the methods as proposed above, but need specific names. Here's a table with some options, the once to discuss are mentioned below:

	new	open	save	download	upload
Dataset	`wk.Dataset(…)`	`wk.Dataset.open(…)`	permanently persisted	`wk.Dataset.download(…)`	`dataset_obj.upload(…)`
Annotation	`wk.Annotation(…)`	`wk.Annotation.open(…)`	`annotation_obj.save(…)`	`wk.Annotation.download(…)`	`annotation_obj.upload(…)`
Skeleton	`wk.Skeleton(…)`	`wk.Annotation.open(…)`	`skeleton_obj.save(…)`	only within an annotation	only within an annotation
…

Namespacing the loading methods (open, download) via the class instead of having the object-type in the top-level function name seems quite useful. Some libs (e.g. pandas) have those as top-level functions, but also don't produce so many different object-types and can infer them during creation. Alternatively, we could also namespace open and download functions via modules, e.g. wk.dataset.open(…).

Typical CRUD objects, such as of the User class should be treated similarly, e.g. wk.User.get_current_user() to get the logged-in user, or wk.Task(…).upload to create and upload a new Task.

We'll start with support for ids and organization names for up- & download everywhere first, but will also try to overload the methods to accept urls.

For backwards-compatibility in datasets, we'll keep the Dataset.create method for now with a deprecation warning, as well as trying to overload the constructor to accept a path to open.

The verbs for open and save are very much up for discussion. Some options (can also be mixed and matched between those pairs):

open, save
read, write
load, dump
from, to

Also, it's not clear if paths or files or both should be used, e.g.

wk.Dataset.open("my_path")

vs.

with open("my_path") as f:
    wk.Dataset.open(f)

Probably simply both?

We might also use suffixes to indicate a special format (e.g. dump_nml), but I'd rather infer this from the filename.

Some considerations: from and to are used to indicate type-conversions (in-memory, not in-place) of the bounding box and are chosen on purpose (@fm3 had great reasons I mostly forgot). Other libs use various verbs and sometime strange combinations. Some references:

pandas: read_csv (previously from_csv), to_csv, allowing both paths and files
numpy: loadtxt or genfromtxt or load, savetxt or save, allowing both paths and files
Python files: read and write on files
Python pickle/json, ruamel yaml, …: load and dump using files
zarr: load/save to interact with numpy arrays, open to open as a zarr array that is always persisted (similar to our Dataset)

I tend towards load/save or load/dump, since I find the in-memory conversion semantics of to/from convincing (e.g. skeleton.to_annotation() or Annotation.from_skeleton(…)) and open is already a built-in keyword with it's own semantic, same for read and write on files.

We might make the distinction between loading to have a complete in-memory representation which must be dumped for saving, or opening a dataset which is always persisted. Then we might use open for Datasets and load for Annotations and Skeletons. However, this might be more confusing for users than it helps to communicate the distinction between the concepts. (Just might come in handy when we start adding in-memory datasets, then we could have both Dataset.open and Dataset.load)

Any opinions on those topics or other feedback? cc @normanrz, @hotzenklotz, @philippotto

PS: Some future classes we might introduce which should work with the same concepts: Mesh, Mapping, and maybe VolumeAnnotation as it's own class which can be persisted, Project, Task, TaskType, Jobs, etc, which are rather CRUD objects that cannot be stored locally, similar to the current User.

philippotto commented 2 years ago

Thank you for the write-up, @jstriebel :+1: Lots of valid points in there.. I've got some comments:

open, save

In my opinion, the pendant to open should always be close, unless there is a very strong reason (which I'm not seeing here).

from and to are used to indicate type-conversions (in-memory, not in-place) [...]

Agree :+1:

I tend towards load/save or load/dump [...]

We might make the distinction between loading to have a complete in-memory representation which must be dumped for saving, or opening a dataset which is always persisted. Then we might use open for Datasets and load for Annotations and Skeletons. However, this might be more confusing for users than it helps to communicate the distinction between the concepts.

load/save sound good for annotations and co, but for datasets which are on-disk the naming doesn't really make sense in my opinion (why should I call save when writes are flushed automatically?). Therefore, the logical consequence would be to make the distinction exactly as you wrote (between full-in-memory vs full-on-disk).

However, this might be more confusing for users than it helps to communicate the distinction between the concepts.

In the simplest case, users would simply look up how to load/open/save stuff in the docs and they should be okay with it. If the difference in naming can be explained (i.e., one category is loaded completely to memory vs the other one), this seems like a better choice than to use the same vocabulary even though half of it does not make sense (i.e., "save" in the context of datasets).

load/save or load/dump

From intuition, I would assume that dump always needs a target path. save on the other hand could mean both "save the changes of the currently opened annotation)" and "save it here". Therefore, I find save more elegant as it can express both intentions. If one wants to differentiate (without method overloading), one could use save and save_as. No strong opinion here, but in general I'd vote against dump.

normanrz commented 2 years ago

I agree with Dataset.open and load/save otherwise. (That was what you both also advocated for, right?)

The CRUD classes should be discussed separately imo.

scalableminds / webknossos-libs