I agree with everything you said and also prefer the first option of each use case :+1:

The `Dataset` class, which is the most-used entrypoint of the library, uses `Dataset(..)` to open (and to create if not available) and `Dataset.create(..)` to create. Maybe we should make the others like this instead of changing this API?
@normanrz and I agreed on using the methods as proposed above, but we need specific names. Here's a table with some options; the ones to discuss are mentioned below:
| | new | open | save | download | upload |
|---|---|---|---|---|---|
| Dataset | `wk.Dataset(…)` | `wk.Dataset.open(…)` | permanently persisted | `wk.Dataset.download(…)` | `dataset_obj.upload(…)` |
| Annotation | `wk.Annotation(…)` | `wk.Annotation.open(…)` | `annotation_obj.save(…)` | `wk.Annotation.download(…)` | `annotation_obj.upload(…)` |
| Skeleton | `wk.Skeleton(…)` | `wk.Annotation.open(…)` | `skeleton_obj.save(…)` | only within an annotation | only within an annotation |
| … | | | | | |
Namespacing the loading methods (open, download) via the class instead of having the object-type in the top-level function name seems quite useful. Some libs (e.g. pandas) have those as top-level functions, but they also don't produce so many different object-types and can infer them during creation. Alternatively, we could also namespace the open and download functions via modules, e.g. `wk.dataset.open(…)`.
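To make the table concrete, here is a rough usage sketch of the first option. All method names follow the proposal above and are not the current API; paths and ids are placeholders:

```python
import webknossos as wk

# datasets: opening and downloading are namespaced via the class
ds = wk.Dataset.open("path/to/dataset")            # proposed name, not final
remote_ds = wk.Dataset.download("<dataset-id-or-url>")

# annotations: saving and uploading happen on the object itself
annotation = wk.Annotation.open("annotation.zip")  # verb still up for discussion
annotation.save("annotation_copy.zip")
annotation.upload()
```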
Typical CRUD objects, such as the `User` class, should be treated similarly, e.g. `wk.User.get_current_user()` to get the logged-in user, or `wk.Task(…).upload` to create and upload a new Task.
We'll start by supporting ids and organization names for up- & download everywhere, but will also try to overload the methods to accept URLs.
For backwards-compatibility in datasets, we'll keep the `Dataset.create` method for now with a deprecation warning, and we'll also try to overload the constructor to accept a path to open.
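A minimal sketch of how such a backwards-compatible shim could look (purely illustrative, not the actual implementation; argument names are made up):

```python
import warnings
from pathlib import Path


class Dataset:
    def __init__(self, dataset_path, *args, **kwargs):
        # proposed overload: open the dataset if the path already exists,
        # otherwise create it
        self.path = Path(dataset_path)
        ...

    @classmethod
    def create(cls, dataset_path, *args, **kwargs):
        # kept only for backwards-compatibility
        warnings.warn(
            "Dataset.create(...) is deprecated, please use Dataset(...) instead.",
            DeprecationWarning,
        )
        return cls(dataset_path, *args, **kwargs)
```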
The verbs for open and save are very much up for discussion. Some options (which can also be mixed and matched between those pairs): `open`/`save`, `load`/`save`, `load`/`dump`, `read`/`write`.
Also, it's not clear if paths or files or both should be used, e.g.

```python
wk.Dataset.open("my_path")
```

vs.

```python
with open("my_path") as f:
    wk.Dataset.open(f)
```

Probably simply both?
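Supporting both would probably just mean dispatching on the argument type, roughly like this (hypothetical helper, not part of the library):

```python
from pathlib import Path
from typing import IO, Union


def read_nml_bytes(source: Union[str, Path, IO[bytes]]) -> bytes:
    # hypothetical: accept either a path-like or an already-opened file object
    if isinstance(source, (str, Path)):
        with open(source, "rb") as f:
            return f.read()
    return source.read()
```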
We might also use suffixes to indicate a special format (e.g. `dump_nml`), but I'd rather infer this from the filename.
Some considerations: `from` and `to` are used to indicate type-conversions (in-memory, not in-place) of the bounding box and were chosen on purpose (@fm3 had great reasons I mostly forgot). Other libs use various verbs and sometimes strange combinations. Some references:
- pandas: `read_csv` (previously `from_csv`), `to_csv`, allowing both paths and files
- numpy: `loadtxt` or `genfromtxt` or `load`, `savetxt` or `save`, allowing both paths and files
- built-in `read` and `write` on files
- pickle/json: `load` and `dump` using files
- zarr: `load`/`save` to interact with numpy arrays, `open` to open as a zarr array that is always persisted (similar to our Dataset)

I tend towards `load`/`save` or `load`/`dump`, since I find the in-memory conversion semantics of `to`/`from` convincing (e.g. `skeleton.to_annotation()` or `Annotation.from_skeleton(…)`), and `open` is already a built-in with its own semantics, same for `read` and `write` on files.
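As a hypothetical sketch of that split, using the names mentioned in this thread (none of them final):

```python
import webknossos as wk

# persistence always goes through a path (or file); the exact verb is still open
skeleton = wk.Skeleton.load("tree.nml")   # or Skeleton.open(...) / Skeleton.read(...)
skeleton.save("tree_copy.nml")            # or skeleton.dump(...)

# in-memory conversions use to_*/from_*, nothing is written to disk
annotation = wk.Annotation.from_skeleton(skeleton)  # as mentioned above
same_annotation = skeleton.to_annotation()
```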
We might make the distinction between loading, which produces a complete in-memory representation that must be dumped for saving, and opening a dataset, which is always persisted. Then we might use `open` for Datasets and `load` for Annotations and Skeletons. However, this might be more confusing for users than it helps to communicate the distinction between the concepts. (It might just come in handy when we start adding in-memory datasets; then we could have both `Dataset.open` and `Dataset.load`.)
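In code, the open/load distinction would look roughly like this (hypothetical sketch, assuming the split described above):

```python
import webknossos as wk

# Dataset: always persisted, writes go to disk immediately -> "open"
dataset = wk.Dataset.open("path/to/dataset")

# Annotation: complete in-memory representation -> "load" plus an explicit save
annotation = wk.Annotation.load("annotation.zip")
annotation.name = "reviewed"       # only changes the in-memory object
annotation.save("annotation.zip")  # persists the changes
```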
Any opinions on those topics or other feedback? cc @normanrz, @hotzenklotz, @philippotto
PS: Some future classes we might introduce which should work with the same concepts: `Mesh`, `Mapping`, and maybe `VolumeAnnotation` as its own class which can be persisted; also `Project`, `Task`, `TaskType`, `Jobs`, etc., which are rather CRUD objects that cannot be stored locally, similar to the current `User`.
Thank you for the write-up, @jstriebel :+1: Lots of valid points in there. I've got some comments:
> open, save

In my opinion, the counterpart to `open` should always be `close`, unless there is a very strong reason otherwise (which I'm not seeing here).
> from and to are used to indicate type-conversions (in-memory, not in-place) [...]
Agree :+1:
> I tend towards load/save or load/dump [...]
>
> We might make the distinction between loading to have a complete in-memory representation which must be dumped for saving, or opening a dataset which is always persisted. Then we might use open for Datasets and load for Annotations and Skeletons. However, this might be more confusing for users than it helps to communicate the distinction between the concepts.
`load`/`save` sounds good for annotations and co, but for datasets, which live on disk, the naming doesn't really make sense in my opinion (why should I call `save` when writes are flushed automatically?). Therefore, the logical consequence would be to make the distinction exactly as you wrote (between fully-in-memory vs. fully-on-disk).
> However, this might be more confusing for users than it helps to communicate the distinction between the concepts.
In the simplest case, users would simply look up how to load/open/save stuff in the docs and they should be okay with it. If the difference in naming can be explained (i.e., one category is loaded completely into memory while the other is not), this seems like a better choice than using the same vocabulary even though half of it does not make sense (i.e., "save" in the context of datasets).
> load/save or load/dump

From intuition, I would assume that `dump` always needs a target path. `save`, on the other hand, could mean both "save the changes of the currently opened annotation" and "save it here". Therefore, I find `save` more elegant as it can express both intentions. If one wants to differentiate (without method overloading), one could use `save` and `save_as`. No strong opinion here, but in general I'd vote against `dump`.
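For illustration, the `save`/`save_as` split could look something like this (hypothetical sketch, not the actual API):

```python
class Annotation:
    def __init__(self):
        self._path = None  # remembered once the annotation was loaded or saved

    def save_as(self, path):
        # always requires an explicit target path
        self._path = path
        ...  # write to disk

    def save(self, path=None):
        # without a path: "save the changes of the currently opened annotation";
        # with a path: behaves like save_as
        target = path if path is not None else self._path
        if target is None:
            raise ValueError("No known path yet, use save_as(path) first.")
        self.save_as(target)
```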
I agree with `Dataset.open` and `load`/`save` otherwise. (That was what you both also advocated for, right?)
The CRUD classes should be discussed separately imo.
We should have a single, clear way to create, open, save, download, and upload the core webknossos objects, such as datasets, skeletons, annotations, … Currently they differ quite a lot and have contradicting uses (e.g. the `Dataset` constructor opens existing files, while the `Skeleton` constructor creates new ones).
Some suggestions for what we could use (in the order of my personal preference):
**Create**

- `wk.Skeleton(…)`
- `wk.Skeleton.create()`
- …
- `wk.skeleton(…)` (similar to `np.array` to create an `np.ndarray`)

**Open & Download**
I'd argue that opening and downloading should have a similar API, e.g.

- `wk.Skeleton.open()`
- `wk.open_nml` or `wk.open_skeleton`

For downloads: smooth out what can be specified in the download calls: URLs, ids, other organizations, additional sharing tokens, …
**Save & Upload**

Saving & uploading should both happen on the existing object IMO, e.g. `my_skeleton.upload(…)` & `my_skeleton.save`/`my_skeleton.write`.
Besides the names of the functions/methods, the signatures should also be kept similar where it makes sense, especially for opening, downloading & uploading. E.g. download might parse a URL, but might also just accept a single id which is associated with the default organization behind the scenes, or accept an organization explicitly as an argument.
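For example, a download classmethod could accept either a full URL or a bare id plus an optional organization/sharing token, roughly like this (hypothetical signature, for illustration only):

```python
from typing import Optional


class Annotation:
    @classmethod
    def download(
        cls,
        annotation_id_or_url: str,
        organization: Optional[str] = None,   # falls back to the default organization
        sharing_token: Optional[str] = None,
    ) -> "Annotation":
        if annotation_id_or_url.startswith(("http://", "https://")):
            ...  # parse id, organization, etc. from the URL
        else:
            ...  # treat the argument as a bare id in the given/default organization
        raise NotImplementedError  # sketch only
```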