glotzerlab / signac

Manage large and heterogeneous data spaces on the file system.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Replace data space with workspace in docstrings #743

Open cbkerr opened 2 years ago

cbkerr commented 2 years ago

New focus is on pinning down what "data space" means

https://github.com/glotzerlab/signac/issues/743#issuecomment-1100005467

Original issue description

Summary

Consider the following analogy: "The directory of the job's workspace is to job as the directory of the project's workspace is to project." It is currently false!! Fixing this would break things, making it a good candidate for 2.0.

The fix would make the following analogies true:

Problem Details

What we have now is: "Directory of the job's workspace is to job as directory of the project is to project." (If your head is spinning like mine was, read those again after reading the rest of the issue)

Some example usages in the documentation:

Here is an illustration of the problem. When developing the dashboard, displaying an image from a job or project requires getting the job or project directory in a general way. The way I found to do this was job_or_project.fn(""), because the two currently have separate syntax: job.workspace() and project.root_directory(). Both are aliased to .path().
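A minimal sketch of that workaround follows. StandInJob, StandInProject, and directory_of are hypothetical names written only for this illustration; they are not the signac classes, and the assumption that fn joins a file name onto the object's directory is taken from how fn("") is used above.

```python
import os
import tempfile


class StandInJob:
    """Hypothetical stand-in for a job; not the signac class."""

    def __init__(self, path):
        self._path = path

    def workspace(self):
        return self._path

    def fn(self, filename):
        # Join a file name onto the directory that belongs to this object.
        return os.path.join(self._path, filename)


class StandInProject:
    """Hypothetical stand-in for a project; not the signac class."""

    def __init__(self, path):
        self._path = path

    def root_directory(self):
        return self._path

    def fn(self, filename):
        return os.path.join(self._path, filename)


def directory_of(job_or_project):
    """Return the directory of either object without checking its type."""
    # fn("") yields the directory plus a trailing separator; normpath strips it.
    return os.path.normpath(job_or_project.fn(""))


root = tempfile.mkdtemp()
project = StandInProject(root)
job = StandInJob(os.path.join(root, "workspace", "abc123"))

project_dir = directory_of(project)
job_dir = directory_of(job)
print(project_dir)
print(job_dir)
```

The caller never branches on whether it holds a job or a project, which is the same convenience a shared path accessor would give directly.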

Solution

  1. Replace "job workspace", "job workspace directory", and job.workspace() with "job directory" or job.path() (also accessible with job.fn("")). The job directory is a directory containing files associated with a signac job. Currently job.workspace() is an alias for the job path. I prefer writing "job directory" rather than "job path" in the documentation, even if you would write job.path() in the code, because a directory is a container, which is a distinct concept from the path that identifies the container. This deprecation is announced in #685.
  2. Reserve the word "workspace" for a directory that contains directories, and apply it only to entities that act like a signac project.

Schematic

# Current
project/                                       <-- the project directory
project/workspace/[jobid]/                     <-- the default, a job directory contained in project workspace

# What this enables in the future (aligning with what Vyas and Simon have brought up)
project/workspace/[jobid]/workspace/           <-- after upgrading a job to a project
project/workspace/[jobid]/workspace/[jobid]    <-- adding some new jobs in the sub project
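
The nested layout in the schematic can be reproduced on disk with the standard library alone. This is only a sketch of the directory structure: the job ids are hypothetical placeholders, and nothing here uses or implies a signac API for upgrading a job to a project.

```python
import tempfile
from pathlib import Path

# Hypothetical job ids, used only for illustration.
OUTER_JOB_ID = "outer_job"
INNER_JOB_ID = "inner_job"

root = Path(tempfile.mkdtemp()) / "project"

# Current default: a job directory inside the project workspace.
outer_job_dir = root / "workspace" / OUTER_JOB_ID

# Proposed future: the job directory gains its own workspace with jobs in it.
inner_job_dir = outer_job_dir / "workspace" / INNER_JOB_ID
inner_job_dir.mkdir(parents=True)

# List every directory below the project root, relative to its parent.
for path in sorted(root.rglob("*")):
    print(path.relative_to(root.parent).as_posix() + "/")
```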

Benefits

Signac roadmap for context

I then realized that @vyasr already mentioned this idea in the tentative signac roadmap coming at it from a different angle. I think that means it's a good time to open a focused discussion on it. He suggested:

Using path instead of Project.root_directory and Job.workspace to facilitate a unified Directory interface for working with arbitrary filesystem layouts

Does this writeup capture your idea @vyasr?

joaander commented 2 years ago

What are the differences in semantics between "project workspace" and "project data space"? Or are they synonyms?

vyasr commented 2 years ago

Yeah, my proposed change is intended to address this problem in a slightly different way. Essentially, both a Job and a Project are directories. A directory has a path. Therefore, both of them should have a path, which fixes the analogy.

The concept of a workspace is a little more specific, relating to the exact directory layout currently used by signac. The data model can be roughly described as "A root directory, which we call a Project, contains a subdirectory called its workspace. That workspace directory in turn contains one subdirectory per data point, each of which is called a Job." A Job therefore does not have a workspace.

In fact, the solution that you proposed (allowing jobs to themselves contain workspaces) is precisely what @csadorf and I were trying to get at when we discussed the long-term roadmap and I made the case for both Job and Project subclassing a generic Directory! Both Job and Project are Directories, so they have a path, and that is independent of a particular layout. A given Project needs to have a well-defined layout, which is a higher-level concept that currently encompasses the workspace as well. By encoding that layout in a standalone "data model" concept, we would allow users to define different data layouts such as the nesting that you proposed. The project/workspace/job/ hierarchy is a specific data model that just happens to be our default.
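
One way to picture a standalone layout concept is sketched below. Every name here (DefaultLayout, FlatLayout, Project, job_path) is hypothetical and invented for illustration; this is not how signac is implemented and not a proposed API, only a reading of the idea that the project/workspace/job/ hierarchy is one layout among several.

```python
from pathlib import PurePosixPath


class DefaultLayout:
    """project/workspace/[jobid]/ : the default hierarchy described above."""

    def job_path(self, project_path, job_id):
        return PurePosixPath(project_path) / "workspace" / job_id


class FlatLayout:
    """A hypothetical alternative: job directories directly under the project."""

    def job_path(self, project_path, job_id):
        return PurePosixPath(project_path) / job_id


class Project:
    """Hypothetical project that delegates all path decisions to a layout."""

    def __init__(self, path, layout=None):
        self.path = PurePosixPath(path)
        self.layout = layout if layout is not None else DefaultLayout()

    def job_path(self, job_id):
        return self.layout.job_path(self.path, job_id)


default_project = Project("project")
flat_project = Project("project", layout=FlatLayout())
# Nesting: a job directory of one project serves as the path of another.
nested_project = Project(default_project.job_path("abc123"))

print(default_project.job_path("abc123"))  # project/workspace/abc123
print(flat_project.job_path("abc123"))  # project/abc123
print(nested_project.job_path("def456"))  # project/workspace/abc123/workspace/def456
```

Because the project only stores a path and asks the layout where jobs live, the nesting case needs no special handling.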

cbkerr commented 2 years ago

> What are the differences in semantics between "project workspace" and "project data space"? Or are they synonyms?

@joaander I've found 2 definitions of "data space" in the docs:

cbkerr commented 2 years ago

> the solution that you proposed (allowing jobs to themselves contain workspaces) is precisely what @csadorf and I were trying to get at when we discussed the long-term roadmap

@vyasr I made the connection between fixing the double meaning of workspace and your future "data model" after thinking about how to clarify the definition of workspace. I wrote out the future directory structure to show myself how clarifying the definition helps resolve some of my confusion around your idea. I will clarify my initial example that I was applying "my proposal" to the idea I had heard you discuss.

I think we are mostly on the same page! (edit: the following is a misconception corrected below). However, I don't think a Job is a directory. For instance, job = project.open_job({"a": 1}) creates a job but not a directory until you call job.init(). I don't have as clear an example, but I don't think a Project is a directory either. I would be comfortable saying that in the future, Job and Project both inherit from Directory, but not that they are directories. It also feels like we need to distinguish between the concept of a project (or job) and how it shows up on the file system.
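
The distinction between a job object and its directory on disk can be sketched with a stand-in class. LazyJob is hypothetical and is not the signac Job class; it only mirrors the behavior described above, where the object exists before init() creates the directory.

```python
import tempfile
from pathlib import Path


class LazyJob:
    """Hypothetical stand-in: knows its path but creates nothing until init()."""

    def __init__(self, path):
        self.path = Path(path)

    def init(self):
        # Only now does the directory appear on the file system.
        self.path.mkdir(parents=True, exist_ok=True)
        return self


workspace = Path(tempfile.mkdtemp())
job = LazyJob(workspace / "abc123")

exists_before = job.path.exists()
job.init()
exists_after = job.path.is_dir()

print(exists_before, exists_after)  # False True
```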

> By encoding that layout in a standalone "data model" concept, we would allow users to define different data layouts such as the nesting that you proposed.

What's a "data model"? I prefer the other term you use "data layout". But I could see other options too like "project structure/template/layout" or "file/directory layout". You use "file layout" in the roadmap.

bdice commented 2 years ago

Inheritance relationships like class Project(Directory) are usually described in programming with the terminology “is a,” as opposed to composition patterns that use “has a.” Not to get too deep into ontology but that word choice is common in CS. https://en.wikipedia.org/wiki/Is-a

In that sense, a Project or Job “is-a” Directory under the proposed class hierarchy.
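
A short sketch of the two relationships, under the assumption of the proposed class hierarchy. The classes below are hypothetical illustrations rather than signac code: inheritance expresses "is-a", and an attribute holding another object expresses "has-a".

```python
from pathlib import PurePosixPath


class Directory:
    """Hypothetical base class: anything that has a path."""

    def __init__(self, path):
        self.path = PurePosixPath(path)


class Job(Directory):
    """Is-a Directory through inheritance; it has no workspace of its own."""


class Project(Directory):
    """Is-a Directory through inheritance, and has-a workspace by composition."""

    def __init__(self, path):
        super().__init__(path)
        self.workspace = Directory(self.path / "workspace")


project = Project("project")
job = Job(project.workspace.path / "abc123")

print(isinstance(project, Directory), isinstance(job, Directory))  # True True
print(project.workspace.path)  # project/workspace
print(hasattr(job, "workspace"))  # False
```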

bdice commented 2 years ago

Fumbled buttons on my phone. Reopening.

edit: … twice.

cbkerr commented 2 years ago

Thank you for clarifying that!! I'll add a note about it to my comment but preserve my expressed confusion.

joaander commented 2 years ago

> What are the differences in semantics between "project workspace" and "project data space"? Or are they synonyms?
>
> @joaander I've found 2 definitions of "data space" in the docs:

I brought this up as it became an issue when writing the workflow tutorial for hoomd: https://hoomd-blue.readthedocs.io/en/v3.0.1/tutorial/05-Organizing-and-Executing-Simulations/01-Organizing-Data.html

The signac tutorials use the word "data space" a lot, so I introduced that concept first. But then signac mandates the directory name is "workspace". It is confusing for users (especially new users) when more than one word describes the same thing. If they are the same, it would be good to only use one - workspace since that is the required directory name. If they are different, then they need to be defined clearly and used consistently.

cbkerr commented 2 years ago

> It is confusing for users (especially new users) when more than one word describes the same thing.

Totally agree!

Issue tracking glossary: https://github.com/glotzerlab/signac-docs/issues/59 Google doc on defining terms: https://docs.google.com/document/d/1_merhcK3ohas4IloE616yL7gypMRFcaQLh2oChExC7o/edit?usp=sharing

vyasr commented 2 years ago

@cbkerr could you update this issue in case there were any important/useful/relevant points made in the meeting today that you think would help contribute to this discussion?

cbkerr commented 2 years ago

Now that #685 and #752 are done, I think all that is left of this issue is to phase out the use of "data space".

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cbkerr commented 1 year ago

@stale-bot this is not ready to be closed. This should remain open because "job workspace" still returns many hits in the next branch.

./signac/__main__.py:912:        help="Print the job's workspace path instead of the job id.",
./signac/contrib/import_export.py:741:    """Copy the source to job's workspace.
./signac/contrib/import_export.py:771:    """Copy the source to job's workspace when the source is a directory.
./signac/contrib/import_export.py:872:    """Copy the source to job's workspace when the source is a zipfile.
./signac/contrib/import_export.py:1006:    """Copy the source to job's workspace when the source is a tarfile.
./signac/contrib/import_export.py:1209:    data space paths that can be imported as a job workspace into project.
./signac/contrib/job.py:610:        """Initialize the job's workspace directory.
./signac/contrib/job.py:705:        """Remove the job's workspace including the job document.
./signac/contrib/job.py:853:        """Enter the job's workspace directory.
./signac/contrib/job.py:863:        Opening the context will switch into the job's workspace,
./signac/sync.py:210:    """Synchronize two job workspaces file by file, following the provided strategy."""
./signac/sync.py:298:        The src job, data will be copied from this job's workspace.
./signac/sync.py:300:        The dst job, data will be copied to this job's workspace.
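
A listing like the one above could be regenerated with a small script. This is a sketch under assumptions: the regular expression is a guess at what counts as a hit ("job workspace", "job's workspace", and their plurals), find_hits is a hypothetical helper, and the demo files are created in a temporary directory rather than read from the signac repository.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: "job workspace", "job's workspace", and plural forms.
PATTERN = re.compile(r"job(?:'s)? workspaces?")


def find_hits(root):
    """Return (relative path, line number, stripped line) for each match."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


# Demo input: one file with outdated wording and one without.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "job.py").write_text(
    'def init():\n    """Initialize the job\'s workspace directory."""\n',
    encoding="utf-8",
)
(demo_root / "clean.py").write_text('"""Return the job directory."""\n', encoding="utf-8")

hits = find_hits(demo_root)
for name, lineno, line in hits:
    print(f"./{name}:{lineno}: {line}")
```

Pointing find_hits at a checkout of the repository instead of demo_root would list the remaining occurrences.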

I made a more specific issue to track usage of "data space": https://github.com/glotzerlab/signac/issues/809

vyasr commented 1 year ago

@cbkerr any activity (including your comment) will cause stalebot to remove the stale label, but it will reapply the label as soon as the issue goes inactive again. If you want to keep an issue open permanently, you need to add the pinned label (left to you as an exercise if you think it's worth keeping this open indefinitely even if nobody puts in the effort to fix it 😉).

vyasr commented 1 year ago

@cbkerr could you revisit this now and see what you would like to change? IIUC the remaining action item is to remove all references to a job's "workspace" in docs in favor of a job's "directory" or the "path to a job" or something along those lines, is that correct? Would you be able to make that change?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cbkerr commented 1 year ago

All references to job workspace will be gone after https://github.com/glotzerlab/signac-docs/pull/185.

I'm changing the name of the issue to better track that we need to resolve this comment: https://github.com/glotzerlab/signac/issues/743#issuecomment-1100005467