psychoinformatics-de/datalad-concepts


Concept of a `File` necessary? #14

Closed: mih closed this issue 8 months ago

mih commented 10 months ago

This is a common concept, and seems to suggest itself naturally. psychoinformatics-de/datalad-concepts#58 also includes it.

However, it comes with problems too, in particular in the datalad context.

mih commented 10 months ago

https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/datalad-datasets.yaml tries to avoid the concept of File (not entirely, but almost).

The closest equivalent is a DirectoryItem. Its role is simply and solely to assign a (unique) name in a namespace, and that namespace is a single directory. Any DirectoryItem has content, and that content is either a directory or file content (blob).

With such a concept, the majority of metadata is attributed to the file content, and DirectoryItem is merely a contextual helper that registers content in a container.
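
A rough LinkML-style sketch of this split (class names match the ones above, but the slots and types are illustrative, not copied from datalad-datasets.yaml):

```yaml
# Illustrative sketch only (schema header omitted); slots/types approximate
# the split described above, not the actual datalad-datasets.yaml definitions
classes:
  Directory:
    description: A container that provides a namespace for its items
    attributes:
      items:
        range: DirectoryItem
        multivalued: true
  DirectoryItem:
    description: Solely assigns a (unique) name within a single directory
    attributes:
      name:
        range: string
        required: true
      content:
        description: The named content, either a directory or a file blob
        any_of:
          - range: Directory
          - range: FileContent
  FileContent:
    description: The blob that carries most of the file-level metadata
    attributes:
      checksum:
        range: string
      byte_size:
        range: integer
```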

jsheunis commented 10 months ago

Thanks for the pointer. I looked at DirectoryItem and FileContent and I think these classes cover all bases, at least from the perspective of a ResearchDataset from psychoinformatics-de/datalad-concepts#58 which only defines md5sum, url, path and size. I can refactor my code to make use of these existing classes.

One thought I had was whether these concepts of DirectoryItem, FileContent, etc. fit specifically into the context of datalad-datasets (where they are currently located), or whether they are generic enough to be defined together with the generic Dataset and DatasetVersion concepts?

mih commented 10 months ago

The location of any of these is temporary. All classes are drafts. If we have now found a second use case for them, it makes sense to move them elsewhere.

jsheunis commented 10 months ago

There's something that I don't quite grasp how to map onto existing concepts and procedures yet. I also don't know yet exactly what my question is, so I'm putting down a progression of thoughts.

Let's look at the concept of a file (and a dataset being a collection of files) from the perspective of a user generating metadata from local files or entering metadata into some GUI (web-based or not). The ideal is to make them do the least amount of work necessary to generate the maximum amount of useful metadata. Let's say they want the metadata to include the complete file list. What they could conceivably do would be to:

  1. run a script that generates a complete file list (using basic command line tools or something like status2tabby), or
  2. point the GUI to a local directory where the GUI would run a similar script, or
  3. hand-edit a sheet with a list of files

The important part is that the resulting file list is in a format that will validate against the dataset schema. But also, such a format should not be too complicated for users to generate. A flat file list would be the simplest, with files as rows and a column each for properties such as path (relative to root), file_size, checksum, access_url, etc.
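
For illustration, such a flat list could look like the following (property names follow the examples above; all values are made up):

```yaml
# Made-up example rows for a flat file list: path relative to the dataset
# root, size in bytes, checksum, and an access URL
- path: sub-01/anat/sub-01_T1w.nii.gz
  file_size: 10256384
  checksum: md5:3c59dc048e8850243be8079a5c74d079
  access_url: https://example.org/store/3c59dc048e8850243be8079a5c74d079
- path: participants.tsv
  file_size: 1024
  checksum: md5:b6d767d2f8ed5d21a44b0e5886680cb9
  access_url: https://example.org/store/b6d767d2f8ed5d21a44b0e5886680cb9
```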

So the question becomes: is the format in which users provide their file lists (with help from a machine or not) exactly the same as the format that the schema defines? Or is there some translation layer in between? Or can the schema be defined in such a way, using classes that inherit from superclasses, that the translation of a complicated structure to a flat list is implicitly dealt with inside the schema?

Using our existing work, can Directory, DirectoryItem, and FileContent somehow be brought into a single class File used by the research-dataset-schema? Or do we need a translation step?

Something else to keep in mind is the high likelihood that automated processes will run on top of the schema to generate e.g. online forms. And a form that asks you to enter a flat list of files is much more desirable than a form that asks you to enter a directory, then several directory items, etc, etc.

mih commented 10 months ago

From my POV the needs and solutions you describe are "front-end". To put it bluntly, the input convenience is bought by ignoring the true nature of the underlying concepts.

If a tool facilitates the entry of metadata on an unversioned data "archive", it can take shortcuts and use a simplified schema (geared towards simplicity and uses such as form generation). But this would be different from a structure and terminology used for a generic data model (which must also be able to capture more complex cases, such as nesting, versioning, and redundant availability), yet still yield a sensible, homogeneous representation.

In short: yes, translation/mapping needed.

This should not be an uncommon need, hence it needs to be (and will be) supported well.
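
To illustrate what such a mapping has to do, here is one flat row from the example above ("sub-01/anat/sub-01_T1w.nii.gz") expanded into the nested Directory/DirectoryItem/FileContent structure (slot names are illustrative):

```yaml
# Illustrative only: a single flat path turned into nested namespaces
items:
  - name: sub-01
    content:              # a Directory
      items:
        - name: anat
          content:        # a Directory
            items:
              - name: sub-01_T1w.nii.gz
                content:  # FileContent
                  checksum: md5:3c59dc048e8850243be8079a5c74d079
                  byte_size: 10256384
```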

mih commented 10 months ago

psychoinformatics-de/datalad-schema#15 brings another case like this: a model of a Git commit. From the Git data model perspective things are simple. A commit is a reference to a tree (the content snapshot), zero or more parent commits, an author and a committer (each with a timestamp), and a message.

A fairly sensible model could be a flat set of properties for each of these aspects. However, those would have quite complex (or narrow) semantics.

psychoinformatics-de/datalad-schema#15 uses a PROV-inspired approach. Rather than direct properties, it records the provenance of a commit as two activities (the authoring of the new state vs the committing). This yields a more complex data structure, but each element has simpler (more generically understood) semantics.
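
A rough sketch of what such a PROV-style record could look like (purely illustrative; property names and values are not taken from datalad-schema#15):

```yaml
# Illustrative only: a commit described via two activities instead of
# flat author_*/committer_* properties
commit:
  id: ex:commit-abc123
  was_generated_by:
    - id: ex:authoring-1          # the authoring of the new state
      was_associated_with: ex:author-jane
      ended_at_time: "2023-11-01T10:00:00+00:00"
    - id: ex:committing-1         # the committing of that state
      was_associated_with: ex:committer-bot
      ended_at_time: "2023-11-02T09:30:00+00:00"
```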

mih commented 9 months ago

#31 brings some changes in this regard. It follows the model of DCAT, which distinguishes abstract/conceptual resources that are realized by concrete distributions.

For DataLad we can keep that distinction to express how one and the same file can be available from multiple remotes. The DCAT notion is more flexible: it allows a resource's nature to change considerably (file formats, etc.) between distributions.

For DataLad we do not need this flexibility, but it does not hurt to have the base model offer this expressiveness.
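
For illustration, the resource/distribution split could be expressed roughly like this for one file content that is available from two remotes (names and URLs are made up):

```yaml
# Illustrative only: one conceptual file content (resource) realized by
# two concrete distributions, i.e. redundant availability from two remotes
resource:
  id: ex:filecontent-3c59dc
  checksum: md5:3c59dc048e8850243be8079a5c74d079
  distributions:
    - access_url: https://store-a.example.org/3c59dc048e8850243be8079a5c74d079
    - access_url: s3://store-b/3c59dc048e8850243be8079a5c74d079
```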