How to register larger file structures without their files

olinux commented 2 years ago

The result of the EBRAINS image service is a SWIFT bucket with several hundred thousand files. We don't want to index all of them in openMINDS as individual files, since the clients interpret the structure as a whole anyway. With the current specifications, we could register it as a "file repository" and specify the correct "ContentType" (what would this be in this case?) so that clients can identify it and interpret it correctly. However, the question remains whether this has to be registered as a derived data set of the originally processed data. If we instead integrate it into the original dataset, we would need an additional way to link it, since it would be a different semantic relation than the "fileRepository" of the RPV.

What are your thoughts @lzehl ?

FYI: @xgui3783 @majpuc

Majpuc commented 2 years ago

For your info, in case it is relevant: I create the ingestion at the level of the TSC. We couple the TSC files to the service links via the file bundles. This implies that one dataset in the KG has several swift buckets linked to it, one for each TSC and service link.

xgui3783 commented 2 years ago

I would certainly think it would have to be

a derived data set of the originally processed data

I should mention that while it does produce many hundreds/thousands of files, with respect to a specific contentType (e.g. deep zoom image or neuroglancer precomputed format), only 1 URL is relevant.

xgui3783 commented 2 years ago

TSC

apologies, what's TSC?

Majpuc commented 2 years ago

Sorry :) Tissue sample collection

olinux commented 2 years ago

@apdavison would it be an option to use a computational schema to link a "fileRepository" to the process (and the research product version) similar to "visualization"?

Maybe we could call it "conversion" and let it register a "fileRepository", "fileBundle" or "file" as input and a (non-indexed) "fileRepository" containing a specific content-type (as Xiao has mentioned above) as output. We might also want to register the service/softwareAgent which executes the process (in our case "ImageService")?

(discussed with @lzehl )

apdavison commented 2 years ago

My memory is that this was one of the main original use cases for the FileBundle concept.

@olinux The current computation schemas all allow File and FileBundle as outputs, but not at the moment FileRepository. Semantically it makes more sense to think of a computation creating a FileBundle. I think it's a good idea to use the computation schemas to keep track of this. I'll create a new Computation sub-type (maybe "FormatConversion", or more generically "PreProcessing"?) and make a PR.

lzehl commented 2 years ago

@apdavison sounds good. could we specify a fileRepository as output though (as well) for this specific schema? Reason: fileBundles can only be registered as part of an existing fileRepository. services such as the image service produce data that should not be registered in a dataset(version) and would therefore need to specify their fileRepository directly.

apdavison commented 2 years ago

@lzehl I would suggest having the FileBundle as output, and also creating a new FileRepository to which it is linked through the "is_part_of" property.

lzehl commented 2 years ago

@apdavison could you elaborate why? for the image service this would be an overhead as far as I understood. they are going to create file repos per computation, so the file bundles would equal the file repos.

apdavison commented 2 years ago

@lzehl are the file repos (CSCS Swift containers I guess?) created automatically by the image service, or are they created manually beforehand?

lzehl commented 2 years ago

@apdavison as far as I understood they are created automatically. @olinux @Majpuc correct?

apdavison commented 2 years ago

ok, so in that case having FileRepository as an output would be an accurate representation.

I think it comes down to whether this is a rare edge-case (the vast majority of computational processes in EBRAINS do not produce repositories) or not.

If it's a very rare case, the costs of having to handle more complex schemas in the UI might outweigh the small overhead of having to resolve the "is_part_of" link.

On balance, I'm happy with either solution.

Majpuc commented 2 years ago

Regarding the chunk containers: they are created automatically. Until recently they were CSCS public containers connected to an ICEI project belonging to Marc’s team, but after a change in the image service they are now Collaboratory buckets accessible through the data proxy API. I have collected all the metadata connecting the container IDs to their corresponding datasets.

lzehl commented 2 years ago

@olinux could you comment on @apdavison's remark? I think he has a point, but I'm not sure which is less overhead:

handling the increase in query complexity when we allow FileRepository as an additional output for the new schema OR
handling the increase in query complexity because all containers from the image service first have to be registered as "free floating" FileRepository and are then linked indirectly as output to the new schema via a FileBundle

olinux commented 2 years ago

Although I do agree with @apdavison that from a conceptual point of view it's better to have a FileBundle as an output, the key problem is that a "FileBundle" doesn't have its own IRI.

Here, we're talking about a FileBundle which explicitly is not supposed to have any relations to files (since we're not going to index them). This leaves the interpretation to the client, which would need to know that "if I have a FileBundle of type X without a link to any files, I have to implicitly assume that the source is the full repository". This is a lot of implicit model knowledge which we push onto the client. If we used the FileRepository (although it's an edge case), the granularity would be clear.
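
To make the trade-off concrete, here is a minimal Python sketch of the two client-side lookups. The class and function names are hypothetical and only loosely mirror the openMINDS concepts discussed here; they are not the actual schemas, and the IRI and content type are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileRepository:
    iri: str           # location of the container / bucket
    content_type: str  # tells the client how to interpret the whole structure


@dataclass
class FileBundle:
    name: str
    content_type: str
    is_part_of: FileRepository                      # a bundle has no IRI of its own
    files: List[str] = field(default_factory=list)  # empty when files are not indexed


def source_from_bundle(bundle: FileBundle) -> str:
    # Implicit rule the client would have to know:
    # "no linked files" means "the whole parent repository is the source".
    if not bundle.files:
        return bundle.is_part_of.iri
    raise ValueError("bundle lists individual files; resolve them one by one")


def source_from_repository(repo: FileRepository) -> str:
    # Explicit variant: the repository itself is the output, no convention needed.
    return repo.iri


# Placeholder values, for illustration only.
repo = FileRepository("https://example.org/container", "example/content-type")
bundle = FileBundle("all chunks", "example/content-type", is_part_of=repo)
print(source_from_bundle(bundle))
print(source_from_repository(repo))
```

Both calls end up at the same container, but only the bundle variant depends on a convention that every client has to implement identically.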

lzehl commented 2 years ago

@olinux what about combining this issue in some way with #257 ?

Suggestion:

keep the FileBundle only for the cases where the Files of a FileRepository are indexed.
create a FileArchive schema that can exist independently of a FileRepository and that states the full URL of a container or container+prefix (down to a *.zip file URL) where the respective individual Files should not / cannot be indexed

FileArchive properties:

IRI (required)
content (required, string)
format (required, contentType) OR archiveType (controlledTerms/FileArchiveType; new)
fileNumber (optional, integer)

The format (contentType) OR archiveType needs to make clear what we are looking at. ZIP etc. are proper archive/container formats which have classical contentTypes. A CSCS container of the image service is not really a classical archive/container format, but "just" needs to be tagged to be treated as such (meaning no indexing of the contained files).

@apdavison @dickscheid @Majpuc would that fit the needs for now? @olinux do we need to make an explicit relation to an indexed File for, e.g., an indexed zip of a dataset? or is the IRI enough? could that crash because file IRIs are being used as instance identifiers in the KG?

More detailed properties can be added at a later point in time as well of course. For example optional navigation instructions for the files contained in the archive.

Majpuc commented 2 years ago

This sounds like a nice solution. For info, concerning the chunks generated by the Image service: the archive types are both classical CSCS containers and collab buckets. The format is .dzi and .png (see example here: https://wiki.ebrains.eu/bin/view/Collabs/img-d6a8e5ab-eba3-4799-8787-c0e858df8515/Bucket) or neuroglancer files (https://wiki.ebrains.eu/bin/view/Collabs/ng-chunks/Bucket). We need to keep the file structure and formats, as the viewers and tools need to be able to read those files (no zipping possible).

lzehl commented 2 years ago

@Majpuc thanks for the feedback and the additional information. Let's wait for the feedback of the others (@olinux, @apdavison , @dickscheid).

olinux commented 2 years ago

Hi @lzehl, I think it's a good idea not to overload the "file repository" with multiple meanings, and I therefore support the suggestion to have a new "FileArchive". As I understand it, #257 is asking for something different though, since it is about being able to reference files inside the archive.

We had another idea which is immediately compatible with everything we already have: Why don't we specify a new file format (with a unique content type) (formatted e.g. in JSON) containing all the information needed to interpret such a container (maybe there are other parameters than just the IRI)? The image service could generate this file, add it to the file repository of the original dataset (so this "intermediate file" becomes part of the dataset and therefore is properly linked to it) and flag it with the right content type.

On the interpretation side, all a client (such as the 3d viewers) has to do is to look up the KG for this content-type, read in the file and interpret the well-structured information.

I could already see other use-cases we could support with such an approach whilst keeping the metadata-structure stable.
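
A minimal Python sketch of the descriptor-file idea, under stated assumptions: the content type string, the JSON keys (`containerIRI`, `chunkFormat`) and the file name are all hypothetical, since no concrete format has been defined here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical content type used to flag the descriptor file.
DESCRIPTOR_CONTENT_TYPE = "application/vnd.example.container-descriptor+json"
REQUIRED_KEYS = {"containerIRI", "chunkFormat"}  # hypothetical keys


def write_descriptor(path: Path, container_iri: str, chunk_format: str) -> dict:
    # What the image service would generate and add to the file repository.
    descriptor = {"containerIRI": container_iri, "chunkFormat": chunk_format}
    path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    # Minimal stand-in for the file's metadata record in the KG.
    return {"path": str(path), "format": DESCRIPTOR_CONTENT_TYPE}


def read_descriptor(file_record: dict) -> dict:
    # What a client does after looking up files with the descriptor content type.
    if file_record["format"] != DESCRIPTOR_CONTENT_TYPE:
        raise ValueError("not a container descriptor")
    text = Path(file_record["path"]).read_text(encoding="utf-8")
    descriptor = json.loads(text)
    missing = REQUIRED_KEYS - set(descriptor)
    if missing:
        raise ValueError(f"descriptor is missing keys: {sorted(missing)}")
    return descriptor


with tempfile.TemporaryDirectory() as tmp:
    record = write_descriptor(
        Path(tmp) / "image-service.json",
        "https://example.org/container",  # placeholder IRI
        "neuroglancer",
    )
    loaded = read_descriptor(record)
    print(loaded["containerIRI"])
```

In this sketch the metadata structure stays unchanged: everything specific to the container lives in the JSON file, and the content type is the only thing a client has to query for.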

lzehl commented 2 years ago

@olinux not sure what you mean. Let's discuss this in person. But already before:

FileArchive properties:

IRI (required)
content (required, string)
format (required, contentType) [requires us to define at least one content type for the image service]
fileNumber (optional, integer)
configuration (optional, Configuration | cf. suggestion in #322)

Storing something with data that were already released (for example) is not a good option, because it changes data content which has already received a DOI.

The linkage between Dataset and ImageService Container should be done through the service link, which we need to extend in a minimal way by adding a property "sourceData".

lzehl commented 2 years ago

Report of in-person meeting between @olinux , @xgui3783 , @alexisdurieux , @marcnm :

We are going to setup a FileArchive schema with the following properties:

"IRI" (required; single value; string, format: iri; instruction: link to the image service bucket / container)
"format" (required; single value; object, _linkedTypes: ContentType; instruction: link to image service content types, see below)
"sourceData" (required; value array, object, _linkedTypes: File; instruction: link to the source files ingested by the image service)

The ServiceLink schema will be updated:

existing property "dataLocation" will also point to FileArchive (@olinux since source data are specified in that schema, are you able to query this for the KG Search so that we do not have to specify the source data again in the ServiceLink?)
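
As a rough illustration of the agreed structure, here is a Python sketch of a FileArchive and a ServiceLink whose "dataLocation" points to it. It is not the openMINDS serialization: the class layout, the `name` field, the non-empty check on "sourceData" and all IRIs are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileArchive:
    iri: str                # "IRI": link to the image service bucket / container
    format: str             # "format": reference to a ContentType
    source_data: List[str]  # "sourceData": references to the ingested source Files

    def __post_init__(self) -> None:
        # All three properties are required. Rejecting an empty sourceData
        # array is an assumption of this sketch.
        if not self.iri or not self.format:
            raise ValueError("IRI and format are required")
        if not self.source_data:
            raise ValueError("sourceData needs at least one File")


@dataclass(frozen=True)
class ServiceLink:
    name: str                   # hypothetical field, for readability only
    data_location: FileArchive  # "dataLocation" may point to a FileArchive


def source_files(link: ServiceLink) -> List[str]:
    # The source data are reached through the archive, so the ServiceLink
    # does not need to repeat them.
    return list(link.data_location.source_data)


archive = FileArchive(
    iri="https://example.org/container",          # placeholder
    format="ebrains-image-service.neuroglancer",  # name proposed in this thread
    source_data=["https://example.org/files/section_001.tif"],  # placeholder
)
link = ServiceLink(name="viewer link", data_location=archive)
print(source_files(link))
```

The `source_files` helper corresponds to the query question above: if the KG can follow "dataLocation" to the FileArchive, the source data only have to be stated once.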

We will need at least two ContentTypes for the EBRAINS image service: ebrains-image-service.neuroglancer & ebrains-image-service.microsoft-deep-zoom (@apdavison @jagru20 would that be correct? better ideas?)

@xgui3783 what does DCI stand for?

Further action points for EBRAINS: Adding at least one FileArchive instance to the KG as proof of concept (ideally for a dataset version where the service connected through the service link actually uses the data in the FileArchive).

Outlook: In the long run the EBRAINS image service could consider automatic registration of the whole provenance producing the FileArchives using the openMINDS_computation extension.

@apdavison & @dickscheid FYI

Majpuc commented 2 years ago

Hi guys, not DCI. It is DZI.

lzehl commented 2 years ago

@Majpuc :grin: thanks and what is the full name?

Majpuc commented 2 years ago

Deep zoom image. https://openseadragon.github.io/examples/tilesource-dzi/

lzehl commented 2 years ago

thanks; not sure if I have all information needed here though.... the format of the single files is DZI but is the central service that can ingest the respective image service microsoft-deep-zoom ???

Majpuc commented 2 years ago

No, for ingestion of images we have input formats and output (chunk)formats. I will add you to the collab where some of this is described.

lzehl commented 2 years ago

@Majpuc thanks but that is not what I mean.

lzehl commented 2 years ago

who is using the dzi files that are created by the image-service?

Majpuc commented 2 years ago

The DZI chunk format is used for all 2D image data, by both viewers (Localizoom and Multi-image OSd viewer) and tools (WebAlign and Scaler service).

lzehl commented 2 years ago

solved in #325

openMetadataInitiative / openMINDS_core

How to register larger file structures without their files #318