hubmapconsortium / files-api

A RESTful service for getting information about and registering files
https://files.api.hubmapconsortium.org
MIT License

Method to create `file_info` ES document #2

Closed (shirey closed 1 year ago)

shirey commented 2 years ago

We need to create an Elasticsearch index of all published files associated with datasets (not files uploaded through the Ingest UI).

To accomplish this, we need a standalone method running in the file-api. Given as input a single dataset uuid (string), it will output a list of dictionaries, one dictionary per file in the dataset, with the following attributes:

AlanSimmons commented 2 years ago

@shirey: After discussion with @kburke, it seems like this asks for more than just file information. Organ, donor, and sample data would be more like provenance data, no?

kburke commented 2 years ago

I know @shirey is going to add clarification about where I should source data for this new endpoint when he is free. For now, I'm starting with my familiar territory in uuid-api, which I believe is the only place to get something like file size, unless I'm working directly with the file system.

I assume this will be supplemented with info from entity-api or the file system, and merged to form the response, as suggested by @AlanSimmons yesterday.

I would like clarification about "organs|donors|samples associated with a dataset" and possible second-level and beyond entities "associated" with them.

For now, I'm starting with the assumption that I just want direct ancestors of the query dataset, and will not follow any branch that may go off an ancestor. [image]

I also want all the descendants of the query dataset, no matter how many datasets it spreads out into. [image]

To repeat the above two assumptions and be explicit, here's a pic where I don't want the other samples of the sample which is the parent of my query dataset, or anything under them. I'm pretty sure that isn't what is meant by "associated with", but I want to confirm. [image]
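The first assumption above (direct ancestors only, no sibling branches) could be sketched as an upward-only walk over parent links. This is only an illustration on a toy in-memory graph: the `PARENTS` mapping, the uuids in it, and the `direct_ancestors` helper are all hypothetical, and a real implementation would query entity-api or neo4j instead.

```python
from collections import deque

# Hypothetical toy provenance graph: child uuid -> list of direct parent uuids.
PARENTS = {
    "dataset-q": ["sample-b"],    # the query dataset
    "sample-b": ["organ-1"],
    "sample-c": ["organ-1"],      # sibling of sample-b, NOT an ancestor of dataset-q
    "dataset-x": ["sample-c"],    # hangs off the sibling branch
    "organ-1": ["donor-1"],
    "dataset-d": ["dataset-q"],   # a descendant of the query dataset
}

def direct_ancestors(uuid, parents):
    """Collect every entity reachable by following parent links upward only."""
    seen = []
    queue = deque(parents.get(uuid, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against visiting a shared ancestor twice
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen

print(direct_ancestors("dataset-q", PARENTS))
# → ['sample-b', 'organ-1', 'donor-1']
```

Because the walk never follows a child link, `sample-c` and `dataset-x` are never reached even though they share `organ-1` with the query dataset.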

shirey commented 2 years ago

@kburke (and @AlanSimmons )

Organ, sample, and donor information should come from entity-api or neo4j (query up the provenance graph from the dataset). The data_types field comes from neo4j or entity-api; it is a property on the dataset. description and type_code come from the stuff that Alan has worked out. The rest, I think, comes from a call to uuid-api.

I don't see a need to get any descendant information.
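A rough sketch of how the sourcing described above might be merged into one dictionary per file. Everything here is an assumption for illustration: the `build_file_info` name, the key names (`dataset_uuid`, `kind`, `rel_path`, `size`), and the input shapes are hypothetical stand-ins for whatever uuid-api and entity-api actually return. The sketch also repeats the dataset-level fields inside every file element, which is one possible layout rather than a settled requirement.

```python
def build_file_info(dataset_uuid, dataset_props, ancestors, files):
    """Merge dataset-level data into one dictionary per file.

    dataset_props: hypothetical dict of dataset properties (e.g. from entity-api).
    ancestors:     hypothetical list of {"uuid": ..., "kind": ...} dicts found by
                   querying up the provenance graph.
    files:         hypothetical list of per-file dicts (e.g. from uuid-api).
    """
    def uuids_of(kind):
        return [a["uuid"] for a in ancestors if a["kind"] == kind]

    shared = {
        "dataset_uuid": dataset_uuid,
        "data_types": dataset_props.get("data_types", []),
        "organs": uuids_of("organ"),
        "samples": uuids_of("sample"),
        "donors": uuids_of("donor"),
    }
    # One output dictionary per file; file-specific keys sit beside the shared ones.
    return [{**shared, **file_entry} for file_entry in files]


# Made-up example inputs.
docs = build_file_info(
    "ds1",
    {"data_types": ["CODEX"]},
    [
        {"uuid": "s1", "kind": "sample"},
        {"uuid": "o1", "kind": "organ"},
        {"uuid": "d1", "kind": "donor"},
    ],
    [
        {"rel_path": "a/img.tif", "size": 1024},
        {"rel_path": "a/meta.json", "size": 12},
    ],
)
print(len(docs), docs[0]["donors"], docs[1]["rel_path"])
```

No descendant lookup appears in the sketch, in line with the comment above.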

kburke commented 2 years ago

I dropped current-state JSON at https://gist.github.com/kburke/23e03ae9da5d5c5b390e0ff1634d38c1. The structure isn't totally aligned with the requirements above. For example, I want to confirm whether the Dataset UUID should be repeated inside each file element instead of listed once, whether what I've stuffed under 'ancestors' needs to be denormalized, etc.

@AlanSimmons it might help to find a few Datasets I can develop toward, and some held-out Datasets for when you're ready to check me. I recommend:

  1. A small one you can look at and tell if it is right
  2. A massive one to test strength of implementation
  3. A complex one, which I think means a Dataset descended from multiple Datasets.

AlanSimmons commented 2 years ago

@kburke The provenance model is on slide 4 of this presentation.

AlanSimmons commented 2 years ago

@kburke There are two types of "dataset types", corresponding to the types of assays:

  1. Primary - directly from the lab
  2. Derived - in which the primary data is processed with an analysis pipeline

In general, a primary dataset type will have files in one file path and derived datasets in another file path. The file paths are mutually exclusive.

Example of a primary dataset type that you can access: CODEX

Example of a derived dataset type that you can access: CODEX (Cytokit + SPRM)

AlanSimmons commented 2 years ago

Derived datasets can, themselves, derive from other derived datasets.

Example: HBM279.JRTJ.535.

Look at the provenance graph (Graph link in the Provenance section).

[image]

AlanSimmons commented 2 years ago

@shirey Two questions:

  1. How do we handle cases in which we cannot match a file to a pattern? Do we return nothing, or "not found", or perhaps just the file extension?
  2. How do we handle files for dataset types that contain PII? These would never be in Globus anyway, but in dbGaP.

kburke commented 2 years ago

@AlanSimmons Just to be clear, both questions in your prior comment apply only to the "description" and "type_code" attributes of any single file element in the JSON response. Currently, without an exact match to a file pattern, these attributes are absent, but the file element itself is still present (including the file extension, for example).
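The current behavior described above could look roughly like the following. This is a sketch, not the actual files-api code: the `FILE_PATTERNS` table, its single regex, the placeholder description and type code, and the `describe_file` helper are all made up for illustration. The real patterns come from Alan's spreadsheet work.

```python
import os
import re

# Hypothetical pattern table: (compiled regex, description, type_code).
# The entry below is a placeholder, not a real pattern or code.
FILE_PATTERNS = [
    (re.compile(r".*\.ome\.tiff?"), "OME-TIFF image", "PLACEHOLDER_CODE"),
]

def describe_file(rel_path, patterns=FILE_PATTERNS):
    """Build a file element; add description/type_code only on an exact match."""
    element = {
        "rel_path": rel_path,
        "file_extension": os.path.splitext(rel_path)[1].lstrip("."),
    }
    for regex, description, type_code in patterns:
        if regex.fullmatch(rel_path):
            element["description"] = description
            element["type_code"] = type_code
            break
    # With no match, the two attributes are simply left out of the element.
    return element

print(describe_file("stitched/reg1.ome.tiff"))
print(describe_file("Proteomics/run1.raw"))
```

The second call still returns a file element with `rel_path` and `file_extension`, but without `description` or `type_code`, matching the "absent attribute" behavior rather than returning "not found".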

I've updated the gist with the current response for dataset 15ec310a304e1d4891cd33f4bc4cb197, which contains no file descriptions because everything is buried under 'Proteomics'.

A more conventional, current response is this new gist for 02e13b9b3cdc939cca397c42c2981dd1, in which file "description" is resolved from your spreadsheet work.