Method to create `file_info` ES document

shirey commented 2 years ago

We need to create an Elasticsearch index of all published files associated with datasets (not Ingest UI uploaded files).

To accomplish this, we need a standalone method running in the file-api. Given the input (string) of a single dataset uuid, it will output (a list of dictionaries) one dictionary per file in the dataset, with the following attributes:

description - (string) the description of the file, obtained by resolving the file type via the directory schema for the assay type of the associated dataset, from the HuBMAP application ontology
type_code - (string) OPTIONAL the code of the associated ontology term for the file, as resolved through the HuBMAP application ontology
rel_path - (string) the relative path to where the file sits
size - (int) OPTIONAL size of the file in bytes
data_types - (list of strings) the data_types of the associated dataset
dataset_uuid - (string) the uuid of the associated dataset
organs - (list of objects) the list of organs associated with the dataset, each with the following properties
- uuid - (string) the uuid of the organ ancestor
- type - (string) the organ type of the ancestor organ, resolved to the readable description of the organ
- type_code - (string) the organ code of the ancestor organ
file_extension - (string) the file extension of the file
donors - (list of objects) the list of donors associated with the dataset, each with the following properties
- uuid - (string) the uuid of the associated donor
- age - (int) OPTIONAL age in years of the associated donor
- race - (string) OPTIONAL race of the donor
samples - (list of objects) the list of all tissue Samples associated with the dataset, each with the following properties
- uuid - (string) the uuid of the associated tissue Sample
- type - (string) the type of the sample, resolved to the description of the associated tissue Sample
- code - (string) the sample code (specimen_type in the db) of the associated tissue Sample
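
For reference, the attribute list above could be written down as Python types roughly like this. This is only a sketch of the shape, not the file-api implementation: the class names and the `missing_required` helper are made up for illustration, and it assumes Python 3.9+ for `__required_keys__`. Attributes marked OPTIONAL above are the only ones treated as optional here.

```python
from typing import List, TypedDict


class Organ(TypedDict):
    uuid: str
    type: str        # readable description of the organ
    type_code: str   # organ code


class _DonorRequired(TypedDict):
    uuid: str


class Donor(_DonorRequired, total=False):
    age: int         # OPTIONAL, in years
    race: str        # OPTIONAL


class Sample(TypedDict):
    uuid: str
    type: str        # readable description of the sample type
    code: str        # specimen_type in the db


class _FileInfoRequired(TypedDict):
    description: str
    rel_path: str
    data_types: List[str]
    dataset_uuid: str
    organs: List[Organ]
    file_extension: str
    donors: List[Donor]
    samples: List[Sample]


class FileInfo(_FileInfoRequired, total=False):
    type_code: str   # OPTIONAL
    size: int        # OPTIONAL, in bytes


def missing_required(doc: dict) -> List[str]:
    """Hypothetical helper: names of required attributes absent from doc."""
    return sorted(k for k in FileInfo.__required_keys__ if k not in doc)
```

A check like `missing_required(doc)` before indexing would flag file dictionaries that lack a non-optional attribute.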

AlanSimmons commented 2 years ago

@shirey: After discussion with @kburke, it seems like this asks for more than just file information. Organ, donor, and sample data would be more like provenance data, no?

kburke commented 2 years ago

I know @shirey is going to add clarification about where I should source data for this new endpoint when he is free. For now, I'm starting with my familiar territory in uuid-api, which I believe is the only place for something like file size, unless I'm working directly with the file system.

I assume this will be supplemented with info from entity-api or the file system, and merged to form the response, as suggested by @AlanSimmons yesterday.

I would like clarification about "organs|donors|samples associated with a dataset" and possible second-level and beyond entities "associated" with them.

For now, I'm starting with the assumption that I just want direct ancestors of the query dataset, and will not follow any branch that may go off an ancestor.

I also want all the descendants of the query dataset, no matter how many datasets it spreads out into.

To repeat the above two assumptions and be explicit, here's a pic where I don't want other samples of the sample which is the parent of my query dataset, or anything under them. I'm pretty sure that isn't what is meant by "associated with", but want to confirm.

shirey commented 2 years ago

@kburke (and @AlanSimmons )

organs, samples and donors information should come from entity-api or neo4j (query up the provenance graph from the dataset). The data_types field comes from neo4j or entity-api; it is a property on dataset. description and type_code come from the stuff that Alan has worked out. The rest, I think, comes from a call to uuid-api.

I don't see a need to get any descendant information.
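
To make "query up the provenance graph from the dataset" concrete, here is a minimal sketch of an ancestors-only walk. It assumes the parent links have already been fetched into a plain dict (in practice they would come from entity-api or neo4j); the `collect_ancestors` name and the toy uuids are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def collect_ancestors(parents: Dict[str, List[str]], dataset_uuid: str) -> Set[str]:
    """Walk upward only: parents, grandparents, and so on.

    Descendants and sibling branches are never visited, because the walk
    only ever follows child -> parent links.
    """
    seen: Set[str] = set()
    queue = deque(parents.get(dataset_uuid, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(parents.get(node, []))
    return seen


# Toy graph (hypothetical ids): child uuid -> list of parent uuids.
toy_parents = {
    "organ-1": ["donor-1"],
    "sample-1": ["organ-1"],
    "sample-2": ["organ-1"],          # sibling branch of sample-1
    "dataset-q": ["sample-1"],        # the query dataset
    "dataset-other": ["sample-2"],
    "dataset-derived": ["dataset-q"],  # descendant of the query dataset
}
ancestors_of_q = collect_ancestors(toy_parents, "dataset-q")
```

With this toy graph, `ancestors_of_q` holds `sample-1`, `organ-1` and `donor-1`; the sibling `sample-2` and the derived dataset are left out.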

kburke commented 2 years ago

I dropped current-state JSON at https://gist.github.com/kburke/23e03ae9da5d5c5b390e0ff1634d38c1. The structure isn't totally aligned to the requirements above. For example, I want to confirm whether the Dataset UUID should be repeated inside each file element instead of listed once, whether what I've stuffed under 'ancestors' needs to be denormalized, etc.
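
If the answer is "repeat it in each file element", which is how I read "one dictionary per file" in the original request, the denormalizing step might look something like the sketch below. The `denormalize` function and its argument names are hypothetical and not taken from the gist.

```python
import copy
from typing import Any, Dict, List


def denormalize(dataset_fields: Dict[str, Any],
                files: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy the dataset-level fields into every per-file dictionary."""
    docs = []
    for file_fields in files:
        # Deep copy so each document owns its organs/donors/samples lists.
        doc = copy.deepcopy(dataset_fields)
        doc.update(file_fields)
        docs.append(doc)
    return docs
```

Each resulting dictionary then stands alone as one Elasticsearch document, at the cost of repeating the dataset, organ, donor and sample information once per file.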

@AlanSimmons it might help to find a few Datasets so I can develop toward them, and some held-out Datasets for when you're ready to check me. I recommend:

A small one you can look at and tell if it is right
A massive one to test strength of implementation
A complex one, which I think means a Dataset descended from multiple Datasets.

AlanSimmons commented 2 years ago

@kburke The provenance model is on slide 4 of this presentation.

AlanSimmons commented 2 years ago

@kburke There are two types of "dataset types", corresponding to the types of assays:

Primary - directly from the lab
Derived - in which the primary data is processed with an analysis pipeline

In general, a primary dataset type will have files in one file path and a derived dataset type will have files in another file path. The file paths are mutually exclusive.

Example of a primary dataset type that you can access: CODEX
Example of a derived dataset type that you can access: CODEX (Cytokit + SPRM)

AlanSimmons commented 2 years ago

Derived datasets can, themselves, derive from other derived datasets.

Example: HBM279.JRTJ.535.

Look at the provenance graph (Graph link in the Provenance section).

AlanSimmons commented 2 years ago

@shirey Two questions:

How do we handle cases in which we cannot match a file to a pattern? Do we return nothing, or "not found", or perhaps just the file extension?
How do we handle files for dataset types that contain PII? These would never be in Globus, anyway, but in dbGaP.

kburke commented 2 years ago

@AlanSimmons Just to be clear, in your prior comment, both these questions apply only to the "description" and "type_code" attributes of any single file element in the JSON response. Currently, without an exact match to a file pattern, these attributes are absent, but the file element itself is still present (including the file extension, for example.)
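
Roughly, the current behavior described above looks like the sketch below. It is a simplification rather than the actual code: it assumes the file patterns are regular expressions matched against the whole relative path, and `build_file_element` and the pattern dictionary keys are names made up for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import Any, Dict, List, Optional


def build_file_element(rel_path: str,
                       patterns: List[Dict[str, str]],
                       size: Optional[int] = None) -> Dict[str, Any]:
    """Build one file element; description/type_code appear only on a match."""
    element: Dict[str, Any] = {
        "rel_path": rel_path,
        "file_extension": PurePosixPath(rel_path).suffix,
    }
    if size is not None:
        element["size"] = size
    for entry in patterns:
        # Assumption: a pattern must match the entire relative path.
        if re.fullmatch(entry["pattern"], rel_path):
            element["description"] = entry["description"]
            if "type_code" in entry:
                element["type_code"] = entry["type_code"]
            break
    return element
```

A file with no matching pattern still produces an element with `rel_path` and `file_extension`; only `description` and `type_code` are left out.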

I've updated the gist with the current response for dataset 15ec310a304e1d4891cd33f4bc4cb197, which contains no file descriptions because everything is buried under 'Proteomics'.

A more conventional, current response is this new gist for 02e13b9b3cdc939cca397c42c2981dd1, in which file "description" is resolved from your spreadsheet work.

hubmapconsortium / files-api

Method to create `file_info` ES document #2