Closed (shirey closed this issue 1 year ago)
@shirey: After discussion with @kburke, it seems like this asks for more than just file information. Organ, donor, and sample data would be more like provenance data, no?
I know @shirey is going to add clarification about where I should source data for this new endpoint when he is free. For now, I'm starting with my familiar territory in uuid-api, which I believe is the only place for something like file size, unless I'm working directly with the file system.
I assume this will be supplemented with info from entity-api or the file system, and merged to form the response, as suggested by @AlanSimmons yesterday.
I would like clarification about "organs|donors|samples associated with a dataset" and possible second-level and beyond entities "associated" with them.
For now, I'm starting with the assumption that I just want direct ancestors of the query dataset, and will not follow any branch that may go off an ancestor.
I also want all the descendants of the query dataset, no matter how many datasets it spreads out into.
To repeat the above two assumptions and be explicit, here's a pic where I don't want other samples of the sample which is the parent of my query dataset, or anything under them. I'm pretty sure that isn't what is meant by "associated with", but want to confirm.
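To make the "direct ancestors only" assumption concrete, here is a minimal Python sketch of that traversal. The graph, its ids, and the `direct_ancestors` helper are all made up for illustration; the real provenance data would come from a service, not an in-memory dict.

```python
from collections import deque

# Hypothetical provenance edges: child id -> list of immediate parent ids.
PARENTS = {
    "dataset-query": ["sample-a"],
    "dataset-sibling": ["sample-a"],   # another child of the same parent sample
    "sample-a": ["sample-organ"],
    "sample-b": ["sample-organ"],      # a branch going off an ancestor
    "sample-organ": ["donor-1"],
}


def direct_ancestors(entity_id, parents):
    """Walk upward only, nearest ancestor first; never step down into a branch."""
    found = []
    queue = deque(parents.get(entity_id, []))
    while queue:
        current = queue.popleft()
        if current in found:
            continue  # already visited via another path
        found.append(current)
        queue.extend(parents.get(current, []))
    return found


print(direct_ancestors("dataset-query", PARENTS))
```

Because the walk only follows child-to-parent edges, `sample-b` and `dataset-sibling` are never reached, which is the behavior I'd like confirmed.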
@kburke (and @AlanSimmons)
organs, samples, and donors information should come from entity-api or neo4j (query up the provenance graph from the dataset). The data_types field comes from neo4j or entity-api; it is a property on the dataset. description and type_code come from the stuff that Alan has worked out. The rest, I think, comes from a call to uuid-api.
I don't see a need to get any descendant information.
I dropped current-state JSON at https://gist.github.com/kburke/23e03ae9da5d5c5b390e0ff1634d38c1. The structure isn't totally aligned to requirements above. For example, I want to confirm whether the Dataset UUID should be repeated inside each file element instead of listed once, whether what I've stuffed under 'ancestors' needs to be denormalized, etc.
@AlanSimmons it might help to find a few Datasets so I can develop toward them, and some held-out Datasets for when you're ready to check me. I recommend:
@kburke The provenance model is on slide 4 of this presentation.
@kburke There are two types of "dataset types", corresponding to the types of assays:
In general, a primary dataset type will have files in one file path and derived datasets in another file path. The file paths are mutually exclusive.
Example of a primary dataset type that you can access: CODEX
Example of a derived dataset type that you can access: CODEX (Cytokit + SPRM)
Derived datasets can, themselves, derive from other derived datasets.
Example: HBM279.JRTJ.535.
Look at the provenance graph (Graph link in the Provenance section).
@shirey Two questions:
@AlanSimmons Just to be clear, in your prior comment, both these questions apply only to the "description" and "type_code" attributes of any single file element in the JSON response. Currently, without an exact match to a file pattern, these attributes are absent, but the file element itself is still present (including the file extension, for example).
I've updated the gist with the current response for dataset 15ec310a304e1d4891cd33f4bc4cb197, which contains no file descriptions because everything is buried under 'Proteomics'.
A more conventional, current response is this new gist for 02e13b9b3cdc939cca397c42c2981dd1, in which file "description" is resolved from your spreadsheet work.
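A rough Python sketch of the behavior described above, where a file element is always emitted but "description" and "type_code" only appear on an exact pattern match. The pattern table, its placeholder description and code, and the `describe_file` helper are hypothetical and not taken from the spreadsheet; stripping the leading dot from the extension is also an assumption.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern table: (compiled regex, description, type_code).
# The real patterns and codes would come from the directory schema work.
FILE_PATTERNS = [
    (re.compile(r"processed/.*\.ome\.tiff"), "example description", "EXAMPLE_CODE"),
]


def describe_file(rel_path, patterns):
    """Build a file element; add description/type_code only on an exact match."""
    element = {
        "rel_path": rel_path,
        # Assumption: extension is reported without the leading dot.
        "file_extension": PurePosixPath(rel_path).suffix.lstrip("."),
    }
    for pattern, description, type_code in patterns:
        if pattern.fullmatch(rel_path):
            element["description"] = description
            element["type_code"] = type_code
            break
    return element


print(describe_file("processed/stitched/reg1.ome.tiff", FILE_PATTERNS))
print(describe_file("Proteomics/run1/data.raw", FILE_PATTERNS))
```

The second call has no matching pattern, so its element carries only the path and extension, as in the 'Proteomics' case.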
We need to create an Elasticsearch index of all published files associated with datasets (not Ingest UI uploaded files).
To accomplish this, we need a standalone method running in the file-api that, given
input: (string) single dataset uuid
will
output: (a list of dictionaries) one dictionary per file in the dataset with the following attributes:
description - (string) the description of the file as described by resolving the file type via the directory schema for the assay type of the associated dataset from the HuBMAP application ontology
type_code - (string) OPTIONAL the code of the associated ontology term for the file as resolved through the HuBMAP application ontology
rel_path - (string) the relative path of where the file sits
size - (int) OPTIONAL size of the file in bytes
data_types - (list of strings) the data_types of the associated dataset
dataset_uuid - (string) the uuid of the associated dataset
organs - (list of object) the list of organs associated with the associated dataset with the following properties
  uuid - (string) the uuid of the organ ancestor
  type - (string) the organ type of the ancestor organ, resolved to the readable description of the organ
  type_code - (string) the organ code of the ancestor organ
file_extension - (string) the file extension of the file
donors - (list of object) the list of donors associated with the associated dataset with the following properties
  uuid - (string) the uuid of the associated donor
  age - (int) OPTIONAL age in years of the associated donor
  race - (string) OPTIONAL race of the donor
samples - (list of object) the list of all tissue Samples associated with the associated dataset with the following properties
  uuid - (string) the uuid of the associated tissue Sample
  type - (string) the type of the sample resolved to the description of the associated tissue Sample
  code - (string) the sample code (specimen_type in the db) of the associated tissue Sample
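As an illustration only, one dictionary of the output described above could be assembled along these lines in Python. The `build_file_record` name, its signature, and every value in the example call are placeholders, not part of file-api. Leaving out absent optional attributes rather than emitting nulls is an assumption, as is treating "description" as omittable when no file pattern matches.

```python
from typing import Any, Dict, List, Optional


def build_file_record(
    dataset_uuid: str,
    data_types: List[str],
    rel_path: str,
    file_extension: str,
    organs: List[Dict[str, Any]],
    donors: List[Dict[str, Any]],
    samples: List[Dict[str, Any]],
    description: Optional[str] = None,
    type_code: Optional[str] = None,
    size: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble one per-file dictionary with the attributes listed above."""
    record: Dict[str, Any] = {
        "rel_path": rel_path,
        "file_extension": file_extension,
        "data_types": list(data_types),
        "dataset_uuid": dataset_uuid,
        "organs": organs,
        "donors": donors,
        "samples": samples,
    }
    # Assumption: attributes with no value are omitted rather than set to None.
    optional = (("description", description), ("type_code", type_code), ("size", size))
    for key, value in optional:
        if value is not None:
            record[key] = value
    return record


# Placeholder values only; none of these identify real entities.
example = build_file_record(
    dataset_uuid="dataset-uuid-placeholder",
    data_types=["example-assay"],
    rel_path="some/dir/file.txt",
    file_extension="txt",
    organs=[{"uuid": "organ-uuid-placeholder", "type": "example organ", "type_code": "XX"}],
    donors=[{"uuid": "donor-uuid-placeholder"}],
    samples=[{"uuid": "sample-uuid-placeholder", "type": "example sample", "code": "example_code"}],
    size=1024,
)
print(sorted(example))
```

The method's full output would then be a list of such records, one per file, each repeating the dataset-level organs, donors, and samples.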