nih-sparc / sparc.client

Python client for NIH SPARC
https://docs.sparc.science/docs/sparc-python-client
Apache License 2.0

Where should I write a wrapper for a public SPARC dataset? #22

Open elvijs opened 11 months ago

elvijs commented 11 months ago

Context

I'm scoping a study that will involve a dataset upload to Pennsieve.io. I'd love to make it easy for users to interact with the data.

The problem

It looks to me like Pennsieve essentially exposes a collection of files on AWS S3 as a folder and ensures files are organised in a particular manner. This means that in order for users to do anything with the data, they need to open up the README, understand the layout of the folders and then navigate (manually or via a script) to the right files and parse them.
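To make the pain point concrete, the traversal I mean looks roughly like this - a minimal sketch where the bucket name and key prefix are made-up placeholders (the real ones would come from the dataset's Pennsieve record):

```python
# Minimal sketch of the status quo: manually walking a dataset's S3 layout.
# The bucket and prefix below are hypothetical placeholders; in practice you
# would look them up from the dataset's Pennsieve metadata.
import boto3

s3 = boto3.client("s3")

BUCKET = "example-sparc-datasets"        # hypothetical bucket name
PREFIX = "123/files/primary/sub-A/"      # hypothetical SDS-style folder prefix

# List everything under the subject folder, then cross-reference the README
# to work out which of these files actually holds the measurements you want.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```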

Potential solution

I'd like to write a companion data client that lets users ask, for example, "Give me heart rate for subject A at clinic visit 2" (as opposed to having to manually traverse the folders or read the README file). Its job will be to expose a clean API with researcher-friendly terms and hide away the underlying folders as well as the interactions with AWS S3.
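As a rough sketch of the kind of interface I have in mind - every name below is hypothetical and nothing like it exists yet:

```python
# Hypothetical interface sketch for the companion data client.
# None of these classes or methods exist; they only illustrate the
# researcher-facing vocabulary the client would expose.
from dataclasses import dataclass

import pandas as pd


@dataclass
class StudyDataClient:
    """Hides the folder layout and S3 access behind study terminology."""

    dataset_id: int

    def heart_rate(self, subject: str, visit: int) -> pd.DataFrame:
        """Return the heart-rate time series for one subject at one visit.

        Internally this would resolve the right file under
        primary/sub-<subject>/... on Pennsieve/S3 and parse it, so callers
        never deal with paths or file formats directly.
        """
        raise NotImplementedError("interface sketch only")


# Envisioned usage:
#   client = StudyDataClient(dataset_id=123)
#   hr = client.heart_rate(subject="A", visit=2)
```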

Questions

Where would be the right home for such a wrapper - this repo, a standalone package, or somewhere else in the SPARC ecosystem?

nickerso commented 11 months ago

Thanks for starting this discussion @elvijs - it's a great question. Enabling users to avoid having to traverse through files looking for data/knowledge is definitely something that we (in the SPARC Data and Resource Center) are also keen to support.

To date, most of our focus has been on the dataset level and on achieving this on the web - i.e., the SPARC Portal (https://sparc.science) makes use of a dataset knowledge graph, where the metadata from all the published datasets is extracted, processed, and made available for querying. At the highest level, this knowledge is dumped into an Algolia index which powers the main functionality of the portal (e.g., the faceting on the browse data page, https://sparc.science/data?type=dataset, or projects, etc.). More detailed knowledge about specific files within datasets (e.g., used to display images, scaffolds, and segmentations via the dataset details pages and gallery viewer, https://sparc.science/datasets/77?type=dataset&datasetDetailsTab=images) sits in a different index that is (mostly) queried from the portal via the SPARC API (https://github.com/nih-sparc/sparc-api).

Part of the goal with this SPARC Python client is to expose similar capabilities via Python - that part is not implemented yet.

This won't give you the ability to ask "Give me heart rate for subject A at clinic visit 2", but it could give you "List all datasets that have heart rate data in them" - I think.
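As a rough illustration of that dataset-level query against the published (Pennsieve Discover) metadata - the endpoint and parameters here are assumptions on my part and should be checked against the Discover API docs:

```python
# Rough sketch: dataset-level search over published Pennsieve metadata.
# The endpoint and query parameters are assumptions to verify against the
# Pennsieve Discover API documentation; treat this as a starting point only.
import requests

DISCOVER_URL = "https://api.pennsieve.io/discover/datasets"  # assumed endpoint

resp = requests.get(
    DISCOVER_URL,
    params={"query": "heart rate", "limit": 25, "offset": 0},  # assumed params
    timeout=30,
)
resp.raise_for_status()

for ds in resp.json().get("datasets", []):
    print(ds.get("id"), "-", ds.get("name"))
```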

Part of the DRC goals for the coming year is to extend the dataset knowledge graph right down to file-level details - and that, I think, will start to give us the capability to answer that type of query, independent of how the data is contained/spread across one or many datasets.

In terms of building a client that would provide the kind of interface you describe, I suspect a lot of the work might already be done in the SPARC curation tools (https://github.com/SciCrunch/sparc-curation), which already do a lot of trawling through datasets pulling out knowledge. A quick solution might be to teach that set of tooling about the specific types of data and knowledge to pull out and index, in a way that supports the specific queries you envision being useful?

These are just my thoughts (Andre @ MAP Core); I'd be interested to hear what @tgbugs and @jgrethe might have to say. The Pennsieve team might also have other tooling that could be useful here - @muftring?

elvijs commented 11 months ago

Thanks, that's really helpful!

I suspect my next steps are to explore https://github.com/SciCrunch/sparc-curation and sanity-check whether it can support our queries. Will report back once done.