neurobagel / old-query-tool

User interface for searching across the Neurobagel graph
https://query.neurobagel.org/
MIT License

Investigate having human-readable labels instead of term URIs in the output .tsvs #121

Closed alyssadai closed 1 year ago

alyssadai commented 1 year ago

Why

As a user, when I download the metadata of my results, I would like the discrete values in the table to be human-readable (i.e. "Parkinson's disease" rather than "snomed:49049000"), so that I can directly use the data in a script and don't have to look up what the terms mean the way a machine would.
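The issue mentions that a hard-coded mapping of term IRIs to labels already exists in the query tool; a minimal sketch of how such a lookup might work (the object and function names here are illustrative, not the tool's actual code) could be:

```typescript
// Hypothetical hard-coded mapping of prefixed term URIs to human-readable
// labels, mirroring the mapping the issue says already exists in the query tool.
const TERM_LABELS: Record<string, string> = {
  "snomed:49049000": "Parkinson's disease",
};

// Fall back to the raw URI when no label is known, so the output TSV
// never silently loses information.
function toHumanReadable(termURI: string): string {
  return TERM_LABELS[termURI] ?? termURI;
}
```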

What

We should figure out what the best way is to make this happen. Here are some unsorted ideas:

Context

Currently, the subject-level tsv generated by the query tool still contains URIs instead of human-readable labels, unlike the reference output examples provided in the documentation: https://github.com/neurobagel/documentation/wiki/Query-Tool#example-data

I definitely remember us discussing previously that we wanted to implement human-readable labels in the outputs, eventually storing them in the graph itself (https://github.com/neurobagel/project/issues/47), but for now using the "hard-coded mapping of controlled term IRI to human-readable label that already exists in the query tool" (see https://github.com/neurobagel/query-tool/issues/76).

I do think these human-readable labels will be much more user-friendly/compelling as well as useful for verifying the downloaded results of a query, and may be worth prioritizing for our upcoming demos depending on the amount of work needed.

See related:

Conclusion / Outcome

The query tool should continue handling / being aware of the unique termIRIs internally, but should return human-readable labels to the user (optionally, in addition to the unique termIRIs). For this to be possible, a couple of things need to happen:

Related but not the same problem:

alyssadai commented 1 year ago

@rmanaem, any thoughts on how difficult this would be to implement? I'm guessing you'll need to fetch the list of diagnoses/assessments/etc. in the new OpenNeuro graph manually...

rmanaem commented 1 year ago

Shouldn't be too difficult since we have the labels hardcoded. What's interesting to me is that this slipped by and we forgot about it.

rmanaem commented 1 year ago

> Shouldn't be too difficult since we have the labels hardcoded.

Scratch that. It turns out we need some form of context to map URIs to their corresponding human-readable labels, since we hardcoded the categorical options using prefixes while the response from the API contains the full URIs.
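The mismatch described here (prefixed terms on the query-tool side, full URIs from the API) could be bridged with a small context object that shortens full URIs back to their prefixed form before the label lookup. A sketch, where the base namespace URI is an assumption for illustration only, not the actual SNOMED namespace used by Neurobagel:

```typescript
// Hypothetical prefix -> base-URI context; the base URI below is an
// illustrative assumption, not the real namespace used in the graph.
const CONTEXT: Record<string, string> = {
  snomed: "http://purl.bioontology.org/ontology/SNOMEDCT/",
};

// Shorten a full URI from the API response back to the prefixed form
// ("snomed:49049000") that the hard-coded label mapping is keyed on.
function shortenURI(fullURI: string): string {
  for (const [prefix, base] of Object.entries(CONTEXT)) {
    if (fullURI.startsWith(base)) {
      return `${prefix}:${fullURI.slice(base.length)}`;
    }
  }
  return fullURI; // no known prefix: leave the URI untouched
}
```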

rmanaem commented 1 year ago

Blocked, as it requires discussion to figure out implementation prerequisites.

alyssadai commented 1 year ago

Related to https://github.com/neurobagel/api/issues/37

surchs commented 1 year ago

@alyssadai mind taking a look at the current spec and see if anything is missing here?

alyssadai commented 1 year ago

Hey @surchs, the description generally makes sense to me.

> Combine this work with changing what the API returns. E.g. currently the first query already obtains a massive (uncompressed) JSON blob with all the metadata. Maybe we only need this when an actual download is triggered? Then the querying of the terms could happen (by the API?)

I'm not entirely sure I understand what you mean here. Could you elaborate?

surchs commented 1 year ago

> I'm not entirely sure I understand what you mean here. Could you elaborate?

We're going to change query_tool <> API interaction in these ways already:

so I think we might as well add one more aspect and that is:

The reason for this last part is that currently the roundtrip from query -> API -> graph -> API -> query is very slow. And the last step (API -> query) is about half of that, maybe even more, because for every query the API returns all the results that match at the participant level with all the available metadata. So it's a huge JSON blob. But on the query tool side we only really look at the dataset-level summaries until the user actually decides to download any metadata.

So it would be reasonable to say:

That would make the whole process a good bit faster as the final JSON blob being sent back by the API would likely also be much smaller.
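The two-step interaction proposed above can be sketched as follows. The endpoint paths, types, and function names are hypothetical placeholders, not the actual Neurobagel API; the point is only that the expensive participant-level request happens on download, not on every query:

```typescript
// Generic fetcher type so the sketch can be exercised without a network.
type Fetcher = (url: string) => Promise<unknown>;

// Lightweight dataset-level summary returned by the first (cheap) query.
interface DatasetSummary {
  datasetUUID: string;
  numMatchingSubjects: number;
}

// Step 1: a query only asks for dataset-level summaries, which is all the
// query tool needs to display results.
async function queryDatasets(fetchJSON: Fetcher, queryString: string): Promise<DatasetSummary[]> {
  return (await fetchJSON(`/query?${queryString}`)) as DatasetSummary[];
}

// Step 2: the large participant-level JSON blob is requested only when the
// user actually triggers a metadata download.
async function downloadSubjects(fetchJSON: Fetcher, datasetUUIDs: string[]): Promise<unknown> {
  return fetchJSON(`/subjects?datasets=${datasetUUIDs.join(",")}`);
}
```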

surchs commented 1 year ago

idea -> start with cogatlas because they have an API

surchs commented 1 year ago

Blocked by Seb's lack of availability

edit: unblocked again by better understanding of scope 🤷

surchs commented 1 year ago

@alyssadai check "Conclusion" in the issue spec for a list of tasks that relate to new API term endpoints. Please edit and / or close the issue if you agree

alyssadai commented 1 year ago

@surchs since the implementation for this issue depends on larger architectural decisions for the ecosystem (e.g., as multiple tools/steps need human-readable labels), the conclusion points from the description have been absorbed into this larger issue https://github.com/neurobagel/project/issues/47. Will close this one and continue the conversation/create new issues from there.