neurobagel / query-tool

User interface for searching across Neurobagel graph
https://query.neurobagel.org/
MIT License

Can the Neurobagel data structure and query interface be customized (or how complicated would it be to do so)? #307

Open jsheunis opened 2 days ago

jsheunis commented 2 days ago

Hey folks! I've been reading up on Neurobagel and testing a local instance. Firstly, nice work! The node was pretty straightforward to set up.

Now we've started exploring use cases, and I am uncertain whether or how Neurobagel can deal with what we have in mind. Basically, can Neurobagel tooling practically take any data dictionary we cook up? And can the query interface respond to this?

PS: I wasn't sure where exactly to create this issue, since it relates to multiple components, so I just selected the query-tool; please feel free to move the issue wherever is appropriate.

The data dictionary defines the semantic annotations of the columns in the Neurobagel TSV file, so my understanding is that we could technically include any arbitrary columns and annotations as long as we stick to the data dictionary specification (i.e. only categorical columns, continuous columns, or identifier columns). What I am not sure about is the nature of the identifier columns. From my understanding of the docs about the Neurobagel TSV file, rows are equivalent to "participant-sessions", i.e. there are only two identifier columns (Identifies: participant and Identifies: session). Is this a hard requirement for bagel-cli and the query tool? Or can we include an arbitrary number of identifier columns (a single one, or many)? If possible, how will the query interface deal with this? Automatically, or will it need development to deal with the changes? I assume that e.g. Identifies: participant has some internal mapping used in the process of generating graph-ready data, so if we instead say Identifies: sample or Identifies: cuteLittlePuppy, the process will fail?

As noted at the end of this comment datalink/org#2 (comment), my understanding is that Neurobagel has its own internal schema for subjects, sessions, images, etc., which I assume follows BIDS to a major extent. I understand that the bagel-cli can be used to generate phenotypic-only graph-ready data, i.e. a BIDS dataset does not have to accompany the process. But what happens if we have an accompanying scientific dataset that does not conform to BIDS, and we still want to make some/all of its content findable in a Neurobagel node via the query interface? E.g. DNA sequencing or flow cytometry data. Some aspects might be mappable onto the "TSV-file/data-dictionary" paradigm as new columns, but others will not.

So in summary: will Neurobagel components be able to deal with this? If not out of the box, how complicated would it be to customize them? Or would they not be customizable at all?

surchs commented 16 hours ago

Hey @jsheunis, thanks for the questions! They're all very good ones :).

Let me first say something about the high-level idea. Neurobagel's goal is to make the 80% use case for cohort discovery (and the corresponding "how do I get this cohort onto my HDD now" use case) easy to use across datasets - and to demonstrate that it works. The way we have built this initial demonstration is by:

  1. Choosing a couple of key query variables we thought were essential for cohort discovery (age, sex, diagnosis, assessments)
  2. Picking an existing standard vocabulary for each of them (e.g. CogAtlas, SNOMED, BIDS, ...)
  3. Asking everyone to map their own idiosyncratic data into this shared/common language by creating a data dictionary (that's the mapping - see the sketch below), so we can then apply the mapping to everyone's data and get nicely harmonized data to search over
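To make step 3 concrete, here is a minimal sketch of what such a mapping can look like in an annotated data dictionary, modeled on the public Neurobagel examples (treat the exact column name, raw codes, and term URLs as illustrative rather than authoritative). A site that codes sex as 1/2 annotates its raw levels with controlled SNOMED terms:

```json
{
  "sex_coded": {
    "Description": "Sex of the participant, site-specific coding",
    "Levels": { "1": "male", "2": "female" },
    "Annotations": {
      "IsAbout": { "TermURL": "nb:Sex", "Label": "Sex" },
      "Levels": {
        "1": { "TermURL": "snomed:248153007", "Label": "Male" },
        "2": { "TermURL": "snomed:248152002", "Label": "Female" }
      }
    }
  }
}
```

A second site that codes the same variable as M/F would map its levels onto the same TermURLs - and that shared target vocabulary is what lets a single query ("find female participants") work across both datasets.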

Now that we can demonstrate that the cross-dataset query (and for some, the "download this cohort" bit) works, we want to expand the list of query variables based on other use cases. Very likely, what constitutes a good use case will depend a bit on the scientific community you are in (e.g. a neurodegenerative disorder group looks for different specialized query parameters than someone focused on visual attention, and so on). But even across such sub-communities, we want to encourage everyone to have the same "common" core bits, e.g. the basic demographics, maybe diagnosis, imaging modalities, etc. - so that sub-communities can discover each other's data as well, even if just at the superficial level where they overlap.

Or said another way: our goal is to grow a common data model, and to extend that common model with sub-community-specific extensions that cover specific needs (e.g. specific clinical stages). But even within sub-communities, we'd consider an extension to be about a use case that's shared by several sites.

All that is to say: on the technical side we didn't start this out as something that's fully configurable / where you can just use the tools and swap in a different data model and deploy it on your own. To support the sub-community extensions I mentioned, we definitely want to make the tools more configurable (so communities can build these extensions without opening issues for the core team), but we aren't currently considering a use case where you just take the tools and swap out everything about the data model internally to deploy in a one-off, fully custom way. For some tools, like the annotation tool, that'll be quite easy to do. For others, like the query tool and federation API, it'll require a bit more work or thought (e.g. because we currently have an internal SPARQL query template that gets populated when you run a query, and the data model is implicitly encoded in this template - see the sketch below). But this is something we're planning to do, and depending on how much customization your use case needs, it might not be very tricky.
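For illustration, here is a heavily simplified sketch of what such a template could look like. The predicate names (nb:hasSamples, nb:hasSession, nb:hasAge) and the %MIN_AGE%/%MAX_AGE% placeholders are hypothetical stand-ins, not the actual Neurobagel vocabulary or templating mechanism:

```sparql
PREFIX nb: <http://neurobagel.org/vocab/>

# The dataset -> subject -> session shape of the data model is
# hard-coded in the triple patterns below, so swapping in a different
# data model means rewriting the template, not just the data.
SELECT DISTINCT ?dataset ?subject
WHERE {
  ?dataset a nb:Dataset ;
           nb:hasSamples ?subject .
  ?subject a nb:Subject ;
           nb:hasSession ?session .
  ?session nb:hasAge ?age .
  FILTER (?age >= %MIN_AGE% && ?age <= %MAX_AGE%)
}
```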

> so my understanding is that we could technically include any arbitrary columns and annotations as long as we stick to the data dictionary specification (i.e. only categorical columns, continuous columns, or identifier columns).

The schema might be a little confusing because it's designed to be valid for a BIDS-only data dictionary AND for a BIDS+annotations data dictionary. E.g. the CategoricalColumn is the BIDS-only generic type, and the CategoricalNeurobagel is the BIDS+annotation type. You cannot include arbitrary annotations and still have a valid BIDS+annotation dictionary. But more importantly, as I mentioned above, adding arbitrary annotations will not be reflected in the UI or CLI tools.
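As a rough illustration of the two shapes (the column name, codes, and term URLs here are illustrative): the entry below is already a valid BIDS-only CategoricalColumn if you stop after Levels; adding the Annotations object is what makes it a CategoricalNeurobagel. Annotations outside this recognized structure won't validate as a Neurobagel dictionary or be picked up by the tools, as noted above.

```json
{
  "group": {
    "Description": "Diagnostic group",
    "Levels": { "PD": "Parkinson's disease", "HC": "healthy control" },
    "Annotations": {
      "IsAbout": { "TermURL": "nb:Diagnosis", "Label": "Diagnosis" },
      "Levels": {
        "PD": { "TermURL": "snomed:49049000", "Label": "Parkinson's disease" },
        "HC": { "TermURL": "ncit:C94342", "Label": "Healthy Control" }
      }
    }
  }
}
```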

> What I am not sure about is the nature of the identifier columns. From my understanding of the docs about the Neurobagel TSV file, rows are equivalent to "participant-sessions", i.e. there are only two identifier columns (Identifies: participant and Identifies: session). Is this a hard requirement for bagel-cli and the query tool? Or can we include an arbitrary number of identifier columns (a single one, or many)? If possible, how will the query interface deal with this? Automatically, or will it need development to deal with the changes? I assume that e.g. Identifies: participant has some internal mapping used in the process of generating graph-ready data, so if we instead say Identifies: sample or Identifies: cuteLittlePuppy, the process will fail?

Could you say more about what you're trying to do? At the moment, you can have either just "participant-identifier" or "participant-identifier AND session-identifier" columns. We know that for some datasets, a single identifier for participant is not sufficient, e.g. because in that dataset there are multiple ID systems. But I'm not sure if that's what you are asking about here? Generally speaking: the data model is participant-centric, so we always need to know what a participant is, i.e. what their unique identifier is.

If you have a dictionary with cuteLittlePuppy as the Identifies keyword, it would indeed not pass validation; but even if it did, the CLI, the API, and the query tool would not understand what you expect to happen in response to having a column be about cuteLittlePuppy.
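For reference, identifier-column annotations modeled on the public examples look roughly like this (the exact keys and term URLs are illustrative). The allowed values for Identifies are fixed by the data model, which is why a made-up entity fails validation:

```json
{
  "participant_id": {
    "Description": "Unique participant identifier",
    "Annotations": {
      "IsAbout": { "TermURL": "nb:ParticipantID", "Label": "Subject Unique Identifier" },
      "Identifies": "participant"
    }
  },
  "session_id": {
    "Description": "Session identifier",
    "Annotations": {
      "IsAbout": { "TermURL": "nb:SessionID", "Label": "Session Unique Identifier" },
      "Identifies": "session"
    }
  }
}
```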

> As noted at the end of this comment datalink/org#2 (comment), my understanding is that Neurobagel has its own internal schema for subjects, sessions, images, etc., which I assume follows BIDS to a major extent.

We follow the BIDS spec for data dictionaries to maximize compatibility. There is a second model for what the graph looks like that we create from the data dictionaries, and that's independent of BIDS.

> I understand that the bagel-cli can be used to generate phenotypic-only graph-ready data, i.e. a BIDS dataset does not have to accompany the process.

Yes

> But what happens if we have an accompanying scientific dataset that does not conform to BIDS, and we still want to make some/all of its content findable in a Neurobagel node via the query interface?

That depends. If you care about raw imaging, the only way we currently learn about the availability of raw imaging data is by asking pyBIDS (in fact, the bagel-cli for imaging data is mostly a pyBIDS wrapper). For derivatives/processed data, because there is to our knowledge no standard for that, we rely on a tabular input format (currently in development) with an existing schema. If you skip the whole "annotate -> data dictionary -> bagel-cli -> jsonld/graph file" workflow and just create the graph file directly according to our schema here, it's essentially up to you how you want to decide/encode whether a subject has imaging data available - you would only need to make sure that you use the same controlled terms in the graph to refer to imaging modalities etc., otherwise the queries/APIs/query tool will not work with your graph files (see the sketch below).

The reason we rely on pyBIDS is that it makes it easier for us to then do the part where we tell datalad which files to provision when someone finds the subject in a cohort query. But that's the only direct link to BIDS. I feel like we'd need to chat about this a little more so I understand what your constraints/goals are.
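To sketch what "create the graph file directly" can mean, here is a minimal, hand-written JSON-LD fragment for one subject with one imaging session. The property names are modeled loosely on the published schema - treat them as illustrative and defer to the schema linked above:

```json
{
  "@context": {
    "nb": "http://neurobagel.org/vocab/",
    "nidm": "http://purl.org/nidash/nidm#"
  },
  "@type": "nb:Subject",
  "nb:hasLabel": "sub-01",
  "nb:hasSession": [
    {
      "@type": "nb:ImagingSession",
      "nb:hasLabel": "ses-01",
      "nb:hasAcquisition": [
        { "@type": "nb:Acquisition", "nb:hasContrastType": { "@id": "nidm:T1Weighted" } }
      ]
    }
  ]
}
```

The crucial bit is the controlled term for the modality (here nidm:T1Weighted): if a hand-built graph uses a different term for the same concept, the query tool's queries will simply not match it.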

> E.g. DNA sequencing or flow cytometry data. Some aspects might be mappable onto the "TSV-file/data-dictionary" paradigm as new columns, but others will not.

Yeah, that would be rough to map onto tsv+dictionary. The short answer is: you cannot search DNA sequence/variant info with Neurobagel (yet). We are chatting quite a bit with the GA4GH folks, who have thought about how to discover such things in a lot of detail, and I would start from the standards and protocols they have in mind when we do add DNA info.

> So in summary: will Neurobagel components be able to deal with this? If not out of the box, how complicated would it be to customize them? Or would they not be customizable at all?

At the moment, you can't customize the tools to use different data models. We want to add that so we can support use cases for sub-communities (e.g. with our friends who do PD research ...). Generally, such extensions of the data model range from trivial to easy from the graph's perspective, and from very easy to moderately involved for the other tools (e.g. we're just adding the ability to model "has this subject been preprocessed by freesurfer 7.3.2?" - and that's pretty simple). Since you seem to have a couple of use cases that fall outside of what you can do right now, I think it'd be good to chat with you about how we can make these extensions easier.

surchs commented 16 hours ago

Happy to continue this conversation on https://hub.datalad.org/datalink/org/issues/2#issue-21 if you tell me how to sign up for that server :)

jsheunis commented 2 hours ago

Thanks for the detailed response!

> All that is to say: on the technical side we didn't start this out as something that's fully configurable / where you can just use the tools and swap in a different data model and deploy it on your own. To support the sub-community extensions I mentioned, we definitely want to make the tools more configurable

Understandable, and good to hear that some level of configurability is the eventual goal.

> But more importantly, as I mentioned above, adding arbitrary annotations will not be reflected in the UI or CLI tools.

Ok, good to know this explicitly.

> Could you say more about what you're trying to do? At the moment, you can have either just "participant-identifier" or "participant-identifier AND session-identifier" columns. We know that for some datasets, a single identifier for participant is not sufficient, e.g. because in that dataset there are multiple ID systems. But I'm not sure if that's what you are asking about here? Generally speaking: the data model is participant-centric, so we always need to know what a participant is, i.e. what their unique identifier is.

> If you have a dictionary with cuteLittlePuppy as the Identifies keyword, it would indeed not pass validation; but even if it did, the CLI, the API, and the query tool would not understand what you expect to happen in response to having a column be about cuteLittlePuppy.

I gave a stupid example. The point I was trying to get at is whether the schema is, like you say, participant-centric, and whether other data entities need to map onto that in order to yield valid graph-ready data. In the consortia that we work with, this will often be the case, but not always. And users might not want to query a node with a focus on participants, but rather on e.g. samples.

I think a good example of a consortium we might work with is one we have actually worked with: https://www.crc1451.uni-koeln.de/. We have a catalog of data contributed by different groups in the consortium, https://data.sfb1451.de/, and it could eventually be good to be able to query the metadata of this catalog via Neurobagel. If you browse through the catalog you'll see that there are imaging datasets from patients/participants, but also data collected from individual neurons or groups of neurons, where participants aren't even mentioned. Or e.g. microarray analysis on dissections of the CNS of mice (if that even makes sense, I'm no expert), or spike times from stimulation of cockroach brains. Users might want to find all datasets containing spike time measurements, irrespective of the type of animal or cell they were taken from. That's why I focused on the question of identifier columns: we might model a measurement, a sample, or a "data entry" that each receive specific IDs.

> Since you seem to have a couple of use cases that fall outside of what you can do right now, I think it'd be good to chat with you about how we can make these extensions easier.

Agreed, let's do that. Will contact you. Thanks again!