"object_type" column in ExternalResources may not be sufficient

rly commented 1 year ago

We added "object_type" in the objects table in ExternalResources to make queries easier.

But in DynamicTables, the "object_type" would be "VectorData", which is very generic. Querying on it would pick up a lot of false positives, so it does not make queries for annotations of table columns any easier.
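To make the problem concrete, here is a minimal sketch using plain Python dicts rather than the actual HDMF API. The rows, ids, and the `find_by_type` helper are all hypothetical and only mimic the shape of the objects table.

```python
# Hypothetical rows mimicking the objects table (illustrative values only,
# not produced by HDMF).
objects = [
    {"object_id": "id-1", "object_type": "Subject"},
    {"object_id": "id-2", "object_type": "VectorData"},  # e.g. an electrodes column
    {"object_id": "id-3", "object_type": "VectorData"},  # e.g. a trials column
    {"object_id": "id-4", "object_type": "VectorData"},  # e.g. a units column
]


def find_by_type(rows, object_type):
    """Return all rows whose object_type matches exactly (hypothetical helper)."""
    return [row for row in rows if row["object_type"] == object_type]


# A specific type narrows the search to the one object of interest.
subjects = find_by_type(objects, "Subject")

# "VectorData" matches every annotated column, regardless of which table
# or column the user actually cares about.
columns = find_by_type(objects, "VectorData")
```

Searching for "Subject" returns a single row, while searching for "VectorData" returns every annotated column, so the caller still has to open files to tell the columns apart.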

oruebel commented 1 year ago

CC @bendichter @mavaylon1

mavaylon1 commented 1 year ago

I believe the initial idea was to search for more specific structures such as Subject and that the user would want to see a variety of options to narrow down a search for whatever they were looking for. If we want to query something more specific then I think that should come from the feedback of the community. I am not sure of the kind of queries they would want. Query by name? (Assuming they know what they are looking for by name).

oruebel commented 1 year ago

> I believe the initial idea was to search for more specific structures such as Subject

Correct. The issue with tables is that the columns are typically just generic VectorData, which is too generic to be useful in a query.

mavaylon1 commented 9 months ago

@oruebel @rly This has been quiet for a bit and that's my fault. Let's restart the conversation with a question: How would a user search for something they want? In the case above, the "object_type" column is generic. If they want to look for a specific column the only thing they could get would be all objects that are VectorData. Even so, what are they trying to look for? Are they searching by name of the column? Are they searching for all columns that have a certain value?

I think it might be best to think of the "object_type" column as high level, i.e., for finding Subjects, specific TimeSeries subclasses, etc. Instead of changing the structure (because it's already a lot for a user to look at), let's maybe shift to adding more query abilities. Thoughts?

oruebel commented 9 months ago

I think it may be worth separating the issue of search from the main HERD data structure. Common fields for search would probably be object_type and name, but could also include other properties (e.g., the description). Maybe instead of adding object_type to the main ObjectTable we could come up with a strategy to allow the user to specify which properties of objects should be "cached" with HERD to speed up search. Ultimately, the main reason to have this in HERD is to avoid having to open a large number of files to do the search. However, it's not clear to me that this is necessarily something we should do in HERD.

Option 1: Since HERD is being serialized to TSV, one solution may be to have a separate, optional table, object_properties, that would store additional information about objects for search (e.g., object_type and name). If a user could add custom columns to that table, then I think that would help address this issue. Placing this information in a separate table would help separate the desire for search from the core HERD data structure and at the same time make search based on specific properties easy.
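A rough sketch of how Option 1 might look, using only the Python standard library. The file contents, the custom `parent_table` column, and the `load_properties` and `search` helpers are assumptions for illustration; none of them exist in HERD today.

```python
import csv
import io

# Hypothetical contents of an optional object_properties TSV file.
# "parent_table" stands in for a user-defined custom column.
OBJECT_PROPERTIES_TSV = (
    "object_id\tobject_type\tname\tparent_table\n"
    "id-1\tSubject\tsubject\t\n"
    "id-2\tVectorData\tlocation\telectrodes\n"
    "id-3\tVectorData\tspecies\tanimals\n"
)


def load_properties(tsv_text):
    """Parse the TSV text into a list of dicts, one per object."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def search(rows, **criteria):
    """Return rows where every given column equals the requested value."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


rows = load_properties(OBJECT_PROPERTIES_TSV)

# Combining object_type with name (or a custom column) disambiguates
# columns that would otherwise all match "VectorData".
matches = search(rows, object_type="VectorData", name="location")
```

Because the table is keyed by object_id, it could be joined back to the ObjectTable without changing the core HERD tables.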

Option 2: An alternative approach could be to store a separate JSON file in HERD, which would have the same length as the ObjectTable and would store for each object a flat dict of key/value pairs. In this way, each object can have its own set of key/value pairs that it needs to expose for search.
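A rough sketch of how Option 2 might look, again with the standard library only. The ids, the property values, and the `search` helper are hypothetical; the only assumption taken from the proposal is that the JSON list is index-aligned with the ObjectTable.

```python
import json

# Hypothetical object ids, in the same order as the ObjectTable rows.
object_ids = ["id-1", "id-2", "id-3"]

# Hypothetical JSON sidecar: one flat dict per object, index-aligned with
# the ObjectTable. Each object may expose a different set of keys.
properties_json = json.dumps([
    {"object_type": "Subject", "name": "subject"},
    {"object_type": "VectorData", "name": "location", "description": "brain region"},
    {"object_type": "VectorData", "name": "species"},
])


def search(ids, properties, **criteria):
    """Return ids of objects whose properties match all criteria.

    A key that an object does not expose never matches.
    """
    return [
        object_id
        for object_id, props in zip(ids, properties)
        if all(props.get(key) == value for key, value in criteria.items())
    ]


properties = json.loads(properties_json)

# The sidecar must stay the same length as the ObjectTable.
assert len(properties) == len(object_ids)

hits = search(object_ids, properties, object_type="VectorData", name="location")
```

The search here is a linear scan over all objects, which is where the performance concern comes from, although it avoids opening any data files.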

These are just some ideas. We should discuss this issue a bit more. At first glance, I think Option 2 would be the most flexible and may also be the easiest to implement. It may not be optimal in terms of search performance, but it is probably still reasonable given the expected size.

hdmf-dev / hdmf-common-schema

"object_type" column in ExternalResources may not be sufficient #71