Search all columns does not work as expected

nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

Search all columns does not work as expected #147

Closed ACharbonneau closed 3 years ago

ACharbonneau commented 3 years ago

If I go to a page such as https://app-staging.nih-cfde.org/chaise/recordset/#1/CFDE:biosample@sort(RID)

and search by 'stomach' I get zero results, even though 'stomach' was clearly in the first row:

It appears that Deriva is displaying the content of 'Name' from the anatomy table, but only searching the 'ID' column. This means that for a user to use "search by all columns" they have to first go and look up the Uberon ID for stomach:

This is really cumbersome and unintuitive.

jrchudy commented 3 years ago

An alternative for the user is to use the "Anatomy" facet in left panel which properly searches for stomach.

karlcz commented 3 years ago

Unfortunately this is a known limitation of the table search box in chaise. As you surmised, it searches actual columns of the table which in C2M2 contain concept IDs, not human-readable term names.
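
To illustrate the limitation, here is a minimal sketch. The in-memory rows, the example Uberon ID, and the helper names `search_all_columns` and `displayed` are assumptions for illustration only; this is not how chaise is actually implemented.

```python
# Hypothetical stand-ins for the anatomy vocabulary and the biosample table.
anatomy = {"UBERON:0000945": "stomach"}
biosamples = [{"local_id": "BS-001", "anatomy": "UBERON:0000945"}]


def search_all_columns(rows, needle):
    """Case-insensitive substring match against the stored column values only."""
    needle = needle.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]


def displayed(row):
    """What the user sees: the stored anatomy ID is rendered as its term name."""
    return {**row, "anatomy": anatomy[row["anatomy"]]}


by_name = search_all_columns(biosamples, "stomach")        # no rows match
by_id = search_all_columns(biosamples, "uberon:0000945")   # the row matches
```

The row displays "stomach" but stores only the concept ID, so the name never matches while the ID does.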

In the near term, I think we could only really address this in one of two ways:

try to get chaise to disable the search box and avoid the misleading UX (forcing users to use the facet controls on left)
try to add more post-processing to the ingest pipeline to augment tables with (hidden) keyword material to try to make the search match on these extra concepts

I'm nervous about trying to do either of these until a subsequent dev cycle, however...
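
A rough sketch of what the second option could look like, assuming a post-processing step that writes a hidden keyword column. The column name `_keywords` and the helpers `add_keywords` and `search` are made up for this example and are not part of the existing ingest pipeline.

```python
# Hypothetical example data; the real tables live in the catalog.
anatomy_names = {"UBERON:0000945": "stomach"}
biosamples = [{"local_id": "BS-001", "anatomy": "UBERON:0000945"}]


def add_keywords(rows, vocab, fk_column, kw_column="_keywords"):
    """Return copies of rows with an extra column holding the row's own
    values plus the human-readable name of the linked vocabulary term."""
    out = []
    for row in rows:
        term_name = vocab.get(row.get(fk_column), "")
        parts = [str(v) for v in row.values() if v is not None] + [term_name]
        out.append({**row, kw_column: " ".join(p for p in parts if p)})
    return out


def search(rows, needle, kw_column="_keywords"):
    """Substring search restricted to the hidden keyword column."""
    needle = needle.lower()
    return [r for r in rows if needle in r[kw_column].lower()]


augmented = add_keywords(biosamples, anatomy_names, "anatomy")
```

With the keyword column in place, `search(augmented, "stomach")` finds the row even though the table itself still stores only the concept ID.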

ACharbonneau commented 3 years ago

So, we're planning to make this portal more public at the end of March with the real data release, so I don't think we need to care about it for epic 2 per se. But I think it would be good to deal with it for the end of March, even if it is just a temp solution of hiding the search.

karlcz commented 3 years ago

@ACharbonneau I want to do another round of triage on this issue. To help me, could you go through each of the main C2M2 tables and enumerate which fields you think should be included as matching material for the recordset searchbox for that table? E.g. something like the following:

file searches:

local_id
filename
creation_time (or skip?)
size_in_bytes (or skip?)
md5 (or skip?)
sha256 (or skip?)
linked id namespace id, name, abbreviation
linked project id_namespace, local_id, name
linked file format id_namespace, local_id, name, synonyms
linked data type id_namespace, local_id, name, synonyms
linked assay type id_namespace, local_id, name, synonyms

From other projects, I have heard that things like description of linked terms should be excluded because it would lead to too many false positive matches for many bio concepts.

I wonder whether anybody would try to search by the other numerical or machine-oriented values like timestamp, checksums, byte count, or the various id namespace and local id values of all the linked terms. I can see this tilting either way, e.g. some user trying to paste in a rather specific value, but unfortunately these values may have lots of "random-looking" sequences in them which could be false positive matches for other short keywords in some vocabulary...
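
A back-of-envelope sketch of the checksum concern. The digests below are made up, not real checksums, and the estimate assumes uniformly random hex digits, so it only applies to keywords spelled with the characters 0-9 and a-f.

```python
def expected_chance_hits(n_files, keyword_len, digest_len=32):
    """Expected number of chance occurrences of one hex-spellable keyword
    across n_files random hex digests of the given length."""
    positions = digest_len - keyword_len + 1
    return n_files * positions / 16 ** keyword_len


# Made-up 32-character digests for illustration.
files = [
    {"filename": "a.fastq", "md5": "0c1bead47f2e9a6d5b3c8e1f4a7d2b90"},
    {"filename": "b.fastq", "md5": "7a9f3c52d8e14b60a1c3e5f7092b4d68"},
]


def files_matching(rows, needle):
    """Substring match over filename and checksum, as a naive searchbox would."""
    needle = needle.lower()
    return [
        r["filename"]
        for r in rows
        if needle in r["md5"] or needle in r["filename"].lower()
    ]


hits = files_matching(files, "bead")              # matches a.fastq by accident
estimate = expected_chance_hits(1_000_000, 4)     # roughly 442 chance hits
```

Under these assumptions, a four-letter keyword like "bead" would turn up by chance in a few hundred of every million MD5 values.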

karlcz commented 3 years ago

I forgot to mention another question regarding my previous comment: would you expect that any "indirectly linked" concepts should be matched by the searchbox? E.g. if searching files, would you expect biosample anatomy term names and/or subject taxonomy term names to also match if connected to a file by C2M2 relationships? And if so, via which relationship path(s)?

file -- file_describes_biosample -- biosample
file -- file_describes_subject -- subject
file -- file_describes_biosample -- biosample -- biosample_from_subject -- subject
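
A small sketch of why the choice of path matters. The association rows here are invented, and `follow` and `subjects_for_file` are hypothetical helpers; the point is only that the direct and indirect paths can reach different subjects.

```python
# Invented association rows: (source id, target id).
file_describes_biosample = [("F1", "B1")]
file_describes_subject = [("F1", "S1")]
biosample_from_subject = [("B1", "S2")]


def follow(pairs, sources):
    """One hop across an association table: all targets linked to any source."""
    return {dst for src, dst in pairs if src in sources}


def subjects_for_file(file_id):
    """Subjects reachable directly, and subjects reachable via biosamples."""
    direct = follow(file_describes_subject, {file_id})
    biosamples = follow(file_describes_biosample, {file_id})
    via_biosample = follow(biosample_from_subject, biosamples)
    return direct, via_biosample


direct, via_biosample = subjects_for_file("F1")
```

In this made-up data the file reaches S1 directly and S2 through its biosample, so taxonomy terms matched for the file depend on which path(s) feed the searchbox.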

ACharbonneau commented 3 years ago

Do you have a map of how the 'refine search' boxes work right now? There's clearly some indirect linking, but I'm not sure how to tell what path they're taking.

karlcz commented 3 years ago

Facets on file table:

Data Type: file -- data_type vocab table
File Format: file -- file_format vocab table
Assay Type: file -- assay_type vocab table
Anatomy: file -- file_anatomy (ETL derived table) -- anatomy vocab table
Taxonomy: file -- file_subject_role_taxonomy (ETL derived table) -- ncbi_taxonomy vocab table
Common Fund Program: file -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project
(Super) Project: file -- project -- project_in_project_transitive (ETL derived table) -- project
Subject Granularity: file -- file_subject_granularity (ETL derived table) -- subject_granularity vocab table
Subject Role: file -- file_subject_role_taxonomy (ETL derived table) -- subject_role vocab table
Part of Collection: file -- file_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection
Described Biosamples: file -- file_describes_biosample -- biosample
Described Subjects: file -- file_describes_subject -- subject

Facets on biosample table:

Assay Type: biosample -- biosample_assay_type (ETL derived table) -- assay_type vocab table
Anatomy: biosample -- anatomy vocab table
Subject Taxonomy: biosample -- biosample_from_subject -- subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table
Common Fund Program: biosample -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project
(Super) Project: biosample -- project -- project_in_project_transitive (ETL derived table) -- project
Subject: biosample -- biosample_from_subject -- subject
File: biosample -- file_describes_biosample -- file
Part of Collection: biosample -- biosample_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection

Facets in subject table:

Taxonomy: subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table
Granularity: subject -- subject_granularity vocab table
Taxonomic Role: subject -- subject_role_taxonomy -- subject_role vocab table
Common Fund Program: subject -- project -- project_in_project_transitive (ETL derived table) -- root_project (ETL derived table) -- project
(Super) Project: subject -- project -- project_in_project_transitive (ETL derived table) -- project
Biosample: subject -- biosample_from_subject -- biosample
File: subject -- file_describes_subject -- file
Part of Collection: subject -- subject_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection

ACharbonneau commented 3 years ago

> From other projects, I have heard that things like description of linked terms should be excluded because it would lead to too many false positive matches for many bio concepts.

I think this will probably be true in the future, but at the moment, description is the only place that might tell you anything about disease or the study, so I'm inclined to make it searchable until the model starts including those concepts

Does the top-level "search all columns" box understand operators like < and >? If not, I think we shouldn't include time/size/similar values in the search. They're searchable in the facets and that's fine.

For File, my current thinking is:

local_id
filename
linked id namespace id, name, abbreviation
linked project id_namespace, local_id, name, description
linked file format id, name
linked data type id, name
linked assay type id, name
linked anatomy -- file_anatomy (ETL derived table) -- id, name
linked taxonomy -- file_subject_role_taxonomy (ETL derived table) -- id, name
(Super) Project: file -- project -- project_in_project_transitive (ETL derived table) -- project name, local_id
Subject Granularity: file -- file_subject_granularity (ETL derived table) -- subject_granularity vocab table name
Subject Role: file -- file_subject_role_taxonomy (ETL derived table) -- subject_role vocab table name
Part of Collection: file -- file_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection id_namespace, local_id, name, description

I don't quite understand if this is different from namespace and/or (Super) Project:

Common Fund Program: file -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project

If it's a separate concept, then id, name, abbreviation

karlcz commented 3 years ago

The difference between Common Fund Program and (Super) Project facets is that the former is a subset consisting only of the "root projects" in the forest of projects, while the latter includes subprojects along the path.
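
As a sketch of that distinction, assuming each project has at most one parent (a forest with no cycles) and that the transitive chain includes the project itself. The edge list and helper names are invented for this example.

```python
# Invented containment edges: (parent project, child project).
project_in_project = [("DCC", "study"), ("study", "substudy")]


def ancestors_including_self(project, edges):
    """Walk parent links upward; this is the '(Super) Project' chain.
    Assumes at most one parent per project and no cycles."""
    parent_of = {child: parent for parent, child in edges}
    chain = [project]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def root_of(project, edges):
    """The last element of the chain; this is the 'Common Fund Program'."""
    return ancestors_including_self(project, edges)[-1]


chain = ancestors_including_self("substudy", project_in_project)
root = root_of("substudy", project_in_project)
```

Here the (Super) Project facet would offer every entry in `chain`, while the Common Fund Program facet would offer only `root`.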

The searchbox is only doing substring matching so does not understand ordering relationships. I agree we should leave out the time/size info from this as it likely produces confusing results for a naive user.
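
A tiny illustration of the confusion, using invented byte counts and a hypothetical `substring_search` helper that mimics plain substring matching:

```python
# Invented size_in_bytes values.
sizes = [100, 1000, 21005, 999]


def substring_search(values, needle):
    """Match the needle anywhere in the text form of each value."""
    return [v for v in values if needle in str(v)]


hits = substring_search(sizes, "100")         # matches 100, 1000 and 21005
range_hits = substring_search(sizes, ">500")  # matches nothing
```

Typing "100" pulls in unrelated sizes that merely contain those digits, and a range-style query such as ">500" silently returns nothing.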

I was going to interpret your answer as "include id, name, description from every concept searchable by facets". Is that right? You left description off the (Super) Project chain but it is also formatted differently, so I assume it might have been an accidental difference.

ACharbonneau commented 3 years ago

Sorry. I didn't mean to format that one differently.

I was trying to make description only in places that it seemed narrow enough to be helpful. Like you said, descriptions can give too many false positives. So, I don't want description for Common Fund Program, because at that point you'll get back every file from a DCC which isn't helpful. Assuming I understand "Super Project" I think that's also too broad. If it makes sense in the context of the database, I would like to only include descriptions for sub-projects.

I would have preferred synonyms rather than descriptions for the CV terms, but we dropped synonyms. Reading through the CV descriptions again, I think I don't like them. I've edited my comment above.

Basically I'm making this up rather than having a lot of informed opinions to draw from, so thank you for the questions, it helps me clarify my thinking :)

karlcz commented 3 years ago

I have a working prototype of this revised searchbox behavior for the file table in this test submission on dev: https://app-dev.nih-cfde.org/chaise/record/#registry/CFDE:datapackage/RID=986 i.e. when browsing https://app-dev.nih-cfde.org/chaise/recordset/#293/CFDE:file the searchbox matches indirect text as described above.

As examples, try typing "perineum", "muscle", or "blood" for some anatomy matches.

karlcz commented 3 years ago

@ACharbonneau Do you think it makes sense to only support "provenance" keywords for biosample and subject search boxes? E.g. subject only includes subject/project/collection matching (no biosample nor file metadata) and biosample includes biosample/subject/project/collection (no file metadata)?

ACharbonneau commented 3 years ago

This looks really good! I don't know how I would test if it is giving me all the results I would want for a search, but it is definitely giving me results that fit my expectations. Here's a first attempt at the other two:

biosample table:

local id
linked id namespace id, name, abbreviation
linked project id_namespace, local_id, name, description
Anatomy: biosample -- anatomy vocab table id, name
Assay Type: biosample -- biosample_assay_type (ETL derived table) -- assay_type vocab table
linked taxonomy biosample -- biosample_from_subject -- subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table -- id, name
(Super) Project: biosample -- project -- project_in_project_transitive (ETL derived table) -- project name, local_id
Common Fund Program: biosample -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project id, name, abbreviation
Subject: biosample -- biosample_from_subject -- subject local_id and the linked granularity from vocab table
File: biosample -- file_describes_biosample -- file linked data_type from vocab table
Part of Collection: biosample -- biosample_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection id_namespace, local_id, name, description
file_format: biosample -- file_describes_biosample -- file linked file_format from vocab table

subject table:

local id
linked id namespace id, name, abbreviation
linked project id_namespace, local_id, name, description
linked granularity id, name
linked taxonomy id, name
Taxonomic Role: subject -- subject_role_taxonomy -- subject_role vocab table name
Common Fund Program: subject -- project -- project_in_project_transitive (ETL derived table) -- root_project (ETL derived table) -- project id, name, abbreviation
(Super) Project: subject -- project -- project_in_project_transitive (ETL derived table) -- project name, local_id
Assay Type: subject -- file_describes_subject -- file -- assay_type vocab table
Biosample: subject -- biosample_from_subject -- biosample local_id and linked anatomy from vocab table
File: subject -- file_describes_subject -- file linked data_type from vocab table
Part of Collection: subject -- subject_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection id_namespace, local_id, name, description
file_format: subject -- file_describes_subject -- file linked file_format from vocab table

karlcz commented 3 years ago

Did you intentionally leave out the file format and assay type from those last two tables or will you probably want to add those too if you think about it...?

ACharbonneau commented 3 years ago

Just bad at things. Editing now

karlcz commented 3 years ago

This submission https://app-dev.nih-cfde.org/chaise/record/#registry/CFDE:datapackage/RID=99M

Browse at https://app-dev.nih-cfde.org/chaise/recordset/#294/CFDE:file

adds searchbox customization for biosample and subject tables too

ACharbonneau commented 3 years ago

This looks good. I like that the text in the search box is changed from 'search all columns' to go with it, and it is giving me reasonable results.

ACharbonneau commented 3 years ago

> @ACharbonneau Do you think it makes sense to only support "provenance" keywords for biosample and subject search boxes? E.g. subject only includes subject/project/collection matching (no biosample nor file metadata) and biosample includes biosample/subject/project/collection (no file metadata)?

Sorry I completely missed this question.

I think that the concept of provenance as you're using it is really driven by the model and how connections are made between tables, and I think that is a reasonable way to think about it from an engineering perspective, but I expect that the only people who will ever look at our model are us, and DCCs trying to build datapackages. I wouldn't expect that users coming to the portal would have any idea what our underlying connections are, or that they've necessarily thought deeply about what concepts make sense to search from others. They will, I think, expect that they can make cohorts of data based on connections that make sense in the context of a study. So for example, I would want to filter my potential subjects by file_type, i.e. finding subjects that fit some set of biological criteria and then only getting the ones that have CRAMs, because then I can use my workflow that accepts CRAMs as input.

All that said, I can see it being a thing that confuses users, especially since it's filtering on criteria you can't otherwise see in that page. But given that we don't have any users yet besides me and the testing team, I only really have my opinion to go on, and I like being able to search the broader metadata. If we get other feedback, I'm happy to reconsider.

karlcz commented 3 years ago

Changes for improved searchbox behavior have been briefly tested on dev and merged for inclusion in upcoming releases.

karlcz commented 3 years ago

@ACharbonneau I understand your UX concern so the changes are symmetric in providing connected file and biosample metadata when searching subjects, connected biosample and subject metadata when searching files, and connected file and subject metadata when searching biosamples. However, my definition of "connected metadata" is the vocabulary terms associated with connected entities. I did not include the unique identifiers/properties of connected entities. So, the searchbox can recognize identifiers for subjects while searching subjects, or controlled terms such as file format, data type, assay type, or anatomy. But, as currently configured, it will not find a subject based on a biosample identifier, file identifier, checksum, nor file name. Please continue commenting here if you think this needs adjustment or find other problems during testing of this new feature...
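
To make the current rule concrete, here is a minimal sketch for the subject table. The records and the helpers `subject_keywords` and `matches` are invented for illustration and do not reflect the actual configuration format.

```python
# Invented records for one subject and its connected entities.
subject = {"local_id": "SUBJ-7"}
connected_files = [{"filename": "reads.bin", "file_format": "CRAM"}]
connected_biosamples = [{"local_id": "BS-3", "anatomy": "stomach"}]


def subject_keywords(subject, files, biosamples):
    """Searchable text: the subject's own identifier plus the vocabulary
    terms of connected entities, but not their identifiers or names."""
    own = [subject["local_id"]]
    terms = [f["file_format"] for f in files] + [b["anatomy"] for b in biosamples]
    return " ".join(own + terms).lower()


def matches(keywords, needle):
    """Plain case-insensitive substring match."""
    return needle.lower() in keywords


keywords = subject_keywords(subject, connected_files, connected_biosamples)
```

With this data, "SUBJ-7", "CRAM", and "stomach" all find the subject, while the connected file name "reads.bin" and the biosample identifier "BS-3" do not.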