Closed ACharbonneau closed 3 years ago
An alternative for the user is to use the "Anatomy" facet in left panel which properly searches for stomach.
Unfortunately this is a known limitation of the table search box in chaise. As you surmised, it searches actual columns of the table which in C2M2 contain concept IDs, not human-readable term names.
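A minimal sketch of this limitation, using made-up rows rather than actual Chaise code. The `biosample_rows` and `anatomy_names` structures and both helper functions are hypothetical, and the Uberon IDs are included only as sample values:

```python
# Hypothetical biosample rows: the table stores only the anatomy concept ID.
biosample_rows = [
    {"local_id": "BS-1", "anatomy": "UBERON:0000945"},
    {"local_id": "BS-2", "anatomy": "UBERON:0002107"},
]

# Hypothetical vocabulary lookup, used for display only.
anatomy_names = {"UBERON:0000945": "stomach", "UBERON:0002107": "liver"}


def search_columns(rows, keyword):
    """Case-insensitive substring match over the stored column values only."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


def displayed_anatomy(row):
    """What the user sees in the table: the term name, not the stored ID."""
    return anatomy_names[row["anatomy"]]


print(displayed_anatomy(biosample_rows[0]))              # the name shown to the user
print(search_columns(biosample_rows, "stomach"))         # search on the name
print(search_columns(biosample_rows, "UBERON:0000945"))  # search on the stored ID
```

The name is what the user sees, but only the stored ID is matched, so the keyword the user is most likely to type returns nothing.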
In the near term, I think we could only really address this in one of two ways:
I'm nervous about trying to do either of these until a subsequent dev cycle, however...
So, we're planning to make this portal more public at the end of March with the real data release, so I don't think we need to care about this for epic 2 per se. But I think it would be good to deal with it for the end of March, even if it is just a temporary solution of hiding the search.
@ACharbonneau I want to do another round of triage on this issue. To help me, could you go through each of the main C2M2 tables and enumerate which fields you think should be included as matching material for the recordset searchbox for that table? E.g. something like the following:
file searches:
local_id
filename
creation_time
(or skip?) size_in_bytes
(or skip?) md5
(or skip?) sha256
(or skip?) id, name, abbreviation
id_namespace, local_id, name
id_namespace, local_id, name, synonyms
id_namespace, local_id, name, synonyms
id_namespace, local_id, name, synonyms
From other projects, I have heard that things like description of linked terms should be excluded because it would lead to too many false positive matches for many bio concepts.
I wonder whether anybody would try to search by the other numerical or machine-oriented values like timestamp, checksums, byte count, or the various id namespace and local id values of all the linked terms. I can see this tilting either way, e.g. some user trying to paste in a rather specific value, but unfortunately these values may have lots of "random-looking" sequences in them which could be false positive matches for other short keywords in some vocabulary...
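To illustrate the false-positive concern, here is a small sketch with a made-up file row. The checksum value, the `search` helper, and the field lists are all hypothetical:

```python
# Hypothetical file row; the md5 value is invented for illustration.
rows = [
    {"filename": "sample_01.cram", "md5": "3fadedbeef0c0ffee1234567890abcde"},
]


def search(rows, keyword, fields):
    """Case-insensitive substring match restricted to the given fields."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(row[field]).lower() for field in fields)
    ]


# With the checksum included, an ordinary short word hits the hex digest.
print(search(rows, "beef", ["filename", "md5"]))

# With the checksum excluded, the same keyword no longer matches.
print(search(rows, "beef", ["filename"]))
```

The shorter the keyword, the more likely it is to appear somewhere inside a long hexadecimal string by chance.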
I forgot to mention another question regarding my previous comment: would you expect that any "indirectly linked" concepts should be matched by the searchbox? E.g. if searching files, would you expect biosample anatomy term names and/or subject taxonomy term names to also match if connected to a file by C2M2 relationships? And if so, via which relationship path(s)?
file -- file_describes_biosample -- biosample
file -- file_describes_subject -- subject
file -- file_describes_biosample -- biosample -- biosample_from_subject -- subject
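The question of indirect matching can be sketched as follows. This is only an illustration with tiny made-up tables; the table contents and helper functions are hypothetical, and only the relationship names come from the paths above:

```python
# Hypothetical association tables as (from_id, to_id) pairs.
file_describes_biosample = [("F1", "B1"), ("F2", "B2")]
biosample_from_subject = [("B1", "S1"), ("B2", "S2")]

# Hypothetical term names already resolved for each entity.
biosample_anatomy = {"B1": "stomach", "B2": "blood"}
subject_taxonomy = {"S1": "Homo sapiens", "S2": "Mus musculus"}


def linked(pairs, key):
    """Follow one association table from a given ID."""
    return [target for source, target in pairs if source == key]


def file_search_text(file_id):
    """Collect term names reachable from a file via the paths above."""
    terms = set()
    for biosample in linked(file_describes_biosample, file_id):
        terms.add(biosample_anatomy[biosample])
        for subject in linked(biosample_from_subject, biosample):
            terms.add(subject_taxonomy[subject])
    return terms


def search_files(keyword):
    """Return the files whose indirectly linked term names contain the keyword."""
    needle = keyword.lower()
    return [
        file_id for file_id in ("F1", "F2")
        if any(needle in term.lower() for term in file_search_text(file_id))
    ]


print(search_files("stomach"))
print(search_files("mus"))
```

Each additional path widens what a keyword can match, which is why the choice of paths matters.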
Do you have a map of how the 'refine search' boxes work right now? There's clearly some indirect linking, but I'm not sure how to tell what path they're taking.
Facets on file table:
file -- data_type vocab table
file -- file_format vocab table
file -- assay_type vocab table
file -- file_anatomy (ETL derived table) -- anatomy vocab table
file -- file_subject_role_taxonomy (ETL derived table) -- ncbi_taxonomy vocab table
file -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project
file -- project -- project_in_project_transitive (ETL derived table) -- project
file -- file_subject_granularity (ETL derived table) -- subject_granularity vocab table
file -- file_subject_role_taxonomy (ETL derived table) -- subject_role vocab table
file -- file_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection
file -- file_describes_biosample -- biosample
file -- file_describes_subject -- subject
Facets on biosample table:
biosample -- biosample_assay_type (ETL derived table) -- assay_type vocab table
biosample -- anatomy vocab table
biosample -- biosample_from_subject -- subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table
biosample -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project
biosample -- project -- project_in_project_transitive (ETL derived table) -- project
biosample -- biosample_from_subject -- subject
biosample -- file_describes_biosample -- file
biosample -- biosample_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection
Facets on subject table:
subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table
subject -- subject_granularity vocab table
subject -- subject_role_taxonomy -- subject_role vocab table
subject -- project -- project_in_project_transitive (ETL derived table) -- root_project (ETL derived table) -- project
subject -- project -- project_in_project_transitive (ETL derived table) -- project
subject -- biosample_from_subject -- biosample
subject -- file_describes_subject -- file
subject -- subject_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection
From other projects, I have heard that things like description of linked terms should be excluded because it would lead to too many false positive matches for many bio concepts.
I think this will probably be true in the future, but at the moment, description is the only place that might tell you anything about disease or the study, so I'm inclined to make it searchable until the model starts including those concepts
Does the top-level "search all columns" box understand concepts like < and >? If not, I think we shouldn't include time/size/similar fields in the search. They're searchable in the facets and that's fine.
For File, my current thinking is:
local_id
filename
id, name, abbreviation
id_namespace, local_id, name, description
id, name
id, name
id, name
id, name
id, name
file -- project -- project_in_project_transitive (ETL derived table) -- project: name, local_id
file -- file_subject_granularity (ETL derived table) -- subject_granularity vocab table: name
file -- file_subject_role_taxonomy (ETL derived table) -- subject_role vocab table: name
file -- file_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection: id_namespace, local_id, name, description
I don't quite understand if this is different from namespace and/or (Super) Project:
file -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project
If it's a separate concept, then id, name, abbreviation
The difference between Common Fund Program and (Super) Project facets is that the former is a subset consisting only of the "root projects" in the forest of projects, while the latter includes subprojects along the path.
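A rough sketch of that distinction, assuming a simple child-to-parent mapping. The project names and both helpers are hypothetical and are not taken from the actual ETL:

```python
# Hypothetical parent links: child -> parent (None marks a root project).
parent = {
    "DCC-A": None,
    "A-study": "DCC-A",
    "A-substudy": "A-study",
    "DCC-B": None,
}


def ancestors_inclusive(project):
    """The project itself plus every project above it, ending at the root."""
    chain = []
    while project is not None:
        chain.append(project)
        project = parent[project]
    return chain


def root_of(project):
    """Only the root of the tree that contains the project."""
    return ancestors_inclusive(project)[-1]


# A file attached to "A-substudy" would appear under one root project,
# but under every project along the path in the broader facet.
print(root_of("A-substudy"))
print(ancestors_inclusive("A-substudy"))
```

Here `root_of` corresponds to the narrower Common Fund Program facet, and `ancestors_inclusive` to the (Super) Project facet that also includes subprojects along the path.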
The searchbox is only doing substring matching so does not understand ordering relationships. I agree we should leave out the time/size info from this as it likely produces confusing results for a naive user.
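As an illustration of why numeric fields fit poorly with substring matching, consider this sketch. The rows and the `substring_search` helper are made up for the example:

```python
# Hypothetical file rows with a numeric size column.
rows = [
    {"filename": "a.cram", "size_in_bytes": 1024},
    {"filename": "b.cram", "size_in_bytes": 310245},
]


def substring_search(rows, keyword):
    """Case-insensitive substring match over every value, numbers included."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# "1024" matches the exact size and also the digits inside 310245.
print(substring_search(rows, "1024"))

# An ordering expression is just text, so it matches nothing.
print(substring_search(rows, ">1000"))
```

A user looking for one specific size gets an unrelated file as well, and a user expecting a range comparison gets nothing.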
I was going to interpret your answer as "include id, name, description from every concept searchable by facets". Is that right? You left description off the (Super) Project chain but it is also formatted differently, so I assume it might have been an accidental difference.
Sorry. I didn't mean to format that one differently.
I was trying to include description only in places where it seemed narrow enough to be helpful. Like you said, descriptions can give too many false positives. So, I don't want description for Common Fund Program, because at that point you'll get back every file from a DCC, which isn't helpful. Assuming I understand "Super Project", I think that's also too broad. If it makes sense in the context of the database, I would like to only include descriptions for sub-projects.
I would have preferred synonyms rather than descriptions for the CV terms, but we dropped synonyms. Reading through the CV descriptions again, I think I don't like them. I've edited my comment above.
Basically I'm making this up rather than having a lot of informed opinions to draw from, so thank you for the questions; they help me clarify my thinking :)
I have a working prototype of this revised searchbox behavior for the file table in this test submission on dev: https://app-dev.nih-cfde.org/chaise/record/#registry/CFDE:datapackage/RID=986 When browsing https://app-dev.nih-cfde.org/chaise/recordset/#293/CFDE:file the searchbox matches indirect text as described above.
As examples, try typing "perineum", "muscle", or "blood" for some anatomy matches.
@ACharbonneau Do you think it makes sense to only support "provenance" keywords for biosample and subject search boxes? E.g. subject only includes subject/project/collection matching (no biosample nor file metadata) and biosample includes biosample/subject/project/collection (no file metadata)?
This looks really good! I don't know how I would test if it is giving me all the results I would want for a search, but it is definitely giving me results that fit my expectations. Here's a first attempt at the other two:
biosample table:
local id
id, name, abbreviation
id_namespace, local_id, name, description
biosample -- anatomy vocab table: id, name
biosample -- biosample_assay_type (ETL derived table) -- assay_type vocab table
biosample -- biosample_from_subject -- subject -- subject_role_taxonomy -- ncbi_taxonomy vocab table: id, name
biosample -- project -- project_in_project_transitive (ETL derived table): name, local_id
biosample -- project -- project_in_project_transitive (ETL derived table) -- project_root (ETL derived table) -- project: id, name, abbreviation
biosample -- biosample_from_subject -- subject: local_id and the linked granularity from vocab table
biosample -- file_describes_biosample -- file: linked data_type from vocab table
biosample -- biosample_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection: id_namespace, local_id, name, description
biosample -- file_describes_biosample -- file: linked file_format from vocab table
subject table:
local id
id, name, abbreviation
id_namespace, local_id, name, description
id, name
id, name
subject -- subject_role_taxonomy -- subject_role vocab table: name
subject -- project -- project_in_project_transitive (ETL derived table) -- root_project (ETL derived table) -- project: id, name, abbreviation
subject -- project -- project_in_project_transitive (ETL derived table) -- project: name, local_id
subject -- file_describes_subject -- file -- assay_type vocab table
subject -- biosample_from_subject -- biosample: local_id and linked anatomy from vocab table
subject -- file_describes_subject -- file: linked data_type from vocab table
subject -- subject_in_collection -- collection -- collection_in_collection_transitive (ETL derived table) -- collection: id_namespace, local_id, name, description
subject -- file_describes_subject -- file: linked file_format from vocab table
Did you intentionally leave out the file format and assay type from those last two tables, or will you probably want to add those too if you think about it...?
Just bad at things. Editing now
This submission adds searchbox customization for the biosample and subject tables too: https://app-dev.nih-cfde.org/chaise/record/#registry/CFDE:datapackage/RID=99M
Browse at https://app-dev.nih-cfde.org/chaise/recordset/#294/CFDE:file
This looks good. I like that the text in the search box is changed from 'search all columns' to go with it, and it is giving me reasonable results.
@ACharbonneau Do you think it makes sense to only support "provenance" keywords for biosample and subject search boxes? E.g. subject only includes subject/project/collection matching (no biosample nor file metadata) and biosample includes biosample/subject/project/collection (no file metadata)?
Sorry I completely missed this question.
I think that the concept of provenance as you're using it is really driven by the model and how connections are made between tables, and I think that is a reasonable way to think about it from an engineering perspective, but I expect that the only people who will ever look at our model are us, and DCCs trying to build datapackages. I wouldn't expect that users coming to the portal would have any idea what our underlying connections are, or that they've necessarily thought deeply about what concepts make sense to search from others. They will, I think, expect that they can make cohorts of data based on connections that make sense in the context of a study. So for example, I would want to filter my potential subjects by file_type, i.e. finding subjects that fit some set of biological criteria and then only getting the ones that have CRAMs, because then I can use my workflow that accepts CRAMs as input.
All that said, I can see it being a thing that confuses users, especially since it's filtering on criteria you can't otherwise see in that page. But given that we don't have any users yet besides me and the testing team, I only really have my opinion to go on, and I like being able to search the broader metadata. If we get other feedback, I'm happy to reconsider.
Changes for improved searchbox behavior have been briefly tested on dev and merged for inclusion in upcoming releases.
@ACharbonneau I understand your UX concern so the changes are symmetric in providing connected file and biosample metadata when searching subjects, connected biosample and subject metadata when searching files, and connected file and subject metadata when searching biosamples. However, my definition of "connected metadata" is the vocabulary terms associated with connected entities. I did not include the unique identifiers/properties of connected entities. So, the searchbox can recognize identifiers for subjects while searching subjects, or controlled terms such as file format, data type, assay type, or anatomy. But, as currently configured, it will not find a subject based on a biosample identifier, file identifier, checksum, nor file name. Please continue commenting here if you think this needs adjustment or find other problems during testing of this new feature...
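The scope described here can be sketched roughly as below. This is not the deployed configuration; the records, field names, and helpers are hypothetical and only mirror the rule that connected vocabulary terms are searchable while connected identifiers and properties are not:

```python
# Hypothetical subject with connected file and biosample records.
subject = {"local_id": "SUBJ-001"}
connected_files = [
    {"local_id": "FILE-42", "filename": "reads.cram",
     "file_format": "CRAM", "data_type": "Sequence alignment"},
]
connected_biosamples = [
    {"local_id": "BS-7", "anatomy": "stomach"},
]


def subject_search_text(subject, files, biosamples):
    """Own identifier plus vocabulary terms of connected entities only."""
    text = [subject["local_id"]]
    for file_row in files:
        # Controlled terms are included; local_id and filename are not.
        text += [file_row["file_format"], file_row["data_type"]]
    for biosample in biosamples:
        text.append(biosample["anatomy"])
    return text


def subject_matches(keyword):
    """Case-insensitive substring match over the subject's searchable text."""
    needle = keyword.lower()
    text = subject_search_text(subject, connected_files, connected_biosamples)
    return any(needle in value.lower() for value in text)


print(subject_matches("SUBJ-001"))  # the subject's own identifier
print(subject_matches("cram"))      # a connected file format term
print(subject_matches("FILE-42"))   # a connected file identifier
```

Under this rule a subject can be found by the file format of its files, but not by the name or identifier of any particular file.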
If I go to a page such as https://app-staging.nih-cfde.org/chaise/recordset/#1/CFDE:biosample@sort(RID) and search by 'stomach' I get zero results, even though 'stomach' was clearly in the first row:
It appears that Deriva is displaying the content of 'Name' from the anatomy table, but only searching the 'ID' column. This means that for a user to use "search by all columns" they first have to go and look up the Uberon ID for stomach:
This is really cumbersome and unintuitive.