ncihtan / htan-portal

The HTAN Data Portal
https://humantumoratlas.org

Enable display and search of channel names #636

Open adamjtaylor opened 6 months ago

adamjtaylor commented 6 months ago

Objective:

Implement a feature on the HTAN portal to display harmonized target names for multiplexed tissue imaging data. This aims to assist researchers in easily locating and identifying datasets with specific antibody markers.

User stories:

As a cancer researcher interested in HTAN multiplexed tissue imaging data, I want to view a list of antibody targets and channels for images on the HTAN portal and use filters to search for datasets based on these attributes, so that I can easily locate datasets with specific markers relevant to my research.

As a cancer researcher, I want to identify HTAN imaging datasets in which CD45, CD8, and CD4 were targeted with antibodies, so that I can specifically identify cytotoxic and helper T cell populations for my studies.

Background:

Currently, channel metadata is not easily exposed to or searchable by users. It was also not validated at ingestion, so it is poorly structured. @adamjtaylor is exploring an LLM approach with Llama 3 for harmonizing target names that seems promising. To support this work and provide an MVP solution for users, this issue focuses on creating a method to display these names effectively on the portal.

For the MVP:

Looking Ahead:

Eventually, we want to incorporate these target names directly into the dataset metadata. Starting with this simpler display feature will help us lay the groundwork for future enhancements.

adamjtaylor commented 6 months ago

@inodb let's have a quick think about which mapping file setup would be best, and about any backend changes needed to enable this. I am hoping this is simply a join operation between the mapping file and the master JSON.

adamjtaylor commented 6 months ago

One option would be a mapping file like this:

{
  "syn1234": ["Target1", "Target2"],
  "syn53284675": ["DNA", "CD8", "CD45", "CD4", "Ki-67"]
}

I think this is extensible enough: we can start with the original, as-provided target names and switch to harmonized ones in due course.
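As a rough illustration of the join being discussed, here is a minimal Python sketch that attaches target names from a mapping of this shape to entity records and then filters on a set of targets. It is not the portal's actual code: the `entities` records, the `synapseId` field name, the filenames, and both helper functions are assumptions made up for the example.

```python
import json

# Mapping in the shape proposed above: Synapse entity ID -> list of target names.
MAPPING_JSON = """
{
  "syn1234": ["Target1", "Target2"],
  "syn53284675": ["DNA", "CD8", "CD45", "CD4", "Ki-67"]
}
"""

# Hypothetical stand-in for records from the master JSON.
# The "synapseId" field name and the filenames are assumptions for illustration.
entities = [
    {"synapseId": "syn1234", "filename": "image_a.ome.tiff"},
    {"synapseId": "syn53284675", "filename": "image_b.ome.tiff"},
    {"synapseId": "syn9999", "filename": "image_c.ome.tiff"},
]


def attach_targets(entities, mapping):
    """Left join: keep every entity; entities absent from the mapping get []."""
    return [
        {**entity, "targets": mapping.get(entity["synapseId"], [])}
        for entity in entities
    ]


def with_all_targets(entities, wanted):
    """Keep only entities whose targets include every name in `wanted`."""
    wanted = set(wanted)
    return [e for e in entities if wanted <= set(e["targets"])]


mapping = json.loads(MAPPING_JSON)
joined = attach_targets(entities, mapping)
hits = with_all_targets(joined, ["CD45", "CD8", "CD4"])
print([e["synapseId"] for e in hits])
```

Because the join only reads the list stored under each ID, swapping as-provided names for harmonized ones later would not change this code, only the contents of the mapping file.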

adamjtaylor commented 6 months ago

The following BigQuery query gets us a table close to what we need:

-- One row per released entity, with its channel names aggregated into one string
SELECT
    e.entityId,
    cm.Channel_Metadata_ID,
    STRING_AGG(attribute.attributeValue, ", ") AS channel_names
FROM
    `htan-dcc.ISB_CGC_r5.channel_metadata` cm,
    UNNEST(cm.channel_attributes) AS attribute
INNER JOIN
    `htan-dcc.released.entities_v5_1` e ON cm.Channel_Metadata_ID = e.channel_metadata_synapseId
WHERE
    attribute.attributeName = 'Channel Name'
    -- Skip the generic colour channel labels
    AND attribute.attributeValue NOT IN ('Red', 'Green', 'Blue')
GROUP BY
    cm.Channel_Metadata_ID, e.entityId

[Image: Screenshot 2024-05-08 at 4 14 47 PM]

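To connect the query output to the mapping file format proposed earlier, the following Python sketch turns rows with the three columns selected above into an `entityId -> [channel names]` dictionary. The row values, including the `Channel_Metadata_ID` values, are made up, and `rows_to_mapping` is a hypothetical helper, not existing portal code.

```python
# Hypothetical rows shaped like the query output (entityId, Channel_Metadata_ID,
# channel_names); the values are invented for illustration.
rows = [
    {
        "entityId": "syn1234",
        "Channel_Metadata_ID": "syn111",
        "channel_names": "Target1, Target2",
    },
    {
        "entityId": "syn53284675",
        "Channel_Metadata_ID": "syn222",
        "channel_names": "DNA, CD8, CD45, CD4, Ki-67",
    },
]


def rows_to_mapping(rows):
    """Build {entityId: [channel names]} from aggregated query rows.

    Names are split on commas, stripped of whitespace, and de-duplicated
    while keeping their first-seen order.
    """
    mapping = {}
    for row in rows:
        names = mapping.setdefault(row["entityId"], [])
        for name in row["channel_names"].split(","):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return mapping


mapping = rows_to_mapping(rows)
print(mapping["syn53284675"])
```

One caveat with this approach: splitting the aggregated string on commas would break any channel name that itself contains a comma, so aggregating into an array on the query side may be the safer choice if such names exist.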
adamjtaylor commented 5 months ago

@inodb I'd like to move forward with discussing how to implement this on the portal side so I can ensure outputs are prepared correctly.

inodb commented 5 months ago

@adamjtaylor the BigQuery table looks good to me! We already have a way to pull from BigQuery directly and store the result, so I don't think you need to provide anything else.

adamjtaylor commented 5 months ago

OK, so I will look to push a new table back to BigQuery that has entityId, Channel_Metadata_ID, and a new column, harmonized_channel_names.

I'll point you to that once it is complete.
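
For illustration only, the sketch below shows how a `harmonized_channel_names` column could sit alongside the raw names in such a table. It deliberately uses a plain synonym lookup rather than the LLM approach mentioned earlier in this thread; the `SYNONYMS` table, the sample row, and the helper functions are all hypothetical.

```python
import re

# Hypothetical synonym table: normalized raw name -> harmonized name.
# A real table would come out of the harmonization work, not be hard-coded.
SYNONYMS = {
    "cd8": "CD8",
    "cd45": "CD45",
    "ki67": "Ki-67",
    "dapi": "DNA",
}


def normalize(name):
    """Lower-case a raw channel name and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def harmonize(raw_names, synonyms):
    """Map each raw name to its harmonized form; unknown names pass through."""
    return [synonyms.get(normalize(raw), raw.strip()) for raw in raw_names]


# Hypothetical row with the columns named above; the values are invented.
row = {
    "entityId": "syn1234",
    "Channel_Metadata_ID": "syn111",
    "channel_names": "DAPI, CD-8, cd45, Ki67, PanCK",
}

raw_names = [n.strip() for n in row["channel_names"].split(",")]
row["harmonized_channel_names"] = ", ".join(harmonize(raw_names, SYNONYMS))
print(row["harmonized_channel_names"])
```

Passing unknown names through unchanged keeps the raw value visible, so unharmonized channels remain searchable while the synonym coverage grows.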