mc2-center / csbc-pson-dcc

Data coordination resources for the NCI CSBC and PS-ON consortia

Datasets with two consortium IDs #47

Closed andrewelamb closed 4 years ago

andrewelamb commented 4 years ago

These datasets have two consortium ids:

datasetId   datasetName                                                                                                 
  <chr>       <chr>                                                                                                       
1 syn21790809 An automatec microwell platform for large-scale single cell RNA-seq.                                        
2 syn21791505 RNA sequencing of lncRNAs knockdown in human pancreatic cancer cell lines                                   
3 syn21812572 RNA sequencing data                                                                                         
4 syn21811318 Single-cell microRNA-mRNA co-sequencing reveals non-genetic heterogeneity and mechanisms of microRNA regula…
5 syn21812592 Ex vivo Dynamics of Human Glioblastoma Cells in a Microvasculature-on-a-Chip System Correlates with Tumor H…
6 syn21813547 Single-cell integrative analysis of CAR-T cell activation reveals a predominantly TH1/TH2 mixed response in…

Is that correct?

bswhite commented 4 years ago

@andrewelamb , this could certainly happen in principle -- datasets and publications can be associated with multiple centers/grants and hence consortia. The first one is certainly correct -- it is associated with two different grants from Columbia, one of which is in CSBC and the other in PSON.

andrewelamb commented 4 years ago

OK, no problem, my code wasn't handling this correctly, and I wanted to make sure this was correct before making a fix.

andrewelamb commented 4 years ago

OK I lied, there is a problem! Datasets aren't associated directly with a consortium, but through a grant. The 5 I listed above are only associated with one grant each (as are all datasets, I believe). In other words, the way the database is set up, datasets have a many-to-one relationship with grants. To let a dataset belong to more than one grant, we would need to add a table called dataset_grant to capture the new many-to-many relationship. Then, instead of adding another consortium to a dataset, you would need to add another grant.
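To make the proposed dataset_grant table concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the table name dataset_grant comes from the comment above; the other table names, the column names, and the grant IDs (grant_A, grant_B) are placeholders assumed for illustration, not the actual schema.

```python
import sqlite3


def build_example_db():
    """Build an in-memory database with a dataset <-> grant junction table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE grants (
            grant_id   TEXT PRIMARY KEY,
            consortium TEXT NOT NULL
        );
        CREATE TABLE datasets (
            dataset_id TEXT PRIMARY KEY
        );
        -- Junction table: one row per (dataset, grant) pair.
        CREATE TABLE dataset_grant (
            dataset_id TEXT NOT NULL REFERENCES datasets(dataset_id),
            grant_id   TEXT NOT NULL REFERENCES grants(grant_id),
            PRIMARY KEY (dataset_id, grant_id)
        );
        """
    )
    # Placeholder grant IDs; the real grants are Synapse entities.
    conn.executemany(
        "INSERT INTO grants VALUES (?, ?)",
        [("grant_A", "CSBC"), ("grant_B", "PS-ON")],
    )
    conn.execute("INSERT INTO datasets VALUES (?)", ("syn21790809",))
    # The same dataset is linked to two grants.
    conn.executemany(
        "INSERT INTO dataset_grant VALUES (?, ?)",
        [("syn21790809", "grant_A"), ("syn21790809", "grant_B")],
    )
    return conn


def consortia_for_dataset(conn, dataset_id):
    """Return the distinct consortia reachable from a dataset via its grants."""
    rows = conn.execute(
        """
        SELECT DISTINCT g.consortium
        FROM dataset_grant AS dg
        JOIN grants AS g ON g.grant_id = dg.grant_id
        WHERE dg.dataset_id = ?
        ORDER BY g.consortium
        """,
        (dataset_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_example_db()
print(consortia_for_dataset(conn, "syn21790809"))
```

With this layout a dataset picks up a second consortium only by gaining a second row in dataset_grant, which matches the point that you add another grant rather than another consortium.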

bswhite commented 4 years ago

@andrewelamb the grantName and grantId columns are inconsistent. E.g., if you search for the first dataset with SELECT * FROM syn21897968 where "datasetName" LIKE '%An automatec microwell%', you'll see that it has two Columbia grants under grantName, but only one under grantId. This is because grantId is an Entity. When I tried to add multiple Synapse IDs in grantId, it complained that the value was not an entity. So I tried to edit the schema (using the web UI) to make this field a string. But it won't allow that, complaining:

Can not perform schema change on _LIST type columns for Table Entities

Sorry -- I forgot this came up in my late night hacking. I did something expedient and inconsistent. Yet another consequence of working late at night at deadlines ...

Upshot: I suspect the grantName columns are correct and grantId is not -- containing only one of the grants in grantName.

andrewelamb commented 4 years ago

@bswhite Yeah, the grantId column is just an entity, not a list. @jaeddy can you have entity list as a type? If not, we may need to change this to stringList (or even string).

I ran into the same error as you, Brian. James mentioned I needed to be in alpha mode to remove a column; maybe it's the same for changing the column type.

bswhite commented 4 years ago

@andrewelamb what is alpha mode and how do I get there? James mentioned this once, but I didn't follow up. We are having trouble editing tables (e.g., the datasets table) using the web UI -- I believe because of multi-value types/annotations. It would be fantastic if alpha mode was a way around this. Otherwise, I'm having to delete all rows in the table and re-upload it in its entirety. That's a pretty inconvenient way to update individual entries in individual rows.

andrewelamb commented 4 years ago

I don't entirely know what alpha mode entails, but I believe it lets you try out features that aren't fully tested yet. You activate it by going to the bottom right corner while logged into Synapse and clicking the green helmet button.

bswhite commented 4 years ago

Thanks -- that didn't solve my problem. I'll ask elsewhere.

bswhite commented 4 years ago

@jaeddy how should we handle multiple Synapse IDs in grantId? This is coming up for me in annotating.

Is there an entity list? Should we use a string list? Or just a string? The latter would assume we are accessing this field using LIKE, I believe.
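As a rough illustration of the trade-off with a plain string accessed via LIKE, the sketch below uses Python's sqlite3 module rather than Synapse itself, so Synapse's query behavior may differ. The table layout and the IDs (syn111, syn1112, syn222) are made up for the example. It shows that a substring match can also return rows whose ID merely contains the search term, whereas splitting the string and comparing whole values does not.

```python
import sqlite3


def build_db():
    """In-memory table with grant IDs stored as a comma-separated string."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE datasets (dataset_id TEXT, grant_ids TEXT)")
    conn.executemany(
        "INSERT INTO datasets VALUES (?, ?)",
        [
            ("dataset_A", "syn111, syn222"),  # really has grant syn111
            ("dataset_B", "syn1112"),         # different grant, shared prefix
        ],
    )
    return conn


def match_with_like(conn, grant_id):
    """Substring match: also returns IDs that merely contain grant_id."""
    rows = conn.execute(
        "SELECT dataset_id FROM datasets WHERE grant_ids LIKE ? "
        "ORDER BY dataset_id",
        (f"%{grant_id}%",),
    ).fetchall()
    return [row[0] for row in rows]


def match_exact(conn, grant_id):
    """Split the string and compare whole IDs, as a list type would."""
    rows = conn.execute(
        "SELECT dataset_id, grant_ids FROM datasets ORDER BY dataset_id"
    ).fetchall()
    return [
        dataset_id
        for dataset_id, grant_ids in rows
        if grant_id in {part.strip() for part in grant_ids.split(",")}
    ]


conn = build_db()
print(match_with_like(conn, "syn111"))  # both datasets match
print(match_exact(conn, "syn111"))      # only dataset_A matches
```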

grantName is currently a string. It, too, will have multiple grants. Should this be changed to string list? Probably. There are 3 datasets that show up as having grantName "Center for Cancer Systems Therapeutics (CaST), Columbia University Center for Topology of Cancer Evolution and Heterogeneity" -- whereas they should instead be associated with the two grants "Center for Cancer Systems Therapeutics (CaST)" and "Columbia University Center for Topology of Cancer Evolution and Heterogeneity"

jaeddy commented 4 years ago

@bswhite - the way I've been treating multi-value columns is to only convert to List-types for things we need to facet; for everything else, I'm using a standard comma-separated string (or "|" for institutions, which may have a comma in their name).
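For code that reads these delimited columns, a small helper along the following lines might be enough. The function name split_multi_value is hypothetical, and the institution string is an invented example of a name containing a comma; the grant names are the two mentioned earlier in this thread.

```python
def split_multi_value(value, sep=","):
    """Split a delimited multi-value string into a list of trimmed items.

    Empty or missing values yield an empty list; blank items are dropped.
    """
    if not value:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


# Comma-separated grant names (values taken from this thread).
grant_names = split_multi_value(
    "Center for Cancer Systems Therapeutics (CaST), "
    "Columbia University Center for Topology of Cancer Evolution and Heterogeneity"
)

# Pipe-separated institutions, so a comma inside a name survives
# (made-up example value).
institutions = split_multi_value(
    "University of California, San Francisco | Columbia University", sep="|"
)

print(grant_names)
print(institutions)
```

Note that the comma convention only works as long as no individual value contains a comma, which is the reason given above for using "|" with institutions.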

You raise a good point though that, even if we're not exposing a column as a facet for viz/filtering on the explore page, we might still use it for querying under the hood (e.g., to link to details pages). If we can identify those cases, I can help convert grantId and others to stringList (there isn't currently an entityList type, but I think string should still be fine).

For grantName, I haven't made those true lists yet because they're so long. The way Synapse estimates maximum row size currently depends only on the length of the strings in the array, and so blows up pretty quickly. Ziming has added a fix that also lets you specify maxListLength for a column, so the estimates are more reasonable. Still, it'd probably be good to make the change in #34 before we try to use grants in lists.

jaeddy commented 4 years ago

Fixed grantId column in the merged table for datasets with multiple grants. Also made the grant (aka grantNumber) a STRING_LIST and facet, with #34 in mind.

I added dataset <-> grant associations to this table.