linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Decide how to handle duplicate gene names #87

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

So, I was just informed by @gioelelm that gene names are no longer unique per row. That... undermines my assumptions.

This doesn't explicitly break any code: when fetching genes, the first row that matches the gene name will be fetched. But I'm not sure if that's the desired behaviour.

There's the matter of fetched genes being stored in a dictionary, keyed by gene name. That would break of course, since duplicate names would overwrite each other. So perhaps I should rename the genes to include a numbering. That could clash with other gene names, but as long as I stick to a scheme that includes a space and a number in parentheses I think we'd be fine (e.g. Flg would become Flg (1), Flg (2), etc.).

There's also the possibility of fetching all matching rows and then summing the values as if they were a single measurement.

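So I see three options going forward:

• keep behaviour as is, silently ignoring all rows beyond the first matching the name
• fetch and sum all rows of matching genes
• rename duplicate names to numbered equivalents, with some safe numbering scheme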
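To make the trade-offs concrete, here is a rough Python sketch of the three approaches discussed above: taking the first matching row, summing all matching rows, and renaming duplicates to `Flg (1)`, `Flg (2)`. It is not the loom-viewer code. The toy data and the function names (`fetch_first`, `fetch_summed`, `numbered_names`) are made up for illustration, and plain lists stand in for the matrix rows.

```python
from collections import defaultdict

# Hypothetical toy data: gene names (with one duplicate) and one row of values each.
gene_names = ["Actb", "Flg", "Gapdh", "Flg"]
rows = [
    [5, 0, 2],
    [1, 3, 0],
    [7, 7, 1],
    [2, 0, 4],
]


def fetch_first(name):
    """Option 1: return the first row matching the name, ignoring later ones."""
    # list.index returns the first match and raises ValueError if the name is absent.
    return rows[gene_names.index(name)]


def fetch_summed(name):
    """Option 2: sum all rows matching the name, element by element."""
    matches = [row for gene, row in zip(gene_names, rows) if gene == name]
    if not matches:
        raise KeyError(name)
    return [sum(column) for column in zip(*matches)]


def numbered_names(names):
    """Option 3: rename duplicates to 'Name (1)', 'Name (2)', ...; unique names stay as-is."""
    totals = defaultdict(int)
    for name in names:
        totals[name] += 1
    seen = defaultdict(int)
    renamed = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name} ({seen[name]})")
        else:
            renamed.append(name)
    return renamed
```

With the renamed list, every name is unique again, so a dictionary keyed on it no longer loses rows. The catch is that this only holds as long as no real gene name already ends in a space followed by a number in parentheses.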

slinnarsson commented 7 years ago

But, we still have the equivalent of a primary key in the “Accession” field. This can be used when a single row is needed. However, for genes (“Gene”) I think fetching the first matching row is a pragmatic solution.

/Sten

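As a small illustration of the distinction drawn above, the following Python sketch looks up a row by the unique “Accession” attribute and, separately, by the first matching “Gene” name. It assumes the row attributes are available as plain lists in a dictionary. The accession strings and the helper names are invented for the example and are not taken from loom-viewer or from any real dataset.

```python
# Hypothetical row attributes; the accession strings are placeholders.
row_attrs = {
    "Gene": ["Actb", "Flg", "Gapdh", "Flg"],
    "Accession": ["ACC0001", "ACC0002", "ACC0003", "ACC0004"],
}


def row_index_by_accession(accession):
    """Return the index of the single row with this accession (the primary key)."""
    matches = [i for i, acc in enumerate(row_attrs["Accession"]) if acc == accession]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one row for {accession!r}, found {len(matches)}")
    return matches[0]


def row_index_by_gene(gene):
    """Return the index of the first row with this gene name, or None if absent."""
    for i, name in enumerate(row_attrs["Gene"]):
        if name == gene:
            return i
    return None
```

Here `row_index_by_gene("Flg")` always resolves to the first `Flg` row, while the second one stays reachable through its accession.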

On 1 Feb 2017, at 16:02, Job van der Zwan wrote:

So I see three options going forward

• keep behaviour as is, silently ignoring all rows beyond the first matching the name
• fetch and sum all rows of matching genes
• rename duplicate names to numbered equivalents, with some safe numbering scheme

JobLeonard commented 7 years ago

Ok, we'll settle for that then. Least work to do as well, it's already implemented that way :P