scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License

Adding optional identifiers to DistanceMatrix #1108

Open mortonjt opened 9 years ago

mortonjt commented 9 years ago

I'm beginning to see this problem crop up in multiple places, where it's becoming difficult to keep track of which metadata variables correspond to which DistanceMatrices.

It may be a good idea to consider adding an optional string identifier to the DistanceMatrix / DissimilarityMatrix constructor so that one doesn't have to do additional bookkeeping to identify which DistanceMatrix belongs to which variable.

Definitely think we should discuss this when we refactor the DistanceMatrix object.

mortonjt commented 9 years ago

Probably should clarify.

An example where this problem blows up is when you are dealing with categorical distance matrices. When you have a categorical metadata variable and you want to generate a distance matrix from it, you actually need to generate multiple distance matrices.

For example, say that there is a metadata variable sex that has 3 categories: male, female and other. 3 more binary indicator variables have to be created, namely:

ismale = (sex == 'male')
isfemale = (sex == 'female')
isother = (sex == 'other')

Then from each pair of those indicator variables, I can create a DistanceMatrix.

Labels have to be generated as these distance matrices are being created. But imagine that there are several dozen categorical variables that expand into hundreds of indicator variables. Keeping track of all of those labels quickly becomes unwieldy.

If some instance variable, say Distance.name, becomes available, then this problem disappears. When creating categorical distance matrices, I can couple each distance matrix with an identifier in the name variable, so that I don't have to keep track of which distance matrix corresponds to which categorical subvariable.
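
To make the idea concrete, here is a minimal sketch in plain Python. It is not scikit-bio code: `NamedMatrix` and `indicator_matrices` are hypothetical names used only for illustration. It also assumes one matrix per indicator variable, and a 0/1 distance between samples depending on whether they agree on that indicator.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NamedMatrix:
    """Hypothetical stand-in for a distance matrix that carries its own name."""
    name: str
    ids: List[str]
    data: List[List[float]]


def indicator_matrices(variable: str,
                       values: Sequence[str],
                       ids: Sequence[str]) -> List[NamedMatrix]:
    """Build one named matrix per category of a categorical variable."""
    matrices = []
    for category in sorted(set(values)):
        # Binary indicator, e.g. ismale = (sex == 'male')
        indicator = [int(v == category) for v in values]
        # Assumed distance: 0 if two samples agree on the indicator, else 1
        data = [[float(abs(a - b)) for b in indicator] for a in indicator]
        # The name travels with the matrix, so no separate label list is needed
        matrices.append(NamedMatrix(f"{variable}=={category}", list(ids), data))
    return matrices


sex = ["male", "female", "other", "male"]
sample_ids = ["s1", "s2", "s3", "s4"]
dms = indicator_matrices("sex", sex, sample_ids)
names = [dm.name for dm in dms]
```

With the identifier stored on the object, a downstream function can read `dm.name` directly instead of relying on a parallel list of labels staying in sync.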

jairideout commented 9 years ago

:+1:

Another example of where this would be useful is pwmantel and mrm (#1104), which have an optional labels parameter for labeling distance matrices in the results.

We could solve this more generally by supporting arbitrary metadata on DistanceMatrix, similar to scikit-bio's sequence classes and (soon to be) TabularMSA. Then functions that need to name/label distance matrices can let the user decide what piece of metadata to use (or just specify their own labeling). This is similar to TabularMSA having an optional key parameter that lets users label sequences in a general way (with shortcuts in place for keying via sequence metadata).
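
A rough sketch of how that could look, in plain Python rather than scikit-bio itself. `MatrixWithMetadata` and `resolve_labels` are hypothetical names, and the `key` argument here only mirrors the idea described above; it is not an existing parameter of pwmantel or mrm.

```python
from typing import Any, Dict, List, Optional, Sequence


class MatrixWithMetadata:
    """Hypothetical stand-in for a distance matrix with arbitrary metadata."""

    def __init__(self, data: List[List[float]],
                 metadata: Optional[Dict[str, Any]] = None):
        self.data = data
        self.metadata = dict(metadata or {})


def resolve_labels(matrices: Sequence[MatrixWithMetadata],
                   labels: Optional[Sequence[str]] = None,
                   key: Optional[str] = None) -> List[str]:
    """Pick labels from explicit labels, a metadata key, or positional defaults."""
    if labels is not None:
        # The user specifies their own labeling
        if len(labels) != len(matrices):
            raise ValueError("Number of labels must match number of matrices.")
        return list(labels)
    if key is not None:
        # The user chooses which piece of metadata acts as the label
        return [str(m.metadata[key]) for m in matrices]
    # Fall back to positional labels when nothing else is given
    return [str(i) for i in range(len(matrices))]


dm_a = MatrixWithMetadata([[0.0, 1.0], [1.0, 0.0]], {"name": "sex==male"})
dm_b = MatrixWithMetadata([[0.0, 0.5], [0.5, 0.0]], {"name": "sex==female"})

by_key = resolve_labels([dm_a, dm_b], key="name")
explicit = resolve_labels([dm_a, dm_b], labels=["x", "y"])
default = resolve_labels([dm_a, dm_b])
```

The point of the design is that the matrix class stays generic: it stores whatever metadata the user attaches, and each function that needs labels decides how to derive them.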

mortonjt commented 9 years ago

That seems like a very reasonable solution.

wasade commented 2 years ago

@mortonjt why not just issue your_dm.name = 'foo'? Is this still an issue in practice?