Open mortonjt opened 9 years ago
Probably should clarify.
An example where this problem blows up is when you are dealing with categorical distance matrices. When you have a categorical metadata variable and you want to generate a distance matrix from it, you actually need to generate multiple distance matrices.
For example, say that there is a metadata variable sex that has 3 categories: male, female and other. 3 more binary indicator variables have to be created, namely:
ismale = (sex == 'male')
isfemale = (sex == 'female')
isother = (sex == 'other')
Then from each pair of those indicator variables, I can create a DistanceMatrix. Labels have to be generated as these distance matrices are being created. But imagine that there are several dozen categorical variables that expand into hundreds of indicator variables. Keeping track of all of those labels quickly becomes unwieldy.
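To make the bookkeeping concrete, here is a minimal pure-Python sketch of the situation described above. It is not scikit-bio code: the sample data, the `indicator_distances` helper and the `"sex:male"`-style labels are all made up for illustration, and it assumes one 0/1 mismatch matrix per indicator variable, which may differ from the exact construction intended here.

```python
# Illustrative data (assumed): one categorical variable over four samples.
sample_ids = ["s1", "s2", "s3", "s4"]
sex = ["male", "female", "other", "male"]


def indicator_distances(values):
    """Hypothetical helper: 0.0 where two samples share the indicator
    value, 1.0 where they differ."""
    return [[0.0 if a == b else 1.0 for b in values] for a in values]


# One matrix per category, with the label tracked by hand in a dict key.
matrices = {}
for category in sorted(set(sex)):
    indicator = [value == category for value in sex]
    matrices[f"sex:{category}"] = indicator_distances(indicator)
```

With several dozen variables, every one of these hand-built keys has to be carried alongside its matrix through the rest of the analysis.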
If some instance variable, say Distance.name, becomes available, then this problem disappears.
When creating categorical distance matrices, I can couple the distance matrix with an identifier in the name variable, so that I don't have to keep track of which distance matrix corresponds to which categorical subvariable.
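A rough sketch of what that coupling could look like, using a small stand-in class rather than the real DistanceMatrix. `NamedDistanceMatrix`, its `name` field and `categorical_distance_matrices` are hypothetical names invented for this example, not part of scikit-bio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NamedDistanceMatrix:
    """Hypothetical stand-in, NOT scikit-bio's DistanceMatrix."""
    data: List[List[float]]
    ids: List[str]
    name: str = ""  # the proposed identifier travels with the matrix


def categorical_distance_matrices(variable, values, ids):
    """Build one named 0/1 mismatch matrix per category (assumed construction)."""
    result = []
    for category in sorted(set(values)):
        indicator = [v == category for v in values]
        data = [[0.0 if a == b else 1.0 for b in indicator] for a in indicator]
        result.append(
            NamedDistanceMatrix(data, list(ids), name=f"{variable}=={category}")
        )
    return result


dms = categorical_distance_matrices(
    "sex", ["male", "female", "other", "male"], ["s1", "s2", "s3", "s4"]
)
names = [dm.name for dm in dms]
```

Downstream code can then read the name off each matrix instead of consulting a separate lookup table.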
:+1:
Another example of where this would be useful is pwmantel and mrm (#1104), which have an optional labels parameter for labeling distance matrices in the results.
We could solve this more generally by supporting arbitrary metadata on DistanceMatrix, similar to scikit-bio's sequence classes and (soon to be) TabularMSA. Then functions that need to name/label distance matrices can let the user decide what piece of metadata to use (or just specify their own labeling). This is similar to TabularMSA having an optional key parameter that lets users label sequences in a general way (with shortcuts in place for keying via sequence metadata).
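One way such a key-style option might behave, sketched in plain Python. Everything here is an assumption for illustration: `resolve_labels` is a hypothetical helper, and the matrices are represented by simple objects with a `metadata` dict rather than real DistanceMatrix instances.

```python
from types import SimpleNamespace


def resolve_labels(matrices, key=None):
    """Hypothetical helper: pick a label for each matrix.

    key=None     -> fall back to positional labels "0", "1", ...
    key=callable -> call it on each matrix
    key=other    -> look it up in each matrix's metadata dict
    """
    labels = []
    for index, dm in enumerate(matrices):
        if key is None:
            labels.append(str(index))
        elif callable(key):
            labels.append(key(dm))
        else:
            labels.append(dm.metadata[key])
    return labels


# Stand-in objects carrying arbitrary metadata (assumed structure).
dms = [
    SimpleNamespace(metadata={"name": "sex==male", "variable": "sex"}),
    SimpleNamespace(metadata={"name": "sex==female", "variable": "sex"}),
]

by_name = resolve_labels(dms, key="name")
by_func = resolve_labels(dms, key=lambda dm: dm.metadata["variable"].upper())
default = resolve_labels(dms)
```

The metadata shortcut covers the common case, while the callable form leaves the labeling scheme entirely up to the user.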
That seems like a very reasonable solution.
@mortonjt why not just issue your_dm.name = 'foo'? Is this still an issue in practice?
I'm beginning to see this problem crop up in multiple places where it's becoming difficult to keep track of which metadata variables correspond to which DistanceMatrices.
It may be a good idea to consider adding an optional string identifier to the DistanceMatrix / DissimilarityMatrix constructor so that one doesn't have to do additional bookkeeping to identify which DistanceMatrix belongs to which variable.
Definitely think we should discuss this when we refactor the DistanceMatrix object.