Revisit input formats

Terminology

Suppose we are interested in the toy colors in two stores S1 and S2. We count how many toys there are of each color:

S1 has 10 red toys, 12 blue, and 15 green.
S2 has 8 red toys, 20 blue, and 12 green.

The domain here is the set of colors, namely {red, blue, green}. In each store, every color is associated with a score; in S1, for example, red=10, blue=12, green=15.

Ranking order

Scores induce a ranking of colors in each store, namely <green, blue, red> in S1, and <blue, green, red> in S2. That is, scores induce a rank for each color in the domain: in S1 we have ranks red=3rd, blue=2nd, green=1st, and in S2 we have ranks red=3rd, blue=1st, green=2nd.

Ranks represent the ranking order of colors by number of toys. Note, however, that we have been sorting colors in descending order; we could just as well be interested in the ranking in ascending order. Ranking S1 in ascending order would yield <red, blue, green>, or ranks red=1st, blue=2nd, green=3rd.

It is therefore important to note that the ranking order is implicit when we use ranks, because ranks have a natural order (1st goes before 2nd, which goes before 3rd, etc.). With scores, however, the ranking order can only be inferred once we know the sorting direction.
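To make the role of the direction concrete, here is a minimal sketch in plain Python (not ircor code; the name scores_to_ranks is made up for illustration, and ties are not handled) that turns the scores of the toy example into ranks:

```python
def scores_to_ranks(scores, decreasing=True):
    """Return the rank (1 = first) of each item, keeping the representation order.

    Hypothetical helper for illustration only; assumes there are no tied scores.
    """
    # Indices of the items, sorted by score in the requested direction.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=decreasing)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Representation order: red, blue, green
s1 = [10, 12, 15]
s2 = [8, 20, 12]

print(scores_to_ranks(s1))                    # [3, 2, 1]
print(scores_to_ranks(s2))                    # [3, 1, 2]
print(scores_to_ranks(s1, decreasing=False))  # [1, 2, 3]
```

The same scores give different ranks depending on the direction, which is exactly the information that a score-based input does not carry by itself.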

When we represent the ranking directly using the color names or IDs, as in <blue, green, red> for S2, the ranking order is explicit in the representation.

Representation order

Note that so far we have been enumerating colors in the order red, blue, green, but this order is arbitrary. We could just as well represent the scores in S2 as <green=12, red=8, blue=20>, or in any other permutation. The representation order is therefore key when using scores or ranks, and it must be the same for S1 and S2, as otherwise we would not know how to match numbers between them. There is thus an arbitrary representation order that is implicitly shared by S1 and S2.

When we represent the ranking directly using the color names or IDs, the representation order matches the ranking order.
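As a rough illustration (plain Python, with a made-up helper name ranks_to_ids and assuming untied ranks), the ID-based form can be recovered from ranks as long as we know the representation order, and the result does not depend on which representation order was used:

```python
def ranks_to_ids(ranks, domain):
    """Return the items of `domain` sorted by rank (1 = first).

    Hypothetical helper: ranks[i] is the rank of domain[i]; ranks are assumed untied.
    """
    pairs = sorted(zip(ranks, domain), key=lambda pair: pair[0])
    return [item for _, item in pairs]


# Ranks in S2 under two different representation orders
print(ranks_to_ids([3, 1, 2], ["red", "blue", "green"]))  # ['blue', 'green', 'red']
print(ranks_to_ids([2, 3, 1], ["green", "red", "blue"]))  # ['blue', 'green', 'red']
```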

Summary

A ranking may be specified in three different ways:

ID-based specification: the representation order matches the ranking order.
Rank-based specification: the representation order is arbitrary, and the ranks themselves imply the ranking order.
Score-based specification: the representation order is arbitrary, and the scores imply the ranking order once the direction is known (i.e. ascending or descending).

	Representation order	Ranking order
Scores	Arbitrary	Implicit (direction required)
Ranks	Arbitrary	Implicit
IDs	Explicit	Explicit

We will therefore refer to score- and rank-based representations as implicit, and to ID-based representations as explicit.

ircor v1.0

The current implementation accepts implicit representations, and includes a decreasing argument to indicate the sorting direction when using scores. The default is decreasing = TRUE.

Implicit representations are more natural for correlation coefficients, as shown by standard implementations. Explicit representations are more natural for similarity measures such as RBO, and are actually the only viable representation of indefinite rankings. Thus, it seems reasonable to stick to these formats, even at the expense of a homogeneous interface across coefficients. But what about a homogeneous interface through conversions?

An issue is how to convert IDs to ranks: in the general non-conjoint case this cannot be done, unless we do it for pairs of rankings and assume a common domain formed by the union of the items in both rankings (e.g. <red, blue> and <purple, blue> would be assumed to be defined on the domain {red, blue, purple}, with rank representations <1, 2, NA> and <NA, 2, 1>). This is clearly problematic, because the rank representation of a ranking then depends on which other ranking we pair it with, so it is not unique. The reverse conversion, from ranks to IDs, does seem feasible.
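The pairwise conversion and its non-uniqueness can be sketched as follows. This is plain Python for illustration only: the function name ids_to_ranks_pair is hypothetical, None plays the role of NA, and the order of the union domain (items of the first ranking, then new items of the second) is an arbitrary choice.

```python
def ids_to_ranks_pair(a, b):
    """Convert two ID-based rankings to rank vectors over the union of their items.

    Hypothetical helper: items missing from a ranking get None.
    """
    # Union domain: items of `a` first, then items that only appear in `b`.
    domain = list(a) + [item for item in b if item not in a]

    def to_ranks(ranking):
        position = {item: i for i, item in enumerate(ranking, start=1)}
        return [position.get(item) for item in domain]

    return domain, to_ranks(a), to_ranks(b)


print(ids_to_ranks_pair(["red", "blue"], ["purple", "blue"]))
# (['red', 'blue', 'purple'], [1, 2, None], [None, 2, 1])

# The same ranking <red, blue> gets a different rank vector with another partner
print(ids_to_ranks_pair(["red", "blue"], ["blue", "red"]))
# (['red', 'blue'], [1, 2], [2, 1])
```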

So we have these possible conversions: score to rank and rank to ID. Rank to score would be arbitrary without further domain knowledge (i.e. the sorting direction), and ID to rank can only be done with conjoint rankings. On top of that, the vast majority of people using correlations will use an implicit representation, while the vast majority of people using RBO will use an explicit one. Therefore, it seems sensible to stick to an implicit format for conjoint rankings (e.g. tau and tauAP), and an explicit format for RBO. Conversions should be allowed when possible, such as to compute tau from explicit representations.

Another issue is how to properly encode ties in an explicit format, as it seems to require lists of lists and they are a pain to work with.
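For reference, this is roughly what the "lists of lists" encoding would look like. The sketch is plain Python with a hypothetical helper name, and the tied ranks below are made up for illustration (they do not come from the stores above); equal rank values mark a tie.

```python
from itertools import groupby


def ranks_to_tied_ids(ranks, domain):
    """Group the items of `domain` by rank, best group first.

    Hypothetical helper: items sharing the same rank value end up in the same group.
    """
    pairs = sorted(zip(ranks, domain), key=lambda pair: pair[0])
    return [
        [item for _, item in group]
        for _, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# Made-up example: red and green tied for first place, blue last
print(ranks_to_tied_ids([1.5, 3, 1.5], ["red", "blue", "green"]))
# [['red', 'green'], ['blue']]
```

Every function consuming this format would need to handle the nesting, which is the pain mentioned above.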

Scores vs Ranks

Typical implementations of correlation coefficients accept both scores and ranks, assuming that scores should be sorted in ascending order. To the function, then, there is no difference between scores and ranks once the input is re-ranked.
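A quick way to see this is with a bare-bones Kendall tau (tau-a, no ties), written here in plain Python as an illustration rather than as any library's implementation. Scores, ascending ranks, and descending ranks of the toy example all give the same value:

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long vectors, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Representation order: red, blue, green
from_scores = kendall_tau([10, 12, 15], [8, 20, 12])
from_asc_ranks = kendall_tau([1, 2, 3], [1, 3, 2])
from_desc_ranks = kendall_tau([3, 2, 1], [3, 1, 2])

print(from_scores, from_asc_ranks, from_desc_ranks)  # all three are equal (1/3)
```

Only the relative order within each pair of items matters for tau, so the direction is irrelevant as long as it is the same for both inputs.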

However, with top-weighted coefficients like tauAP one needs to be mindful of the sorting direction when using scores. By using the same interface for scores and ranks, we end up with inconsistencies such as "decreasing ranks", whatever that is. Having a default value for decreasing is risky, even if the default is a sensible one (e.g. effectiveness scores are ranked in decreasing order). It is also not a good idea to require it in every function, because the code would be redundant and it does not make sense for rank inputs anyway. Having to work with a sorting direction also makes the code harder to maintain and test (e.g. permutations).
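The following sketch shows why the direction matters for a top-weighted coefficient. It is plain Python, not ircor code, and it assumes the usual formulation of tauAP without ties: for every item below the top of x, take the fraction of items ranked above it in x that are also ranked above it in the reference y, average those fractions, and rescale to [-1, 1].

```python
def tau_ap(x_ranks, y_ranks):
    """AP correlation of x with respect to the reference y (ranks, 1 = top, no ties).

    Illustrative sketch under the assumptions stated above.
    """
    n = len(x_ranks)
    # Items (as indices) from top to bottom of x.
    order = sorted(range(n), key=lambda i: x_ranks[i])
    total = 0.0
    for pos in range(1, n):
        item = order[pos]
        # Items above `item` in x that the reference also places above it.
        correct = sum(1 for above in order[:pos] if y_ranks[above] < y_ranks[item])
        total += correct / pos
    return 2 * total / (n - 1) - 1


# Representation order: red, blue, green; S1 is the reference, S2 the estimate.
# Ranks when sorting scores in decreasing order
print(tau_ap([3, 1, 2], [3, 2, 1]))  # 0.0
# Ranks when sorting the same scores in increasing order
print(tau_ap([1, 3, 2], [1, 2, 3]))  # 0.5
```

Unlike tau, the result changes with the direction, because the direction decides which end of the ranking counts as the top.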

If we restrict the input format to ranks only, we eliminate the need to specify the sorting direction, but require score-based inputs to be converted to rank format when necessary. On the one hand, this is easy and we can even provide a nice function for it; on the other hand, it is an extra step we ask of users, who most likely deal with scores and not ranks. However, this conversion would need to be done just once, thus reducing clutter and the chances for error. It should also be noted that some rank correlation implementations ask for ranks too.

Proposal for v2.0

Coefficients on conjoint rankings (e.g. correlations) require implicit representations with ranks. Note that giving scores should work just fine too, as long as they are meant to be sorted in ascending order.
Coefficients on non-conjoint rankings (eg. RBO) require explicit representations with IDs.
Conversion functions should be available for:
- Score to rank, necessary when the sorting direction is decreasing.
- Rank to ID (with the possibility to specify the domain?)
- ID to rank, for the conjoint case.
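As a rough idea of what the last conversion could look like, here is a sketch in plain Python. The name ids_to_ranks and the decision to raise an error for non-conjoint input are assumptions made for illustration, not a settled design.

```python
def ids_to_ranks(ranking, domain):
    """Convert an ID-based ranking to ranks in the representation order of `domain`.

    Hypothetical helper: only defined for the conjoint case, where the ranking
    contains every item of the domain exactly once.
    """
    if len(ranking) != len(domain) or set(ranking) != set(domain):
        raise ValueError("ranking and domain must contain exactly the same items")
    position = {item: i for i, item in enumerate(ranking, start=1)}
    return [position[item] for item in domain]


domain = ["red", "blue", "green"]
print(ids_to_ranks(["green", "blue", "red"], domain))  # S1: [3, 2, 1]
print(ids_to_ranks(["blue", "green", "red"], domain))  # S2: [3, 1, 2]
```

Requiring the domain explicitly avoids the non-uniqueness problem of the pairwise conversion, at the cost of one more argument.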

julian-urbano / ircor

Revisit input formats #5