martsberger / django-pivot

A module for pivoting Django Querysets
MIT License
209 stars 16 forks source link

Aggregation on annotated field fails if queryset ordered by other values + set default value #5

Closed PJCampi closed 11 months ago

PJCampi commented 6 years ago

Hey Marts,

First of all thanks for the package, it simple but very useful :)

I have a couple of comments/questions:

1. Aggregation on annotated field I have noticed that you do not reorder the QuerySet before aggregation, which causes problems when aggregating on an annotated field, if it is already ordered by another field.

For example, consider the model:

class Blog(models.Model):
    date = models.DateField()
    category = models.CharField(max_length=50)
    content = models.TextField(default='')

    class Meta:
        order_by = ['date']

If I pivot on an annotated field like:

qs = Blog.objects.annotate(period=Trunc('date', 'year'))
pivot(qs, 'period', 'category', 'content', Count)

Aggregation does not work because the 'date' field must be part of the query's group by for the ordering to work.

Reordering fixes the problem, so this will work:

qs = Blog.objects.annotate(period=Trunc('date', 'year')).order_by('period')
pivot(qs, 'period', 'category', 'content', Count)

I think it would make sense to implement this in the package.

2. Default value The default column value is None atm. Why not using 0? Alternatively could one set the default value at module level? Something like:

def _get_annotations(column, column_values, data, aggregation):
    value = data if hasattr(data, 'resolve_expression') else F(data)
    return {
        display_value: aggregation(Case(When(Q(**{column: column_value}), then=value), default=DEFAULT_VALUE))
        for column_value, display_value in column_values
    }

3. Cumulative aggregation It would be fairly easy to allow for cumulative aggregation as well, which can be useful. It can be done using the window function (+distinct):

window = Window(aggregator(data), order_by=row)

It does not work on sqlite though... I'd be happy to have a look at it if you think it could be useful.

martsberger commented 6 years ago

Thanks for your feedback, these are good comments!

#1 Aggregating annotated field and order_by

If it works to remove the ordering altogether, I could implement that, .e.g., if

qs = Blog.objects.annotate(period=Trunc('date', 'year')).order_by()
pivot(qs, 'period', 'category', 'content', Count)

fixes the issue, I could make the pivot function remove the ordering. I think this should work as the result of pivoting should be independent of order_by.

If a specific ordering is required in this case it's a bit trickier, because adding an ordering that isn't necessary in other cases would degrade performance in those cases.

#2 Default value There is a difference between None and 0 that is meaningful. If I find a bunch of records to sum up and the sum is zero, I should return the value 0. If I look for records to sum up and find no records at all, this is different. If I report 0 in the second case I'm indicating I found records that sum to zero, while if I return None I'm indicating no records were found.

Granted, in some cases people prefer to have the number 0 even in the second case I described above. There are two available ways to achieve this

pivot(qs, 'period', 'category', 'content', CountDefaultZero)

Possibly allowing a default passed into the pivot function would be nicer:
```python
pivot(qs, 'period', 'category', 'content', Count, default=0)

or maybe it unnecessarily clutters the api. I'll give it some thought.

#3 Cumulative aggregation Does MySQL support the window function or just postgres? I'm okay with having db specific extensions, just need to think about the right way to do it. Is it possible to pass in an aggregation that does this?

If you have an implementation in mind, I'd be interested in seeing it.

PJCampi commented 6 years ago

1: order_by() works and it is better than using an arbitrary ordering.

2: that's what I figured, and that's what Excel pivot does by default too (I think).

It's a valid argument for Sum, less so for e.g. Count. In practice, like in your shirt sales example, if a shirt of a particular style has not been sold, it will be represented with no row present in an inventory table. The total income for that style should ideally return 0eur.

3# I think that most SQL languages support the OVER (PARTITION BY x ORDER BY y) clause which is used by the Window function (see here for MySql). I have not tested it though, I use Postgres.

You cannot pass it as an aggregation because you need to call distinct() afterwards. Basically instead of:

qs.annotate(**{column: aggregator(data)})

you need to call:

qs.annotate(**{column: Window(aggregator(data), order_by: F(row).asc())}).distinct()

It also requires some kind of support for passing ordering on the row field. Let me know if you want something more concrete.

martsberger commented 5 years ago

Opened https://github.com/martsberger/django-pivot/issues/9 to address point 2 above.

martsberger commented 11 months ago

I opened #31 for cumulative aggregation. With that, I'm going to close this issue as all the other parts have been resolved.