posit-dev / great-tables

Make awesome display tables using Python.
https://posit-dev.github.io/great-tables/
MIT License
1.42k stars 48 forks source link

Refactor `seq_groups()` to accept `Iterable` #365

Closed jrycw closed 1 month ago

jrycw commented 1 month ago

Our project is progressing steadily, and it's been an enjoyable journey as one of the contributors. However, I've noticed that we might be overly reliant on the list type in the codebase. While list is excellent and suitable for prototyping, it can become problematic as the codebase grows. Therefore, I propose replacing list with Generator where possible.

I have chosen seq_groups() as a starting point since it is currently functioning as a kind of Generator. Currently, we request the input as a list, but I believe we can improve this by refactoring it to accept an Iterable as input.

I hope this PR makes sense, and I welcome input from the team on this concept.

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 85.98%. Comparing base (e3649d2) to head (b2a98e0).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #365 +/- ## ========================================== + Coverage 85.90% 85.98% +0.08% ========================================== Files 41 41 Lines 4328 4339 +11 ========================================== + Hits 3718 3731 +13 + Misses 610 608 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jrycw commented 1 month ago

snakeviz

From a performance perspective, using an iterator might not offer significant gains. However, there are notable benefits. We avoid the need to convert data structures into lists just for indexing with []. By accepting any iterable, we maintain flexibility, a concept I learned from the itertools module. This approach can also save memory since we don't need to materialize everything into a list, even though memory consumption might not be a major concern for processing small data in our use case. By the way, thanks for introducing me to snakeviz, another new and useful package.

machow commented 1 month ago

These are all fair points, especially that memory usage tends to be better with iterators. I definitely think there will be cases where iterators lead to big improvements. The one thing I feel cautious about when considering iterators, is that developers tend to use lists, even in situations when they could have used iterators.

I think there's likely something about lists that people find a bit easier to understand, that tends to push them in that direction (whether it's the ease of seeing the data when debugging vs a stream, or something else). It seems to be on the desire-path for developers.

In any event, at this point I think @rich-iannone and I trust your judgement on which pieces could be improved by iterators. So definitely don't take this as discouraging changing anything where it seems like they'd be useful :)

jrycw commented 1 month ago

@machow, thanks for sharing your thoughts. I completely agree with your points. I would aim for a function that:

Under these conditions, it might be a good candidate to take an iterator as input and return an iterator as the output.