mansenfranzen / pywrangler

Advanced data wrangling for python
https://github.com/mansenfranzen/pywrangler
MIT License

Allow naive iids for interval identifier #16

Closed mansenfranzen closed 4 years ago

mansenfranzen commented 5 years ago

Currently the result of the interval identifier distinguishes between valid and invalid intervals. All invalid intervals are assigned 0 by definition whereas all valid intervals are enumerated starting with 1. However, in some cases, it is useful to have an enumeration regardless of valid/invalid intervals.

The naive enumeration should also be less computationally intensive and could be exposed as an optional keyword argument, for example enumeration="strict" to distinguish invalid from valid intervals and enumeration="simple" to enumerate intervals regardless of validity. An alternative naming proposal is mark_invalid=True/False. The simple/naive enumeration does not need to increase in steps of one (1, 2, 3 ...). Any increasing sequence suffices (like 1, 3, 4, 6 ...).

Test data example


start=0
end=1
noise=-1

# cols:     order, groupby, marker, iid
data =     [[1,     1,       noise,  0],
            [2,     1,       start,  1],
            [3,     1,       end,    1],
            [4,     1,       noise,  2],

            [5,     2,       start,  1],
            [6,     2,       noise,  1],
            [7,     2,       end,    1],
            [8,     2,       noise,  2],
            [9,     2,       noise,  2],
            [10,    2,       start,  3],
            [11,    2,       noise,  3],
            [12,    2,       end,    3],
            [13,    2,       start,  4],
            [14,    2,       end,    4]]
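
To make the expected iid column above concrete, here is a minimal pure-Python sketch that reproduces it. It is not pywrangler's implementation. The function name `naive_iids` and the rule it applies (bump the counter when a row carries the start marker or directly follows an end marker) are assumptions inferred from the example data.

```python
from itertools import groupby
from operator import itemgetter

START, END, NOISE = 0, 1, -1

def naive_iids(rows):
    """Hypothetical helper: rows are (order, group, marker) tuples.

    Returns one iid per row, in (group, order) sorted order.
    """
    result = []
    ordered = sorted(rows, key=itemgetter(1, 0))
    for _, group_rows in groupby(ordered, key=itemgetter(1)):
        iid = 0          # counter restarts for every group
        previous = None  # marker of the preceding row in this group
        for _, _, marker in group_rows:
            # A new segment begins at a start marker or right after an end marker.
            if marker == START or previous == END:
                iid += 1
            result.append(iid)
            previous = marker
    return result

rows = [(1, 1, NOISE), (2, 1, START), (3, 1, END), (4, 1, NOISE),
        (5, 2, START), (6, 2, NOISE), (7, 2, END), (8, 2, NOISE),
        (9, 2, NOISE), (10, 2, START), (11, 2, NOISE), (12, 2, END),
        (13, 2, START), (14, 2, END)]
expected = [0, 1, 1, 2, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4]
iids = naive_iids(rows)
```

Here `iids` equals the iid column of the test data, with no distinction between valid and invalid intervals.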
mansenfranzen commented 4 years ago

Another benefit of naive iids is better performance, because re-enumerating from 1 to n with invalid intervals assigned 0 is an expensive computation (especially in Spark, where it needs additional window functions).
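
As a rough illustration of why the naive variant can be cheaper, the sketch below computes iids for one already-ordered group with a single running sum and no re-enumeration step. It is an assumption-laden example, not the library's code: `gapped_iids` is a hypothetical name, and the ids may skip values, which the proposal explicitly allows.

```python
from itertools import accumulate

START, END, NOISE = 0, 1, -1

def gapped_iids(markers):
    """Hypothetical helper: increasing ids for one ordered group, gaps allowed."""
    # 1 where a row opens an interval.
    is_start = [int(m == START) for m in markers]
    # 1 where the previous row closed an interval (shifted by one row).
    after_end = [0] + [int(m == END) for m in markers[:-1]]
    # One cumulative sum over both flags; no second pass to close the gaps.
    return list(accumulate(s + e for s, e in zip(is_start, after_end)))

group_2 = [START, NOISE, END, NOISE, NOISE, START, NOISE, END, START, END]
ids = gapped_iids(group_2)
# ids == [1, 1, 1, 2, 2, 3, 3, 3, 5, 5]
```

The value 4 is skipped because the second-to-last row both follows an end marker and is itself a start marker, so the counter advances twice. Rows that belong to the same segment still share an id, which is all the simple enumeration requires.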