raylutz / daffodil

Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
MIT License
7 stars 2 forks source link

Key dictionary construction #6

Closed raylutz closed 6 months ago

raylutz commented 6 months ago

Key dictionary construction is performed for both the hd and kd. The kd may be very large. Therefore, it will be helpful for this to be as fast as possible.

Options:

  1. dict comprehension with enumerated loop:
        kd = {key: index for index, key in enumerate(key_col)}
  1. use generator function
        kd = dict(with_index(key_col))

def with_index(iterable):
    """Like enumerate, but (val, i) tuples instead of (i, val)."""
    for i, item in enumerate(iterable):
        yield (item, i)
  1. dict zip
kd = dict(zip(key_col, range(len(key_col))))
raylutz commented 6 months ago

tested with this code:

import timeit

# Sample list of keys
key_col = list(range(1000))

# Option 1: Dict comprehension with enumerated loop
def option1():
    return {key: index for index, key in enumerate(key_col)}

# Option 2: Using a generator function
def with_index(iterable):
    """Like enumerate, but (val, i) tuples instead of (i, val)."""
    for i, item in enumerate(iterable):
        yield (item, i)

def option2():
    return dict(with_index(key_col))

# Option 3: Using dict zip
def option3():
    return dict(zip(key_col, range(len(key_col))))

# Measure execution time for each option
time_option1 = timeit.timeit(option1, number=10000)
time_option2 = timeit.timeit(option2, number=10000)
time_option3 = timeit.timeit(option3, number=10000)

# Print results
print("Option 1 execution time:", time_option1)
print("Option 2 execution time:", time_option2)
print("Option 3 execution time:", time_option3)

result:

Option 1 execution time: 0.7322188999969512 Option 2 execution time: 1.1231585000059567 Option 3 execution time: 0.4843974999967031

raylutz commented 6 months ago

Conclusion: Change to dict-zip-range approach. This appears to be faster because the algorithm uses "invisible indexes" that are created from the range() statement and then combing with the keys without ever resolving those to any python identifiers.

Option 1 is 60.13% slower than Option 3. Option 2 is 119.56% slower than Option 3.

raylutz commented 6 months ago

fixed in https://github.com/raylutz/daffodil/commit/5d461b905a28c5633da5fe647609c155d24a7ea3