mrpowers-io / levi

Delta Lake helper methods. No Spark dependency.
MIT License

add md5 function #29

Open jeffbrennan opened 2 months ago

jeffbrennan commented 2 months ago

addresses #21

I have a working implementation of an appended md5 column using the standard library hashlib.md5 function.

This likely won't scale to larger tables: the implementation collects the values into a Python list with to_pylist and computes the md5 hash per value in a list comprehension before appending the result to the pyarrow table.

Open to suggestions for improvement!


One thing that is currently unhandled is columns containing null values. According to its documentation, binary_join_element_wise supports three ways of handling nulls:

  1. emitting a null for the full concatenation (default)
  2. skipping the null element
  3. replacing the null element with a string

I'm in favor of either 2 or 3. Let me know your thoughts.

jeffbrennan commented 2 months ago

@MrPowers