pegasystems / pega-datascientist-tools

Pega Data Scientist Tools
https://github.com/pegasystems/pega-datascientist-tools/wiki
Apache License 2.0

Updated and heavily simplified anonymization script #235

Closed StijnKas closed 2 weeks ago

StijnKas commented 2 months ago

We've had an anonymization script in the tools for a while, but it was not performant enough for realistic workloads, so it was time for an update. This version has far fewer configuration options, but it is much more efficient.

It uses a two-pass approach: we first write all input files out as batched parquet files, then loop over those parquet files to generate a single output parquet file.
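For readers who want a feel for the two-pass shape, here is a minimal sketch. It is not the script from this PR: it swaps parquet for JSON Lines so it runs on the standard library alone, and the function names (`first_pass`, `second_pass`, `anonymize`), the `sensitive_columns` parameter, and the truncated SHA-256 hashing are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def hash_value(value):
    """Hypothetical anonymization step: a truncated SHA-256 digest of the value."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]


def write_batch(rows, path):
    """Write one batch of rows to its own intermediate file."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def first_pass(input_files, batch_dir, sensitive_columns, batch_size=1000):
    """Pass 1: stream every input file and write anonymized rows in batches."""
    batch_dir.mkdir(parents=True, exist_ok=True)
    batch_paths, batch = [], []
    for input_file in input_files:
        with Path(input_file).open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                row = json.loads(line)
                for column in sensitive_columns:
                    if column in row:
                        row[column] = hash_value(row[column])
                batch.append(row)
                if len(batch) >= batch_size:
                    path = batch_dir / f"batch_{len(batch_paths):05d}.jsonl"
                    write_batch(batch, path)
                    batch_paths.append(path)
                    batch = []
    if batch:  # flush the final, possibly smaller, batch
        path = batch_dir / f"batch_{len(batch_paths):05d}.jsonl"
        write_batch(batch, path)
        batch_paths.append(path)
    return batch_paths


def second_pass(batch_paths, output_file):
    """Pass 2: loop over the batch files and append them to one output file."""
    rows_written = 0
    with Path(output_file).open("w", encoding="utf-8") as out:
        for path in batch_paths:
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    out.write(line)
                    rows_written += 1
    return rows_written


def anonymize(input_files, output_file, sensitive_columns, batch_size=1000):
    """Run both passes, keeping the intermediate batches in a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        batches = first_pass(input_files, Path(tmp), sensitive_columns, batch_size)
        return second_pass(batches, output_file)
```

Only one batch of rows is held in memory at a time, which is the likely reason a layout like this scales better than loading everything at once. A call would look like `anonymize(["part1.jsonl", "part2.jsonl"], "anonymized.jsonl", ["CustomerID"])`, where the file and column names are again placeholders.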

Many thanks to @danielm-dk for helping improve this part.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 64.57%. Comparing base (bbf667a) to head (70afe32). Report is 44 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   59.63%   64.57%   +4.94%
==========================================
  Files          29       28       -1
  Lines        3793     3498     -295
==========================================
- Hits         2262     2259       -3
+ Misses       1531     1239     -292
```


StijnKas commented 1 month ago

@yusufuyanik1 would you mind giving this a review? I'd like to merge it soon.