PR Checklist
[x] My PR is less than 500 lines of code
[x] I have added sufficient comments as docstrings in my code
[x] I have made corresponding changes to the documentation
Summary
Fix for issue https://github.com/sambanova/generative_data_prep/issues/53
For small datasets the metrics were correct and matched the output dataset, but for larger datasets the reported metrics were far too large.
After a worker finished, we kept adding its metrics to the totals for as long as any other worker was still running. With large datasets the workers rarely finish at the same time, so the finished workers' metrics were summed repeatedly and the totals grew far beyond the true values.
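The failure mode described above can be reproduced with a minimal sketch. This is not the code from this repository: `FakeWorker`, `buggy_total`, and `fixed_total` are hypothetical names, and the polling loop is only an assumption about how the aggregation was structured. It shows why the totals match when all workers finish together and inflate when they do not.

```python
from dataclasses import dataclass


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a worker process (not the real class)."""
    finish_at: int  # polling round in which this worker finishes
    metric: int     # e.g. number of sequences this worker wrote

    def done(self, round_no: int) -> bool:
        return round_no >= self.finish_at


def buggy_total(workers):
    """Re-adds a finished worker's metric on every polling round."""
    total, round_no = 0, 0
    while True:
        round_no += 1
        for w in workers:
            if w.done(round_no):
                total += w.metric  # counted again while others still run
        if all(w.done(round_no) for w in workers):
            return total


def fixed_total(workers):
    """Adds each worker's metric exactly once, when it first finishes."""
    total, round_no = 0, 0
    counted = set()  # indices of workers already included in the total
    while len(counted) < len(workers):
        round_no += 1
        for i, w in enumerate(workers):
            if i not in counted and w.done(round_no):
                total += w.metric
                counted.add(i)
    return total


# Workers finishing at different times, as is typical for large datasets.
staggered = [FakeWorker(finish_at=1, metric=10), FakeWorker(finish_at=3, metric=10)]
# Workers finishing in the same round, as is typical for small datasets.
simultaneous = [FakeWorker(finish_at=1, metric=10), FakeWorker(finish_at=1, metric=10)]

print(buggy_total(staggered), fixed_total(staggered))
print(buggy_total(simultaneous), fixed_total(simultaneous))
```

In this sketch the two functions agree when the workers finish in the same round, and only the buggy version overcounts once finish times diverge, which matches the small-dataset versus large-dataset behavior reported in the issue.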