sambanova / generative_data_prep

Fix bug where for larger datasets the metrics are incorrect #79

Closed. snova-zoltanc closed this 9 months ago

snova-zoltanc commented 9 months ago

Summary

Fix for issue https://github.com/sambanova/generative_data_prep/issues/53

For small datasets, the metrics were correct and matched the output dataset, but for larger datasets the reported metrics were inflated to very large values.

After a worker finished, we kept adding its metrics to the totals for as long as any other worker was still running. With large datasets, workers are unlikely to finish at the same time, so the metrics of already-finished workers were summed repeatedly and the totals exploded in size.
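
To illustrate the failure mode, here is a minimal simulation. It is a sketch under stated assumptions, not the repository's actual code: `Worker`, `finish_poll`, `aggregate_buggy`, and `aggregate_fixed` are hypothetical names, and the polling loop is only assumed to resemble how the main process collects metrics.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical stand-in for a worker process (not from the repo)."""
    tokens: int       # metric the worker reports once it is done
    finish_poll: int  # poll iteration at which the worker finishes


def aggregate_buggy(workers):
    """Re-adds every finished worker's metrics on each poll."""
    total, poll = 0, 0
    while True:
        poll += 1
        done = [w for w in workers if w.finish_poll <= poll]
        # Bug: workers that finished on an earlier poll are summed again.
        total += sum(w.tokens for w in done)
        if len(done) == len(workers):
            return total


def aggregate_fixed(workers):
    """Adds each worker's metrics exactly once, when it is first seen done."""
    total, poll = 0, 0
    counted = set()  # indices of workers whose metrics were already added
    while len(counted) < len(workers):
        poll += 1
        for i, w in enumerate(workers):
            if w.finish_poll <= poll and i not in counted:
                total += w.tokens
                counted.add(i)
    return total


# Workers finishing together (typical for small inputs): both agree.
same_time = [Worker(tokens=10, finish_poll=1), Worker(tokens=10, finish_poll=1)]
print(aggregate_buggy(same_time), aggregate_fixed(same_time))  # 20 20

# Workers finishing at different times (typical for large inputs).
staggered = [Worker(tokens=10, finish_poll=1), Worker(tokens=10, finish_poll=3)]
print(aggregate_buggy(staggered), aggregate_fixed(staggered))  # 40 20
```

In this sketch the two versions agree when all workers finish on the same poll, which is consistent with the metrics looking right on small datasets, and they diverge as soon as finish times are staggered.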

snova-zoltanc commented 9 months ago

LGTM! We should add a test for this as well.

Yeah, let's do that in a future PR :-)