Generate bigger synthetic data using per-batch generation [regression only]

jfomhover commented 2 years ago

This implements a synthetic data generator that is not constrained by available memory (it is still constrained by disk space).

This works by creating a synthetic data generator that produces batches of random data. The generator is iterated until it has produced the required amount of data for training, testing and inferencing, and the batches are appended sequentially to the output.

This is still limited by disk allocation for now.
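
For readers skimming the thread, here is a minimal sketch of the per-batch idea described above. It is not the code from this PR: the function names (`generate_batches`, `write_dataset`), the CSV output format, and the linear-plus-noise target are all assumptions made for illustration, and it uses only the Python standard library. Only one batch is held in memory at a time, so the total size is bounded by disk rather than RAM.

```python
import csv
import os
import random
import tempfile


def generate_batches(total_rows, batch_size, n_features, seed=0):
    """Yield batches of rows; each row is [label, feature_1, ..., feature_n].

    Hypothetical helper: the regression target is an assumed linear model
    with Gaussian noise, shared by every batch.
    """
    rng = random.Random(seed)
    # Draw the weights once so all batches follow the same target function.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    remaining = total_rows
    while remaining > 0:
        rows_in_batch = min(batch_size, remaining)
        batch = []
        for _ in range(rows_in_batch):
            features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            label = sum(w * x for w, x in zip(weights, features))
            label += rng.gauss(0.0, 0.1)  # observation noise
            batch.append([label] + features)
        yield batch
        remaining -= rows_in_batch


def write_dataset(path, total_rows, batch_size, n_features, seed=0):
    """Write batches to `path` one after another; return rows written."""
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for batch in generate_batches(total_rows, batch_size, n_features, seed):
            # Each batch is flushed to the file before the next one is built.
            writer.writerows(batch)
            written += len(batch)
    return written
```

Calling `write_dataset` once per output (for example train, test and inference files, each with its own row count) reproduces the flow described in the PR text; the batch size then controls peak memory use.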

github-actions[bot] commented 2 years ago

Unit Test Results for Build

1 file, 1 suite, 1m 6s :stopwatch:. 97 tests: 97 passed :heavy_check_mark:, 0 skipped :zzz:, 0 failed :x:

Results for commit 99b87e13.

:recycle: This comment has been updated with latest results.

github-actions[bot] commented 2 years ago

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| common | 88% | 0% |
| pipelines.azureml | 83% | 0% |
| scripts | 100% | 0% |
| scripts.data_processing.generate_data | 93% | 0% |
| scripts.data_processing.lightgbm_data2bin | 95% | 0% |
| scripts.data_processing.partition_data | 92% | 0% |
| scripts.inferencing.custom_win_cli | 94% | 0% |
| scripts.inferencing.lightgbm_c_api | 75% | 0% |
| scripts.inferencing.lightgbm_python | 95% | 0% |
| scripts.inferencing.treelite_python | 94% | 0% |
| scripts.model_transformation.treelite_compile | 92% | 0% |
| scripts.sample | 93% | 0% |
| scripts.training.lightgbm_python | 80% | 0% |
| **Summary** | **87%** (1516 / 1733) | **0%** (0 / 0) |

microsoft / lightgbm-benchmark

Generate bigger synthetic data using per-batch generation [regression only] #211