instructkr / LogicKor

A multi-domain reasoning benchmark for Korean language models

[Refactor] generator.py to improve GPU memory utilization #10

Closed sigridjineth closed 7 months ago

sigridjineth commented 7 months ago

Changes

  1. Added new command-line arguments:

    • --batch_size: Sets the batch size used when processing data.
    • --num_workers: Sets the number of worker processes used for data loading.
  2. Utilized PyTorch's DataLoader and Dataset classes:

    • [x] Implemented a custom QuestionDataset class to wrap the df_questions DataFrame and provide an interface for accessing individual samples.
    • [x] Used DataLoader to efficiently load and batch the data, enabling parallel data loading with multiple worker processes.
    • [x] Used ThreadPoolExecutor to process batches of data in parallel, leveraging multiple CPU cores.
    • [x] Increased num_workers in the DataLoader to enable multi-process data loading and overlap data loading with GPU computation.
    • [x] Added prefetch_factor=2 to the DataLoader so each worker prefetches batches into host memory ahead of time, keeping the GPU fed during computation.
    • [x] Set pin_memory=True in the DataLoader to use pinned memory for faster data transfer between CPU and GPU.
    • [x] Separated the data processing logic into a process_batch function for better modularity and readability.
    • [x] Introduced a collate_fn to handle batching of data in the DataLoader.
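The two new flags from item 1 can be declared with `argparse`; this is a minimal sketch, and the default values below are illustrative assumptions, not values taken from the actual generator.py.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the PR description; defaults here are assumed.
    parser = argparse.ArgumentParser(description="LogicKor answer generator")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="Batch size for processing data")
    parser.add_argument("--num_workers", type=int, default=4,
                        help="Number of worker processes for data loading")
    return parser

# Parse the same values as the example run below.
args = build_parser().parse_args(["--batch_size", "512", "--num_workers", "128"])
print(args.batch_size, args.num_workers)  # → 512 128
```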
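A minimal sketch of the `QuestionDataset`/`DataLoader`/`collate_fn` wiring described above. A plain list stands in for the `df_questions` DataFrame, and the `collate_fn` keeps each batch as a list of dicts rather than tensorizing string prompts; the real script's field names and batch shapes may differ.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class QuestionDataset(Dataset):
    """Wraps a sequence of question records (df_questions in the PR)."""
    def __init__(self, questions):
        self.questions = list(questions)

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        # In the real script this would index df_questions.iloc[idx].
        return {"id": idx, "question": self.questions[idx]}

def collate_fn(batch):
    # Keep batches as plain lists of dicts; prompts are strings, not tensors.
    return batch

dataset = QuestionDataset([f"question {i}" for i in range(10)])
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=0,  # the real run sets --num_workers; 0 keeps this demo portable
    collate_fn=collate_fn,
    pin_memory=torch.cuda.is_available(),  # pinned host memory speeds CPU→GPU copies
    # prefetch_factor=2,  # only valid when num_workers > 0
)

batches = list(loader)
print(len(batches), len(batches[0]))  # → 3 4
```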

Simple Benchmark

Example Run

```shell
python generator_2.py --gpu_devices 1 --model maywell/TinyWand-kiqu --template ./templates/template-EEVE.json --model_len 2048 --batch_size 512 --num_workers 128
```
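The `process_batch`/`ThreadPoolExecutor` pattern from the change list can be sketched with the standard library alone. `process_batch` here is a hypothetical stand-in: the PR's real function would build prompts from the batch and run model generation.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Stand-in for the PR's process_batch: the real function would
    # run generation on the GPU for the prompts in this batch.
    return [q.upper() for q in batch]

# Batches as they might come out of the DataLoader.
batches = [["q1", "q2"], ["q3", "q4"], ["q5"]]

# Threads overlap work across batches; executor.map preserves input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, batches))

print(results)  # → [['Q1', 'Q2'], ['Q3', 'Q4'], ['Q5']]
```

Note that Python threads help most when `process_batch` is I/O-bound or releases the GIL (as GPU inference calls typically do); for pure-Python CPU work a process pool would be the better fit.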