vistec-AI / WangchanX

WangchanX Fine-tuning Pipeline
Apache License 2.0

Datasets #4

Open Chalermpun opened 2 months ago

Chalermpun commented 2 months ago

Requirements

Chalermpun commented 2 months ago

Pattern Design

  1. Strategy Pattern: encapsulate each function in its own strategy class so that implementations can be swapped.
  2. Template Method Pattern: the create_flan_dataset function runs a series of steps for loading and processing different datasets. Extract those steps into a template method in a base class and define abstract methods for the dataset-specific operations.
  3. Factory Pattern: instead of creating dataset objects directly with load_dataset and load_from_disk, introduce a factory class that creates and configures dataset objects based on the provided parameters.
  4. Decorator Pattern: wrap dataset objects to add functionality without changing their classes.
  5. Facade Pattern: encapsulate the complexity of dataset creation behind a simplified interface for client code.
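
A minimal sketch of how the first three patterns could fit together, using plain lists of dicts in place of real dataset objects so it runs without any dependencies. Every name here (BaseDataProcessor, SentimentProcessor, TranslationProcessor, create_processor, the prompt functions, the column names) is hypothetical and chosen for illustration; none of it is taken from the WangchanX code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

Record = Dict[str, str]
# Strategy: an interchangeable rule for turning a row into a prompt.
PromptStrategy = Callable[[Record], str]


def plain_prompt(row: Record) -> str:
    return row["text"]


def instruction_prompt(row: Record) -> str:
    return f"Classify the sentiment: {row['text']}"


class BaseDataProcessor(ABC):
    """Template Method: process() fixes the step order, subclasses fill in the steps."""

    def __init__(self, rows: List[Record], prompt: PromptStrategy = plain_prompt):
        self.rows = rows
        self.prompt = prompt

    def process(self) -> List[Record]:
        kept = [row for row in self.rows if self.keep(row)]
        return [{"input": self.prompt(row), "target": self.target(row)} for row in kept]

    def keep(self, row: Record) -> bool:
        # Optional hook: keep every row unless a subclass overrides this.
        return True

    @abstractmethod
    def target(self, row: Record) -> str:
        """Dataset-specific step: pick the expected output for a row."""


class SentimentProcessor(BaseDataProcessor):
    def keep(self, row: Record) -> bool:
        return bool(row["text"].strip())  # drop empty texts

    def target(self, row: Record) -> str:
        return row["label"]


class TranslationProcessor(BaseDataProcessor):
    def target(self, row: Record) -> str:
        return row["translation"]


# Factory: map a dataset name to the class that knows how to process it.
PROCESSORS: Dict[str, Type[BaseDataProcessor]] = {
    "sentiment": SentimentProcessor,
    "translation": TranslationProcessor,
}


def create_processor(
    name: str, rows: List[Record], prompt: PromptStrategy = plain_prompt
) -> BaseDataProcessor:
    try:
        cls = PROCESSORS[name]
    except KeyError:
        raise ValueError(f"unknown dataset: {name!r}") from None
    return cls(rows, prompt)
```

With this shape, supporting a new dataset would mean adding one subclass and one registry entry, while the loading and formatting order stays in a single place.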

Source code structure

src/
│
├── datasets/
│   ├── __init__.py
│   ├── base_dataset.py
│   ├── huggingface_dataset.py
│   ├── json_dataset.py
│   ├── csv_dataset.py
│   └── ...
│
├── data_processors/
│   ├── __init__.py
│   ├── base_data_processor.py
│   ├── iapp_wiki_processor.py
│   ├── scb_translation_processor.py
│   ├── wisesight_sentiment_processor.py
│   └── ...
│
├── data_transformers/
│   ├── __init__.py
│   ├── base_data_transformer.py
│   ├── map_transformer.py
│   ├── filter_transformer.py
│   ├── rename_columns_transformer.py
│   └── ...
│
├── flan_creator/
│   ├── __init__.py
│   ├── flan_creator_base.py
│   └── flan_creator.py
│
├── utils/
│   ├── __init__.py
│   └── ...
│
├── requirements.txt
│
├── tests/
│
└── main.py
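
As a rough illustration of what the data_transformers/ package in the tree above might contain, the sketch below defines a shared base class and three small transformers that can be chained. The file names come from the tree, but the class names, the transform method, the apply_all helper, and the use of plain lists of dicts instead of real dataset objects are all assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

Row = Dict[str, object]


class BaseDataTransformer(ABC):
    """Common interface: take a list of rows, return a new list of rows."""

    @abstractmethod
    def transform(self, rows: List[Row]) -> List[Row]:
        ...


class MapTransformer(BaseDataTransformer):
    def __init__(self, fn: Callable[[Row], Row]):
        self.fn = fn

    def transform(self, rows: List[Row]) -> List[Row]:
        # Pass a copy of each row so the input list is never mutated.
        return [self.fn(dict(row)) for row in rows]


class FilterTransformer(BaseDataTransformer):
    def __init__(self, predicate: Callable[[Row], bool]):
        self.predicate = predicate

    def transform(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if self.predicate(row)]


class RenameColumnsTransformer(BaseDataTransformer):
    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def transform(self, rows: List[Row]) -> List[Row]:
        # Columns missing from the mapping keep their original name.
        return [{self.mapping.get(k, k): v for k, v in row.items()} for row in rows]


def apply_all(rows: List[Row], transformers: List[BaseDataTransformer]) -> List[Row]:
    """Run the transformers in order, feeding each one the previous output."""
    for transformer in transformers:
        rows = transformer.transform(rows)
    return rows


if __name__ == "__main__":
    raw = [{"texts": "Hello", "category": "pos"}, {"texts": "", "category": "neg"}]
    steps = [
        RenameColumnsTransformer({"texts": "text", "category": "label"}),
        FilterTransformer(lambda r: r["text"] != ""),
        MapTransformer(lambda r: {**r, "text": r["text"].lower()}),
    ]
    print(apply_all(raw, steps))
```

Keeping each transformer behind the same interface means a processor can be configured with any ordered list of steps.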

Directories

Main Script