intel-analytics / ipex-llm


Chronos: Some new API suggestions for `TSDataset` #6054

Open liangs6212 opened 1 year ago

liangs6212 commented 1 year ago

Why do we need this change?

pandas performs poorly, and the API would be easier to use with fewer cascade (chained) calls.

How can we modify it?

  1. `roll` + `to_numpy`: This is a classic combination that has to be called almost every time. Users probably don't need to know what `roll` is, so we could fold it into `to_numpy` (or `numpy`) and keep only that call. Btw, these changes should not affect the use of `to_torch_data_loader`.

    # API change
    from bigdl.chronos.data import TSDataset
    tsdata = TSDataset.from_pandas(..., lookback=48, horizon=1, with_split=False)
    x, y = tsdata.to_numpy()  # like to_torch_data_loader

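
For readers unfamiliar with what `roll` does, here is a minimal sketch of the windowing that a merged `to_numpy` would have to perform internally. The helper name `roll_to_numpy` is hypothetical and not part of Chronos, and for simplicity it treats every column as both feature and target, which the real `TSDataset` does not necessarily do.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def roll_to_numpy(values, lookback, horizon):
    """Hypothetical helper: cut a (time, feature) array into (x, y) samples.

    x has shape (n_samples, lookback, n_features),
    y has shape (n_samples, horizon, n_features).
    """
    values = np.asarray(values)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1-D series as a single feature
    window = lookback + horizon
    if len(values) < window:
        raise ValueError("series is shorter than lookback + horizon")
    # sliding_window_view appends the window axis last:
    # (n_samples, n_features, window)
    windows = sliding_window_view(values, window_shape=window, axis=0)
    windows = np.moveaxis(windows, -1, 1)  # (n_samples, window, n_features)
    x = windows[:, :lookback, :]
    y = windows[:, lookback:, :]
    return x, y


series = np.arange(10, dtype=float)
x, y = roll_to_numpy(series, lookback=3, horizon=1)
print(x.shape, y.shape)
```

With `lookback` and `horizon` fixed at construction time, as proposed above, the user would never need to call a separate rolling step.
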
  2. Optimize some existing APIs: Many of the cascade calls may not be necessary, and we could change some of them to properties. The methods are classified by framework, with some usage given.

| Category | pandas | tsfresh | scikit-learn | other |
| --- | --- | --- | --- | --- |
| Method | deduplicate/impute/resample | gen_dt_feature/gen_global_feature/gen_rolling_feature | scale/unscale/unscale_numpy | to_tf_dataset/to_numpy/to_torch_data_loader/to_pandas |
| Advice | Change to attributes | No change | Calling scale will change the source data; can we leave the original data unchanged so we don't need unscale and unscale_numpy either? | Merge roll (exclude to_pandas/to_torch_data_loader) |

    # Change pandas-related methods to attributes.
    tsdata = TSDataset.from_pandas(..., impute=True, impute_mode="const",
                                   const_num=0, deduplicate=True,
                                   resample=True, interval='s', start_time=None,
                                   end_time=None, merge_mode='mean', with_split=False)

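
To illustrate how such constructor flags could replace the chained `deduplicate`/`impute`/`resample` calls, here is a rough sketch in plain pandas. The function `preprocess`, its parameter names, and the order of the steps are assumptions made for illustration only, not the Chronos implementation. It also returns a new frame instead of modifying the input, in the spirit of the advice on `scale`.

```python
import pandas as pd


def preprocess(df, dt_col, *, deduplicate=False, resample=False,
               interval="1min", merge_mode="mean",
               impute=False, const_num=0):
    """Hypothetical helper: apply optional cleaning steps to a copy of df."""
    out = df.copy()  # leave the caller's data untouched
    if deduplicate:
        out = out.drop_duplicates()
    if resample:
        # Aggregate rows that fall into the same interval; empty
        # intervals become NaN when merge_mode is "mean".
        out = (out.set_index(dt_col)
                  .resample(interval)
                  .agg(merge_mode)
                  .reset_index())
    if impute:
        # Constant imputation only, mirroring impute_mode="const".
        out = out.fillna(const_num)
    return out


raw = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:00",
                                "2023-01-01 00:01", "2023-01-01 00:03"]),
    "value": [1.0, 1.0, 2.0, 4.0],
})
clean = preprocess(raw, "datetime", deduplicate=True, resample=True,
                   interval="1min", impute=True, const_num=0)
print(clean)
```

Imputing after resampling is a design choice in this sketch: resampling is what creates the gap at 00:02, so filling it afterwards keeps the result free of NaN.
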
  3. We can use Descriptor and Property to manage properties and methods. For more information, please refer to #5656.
    
    @property
    def get_cycle_length(self):
    cycle_length = (...)
    return cycle_length

@get_cycle_length.setattr def get_cycle_length(self, instance, value):

Check for illegal input

if not isinstance(value, str):
    raise error
return cycle_length

Usage

tsdataset.get_cycle_length = 'min' # Set the mode of cycle_length.
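
As a sketch of the Descriptor variant mentioned above, the same validation can live in a reusable descriptor class instead of a per-attribute setter. `ChoiceField`, `ToyDataset`, the attribute name `cycle_length_mode`, and the list of allowed modes are all hypothetical names chosen for this example. They are not taken from Chronos or from #5656.

```python
class ChoiceField:
    """Hypothetical descriptor that only accepts strings from a fixed set."""

    def __init__(self, choices):
        self.choices = tuple(choices)

    def __set_name__(self, owner, name):
        # Store the value on the instance under a private name.
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not an instance
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        # Check for illegal input before storing anything.
        if not isinstance(value, str):
            raise TypeError(f"expected str, got {type(value).__name__}")
        if value not in self.choices:
            raise ValueError(f"expected one of {self.choices}, got {value!r}")
        setattr(instance, self.private_name, value)


class ToyDataset:
    # The allowed modes below are illustrative only.
    cycle_length_mode = ChoiceField(["min", "max", "mean", "median", "mode"])

    def __init__(self, mode="mode"):
        self.cycle_length_mode = mode  # goes through ChoiceField.__set__


ds = ToyDataset()
ds.cycle_length_mode = "min"
print(ds.cycle_length_mode)
```

The descriptor keeps the checks in one place, so several attributes can share them without repeating a property and setter pair for each.
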


4. Because of the poor performance of pandas, we can add `polars` as a new backend. `polars` has good parallel performance and supports a lazy API.
```python
tsdata = TSDataset.from_pandas(df, ..., use_polars=True)
```

pandas and polars performance comparison: https://h2oai.github.io/db-benchmark/

Differences between pandas and polars:

  1. polars does not have indexes.
  2. groupby can only return a single data column.

liangs6212 commented 1 year ago

I'm not sure these changes are necessary, so please have a look. @TheaperDeng @rnwang04 @plusbang