intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

[Discussion] Operations needed to be supported in shards #5694

Open dding3 opened 2 years ago

dding3 commented 2 years ago

To provide a better user experience with orca shards, this issue is created to discuss which operations need to be supported in orca shards.

The operations listed below are motivated by the following notebooks.

https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python — operations used:

  1. isnull, sum, sort_values, standard_scaler, get_dummies
  2. nice to have: describe (summary of a dataframe), corr, argsort
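For reference, a minimal plain-pandas sketch of the operations listed above, on toy data rather than shards (the `sklearn` StandardScaler stands in for `standard_scaler`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "price": [100.0, 200.0, None, 400.0],
    "rooms": [2, 3, 3, 5],
    "type": ["A", "B", "A", "B"],
})

# Count missing values per column, then rank columns by missing count
missing = df.isnull().sum().sort_values(ascending=False)

# One-hot encode the categorical column
dummies = pd.get_dummies(df["type"], prefix="type")

# Standardize a numeric column to zero mean / unit variance
scaled = StandardScaler().fit_transform(df[["rooms"]])
```

The nice-to-haves map to `df.describe()`, `df.corr()`, and `numpy.argsort`.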

https://www.kaggle.com/code/isaienkov/riiid-answer-correctness-prediction-eda-modeling — operations used:

  1. isnull, sum, groupby, agg, merge, fillna, not (boolean negation, `~`)
  2. nice to have: sklearn.feature_selection.RFE
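A minimal plain-pandas sketch of this group (toy data and column names are illustrative, not from the notebook):

```python
import pandas as pd

train = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "answered_correctly": [1, 0, 1, 1, None],
})

# Per-user aggregation with groupby + agg
user_stats = train.groupby("user_id").agg(
    attempts=("answered_correctly", "count"),
    accuracy=("answered_correctly", "mean"),
).reset_index()

# Join the aggregated features back onto the original rows
merged = train.merge(user_stats, on="user_id", how="left")

# Replace missing labels, then filter with a negated ("not") mask
merged["answered_correctly"] = merged["answered_correctly"].fillna(0)
wrong = merged[~(merged["answered_correctly"] == 1)]
```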

https://www.kaggle.com/code/ammar111/youtube-trending-videos-analysis — operations used:

  1. fillna, isna, value_counts, count, filter, groupby
  2. nice to have: describe, most_common, corr, sort_values
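A minimal plain-pandas sketch of the main operations in this group (toy data, not the actual YouTube dataset):

```python
import pandas as pd

videos = pd.DataFrame({
    "channel": ["a", "a", "b", "c", "c", "c"],
    "views": [10, None, 30, 40, 50, 60],
})

# Flag missing view counts, then fill them
n_missing = videos["views"].isna().sum()
videos["views"] = videos["views"].fillna(0)

# Frequency of each channel (value_counts sorts descending by default)
top_channels = videos["channel"].value_counts()

# Keep only channels with at least two videos (groupby + filter)
frequent = videos.groupby("channel").filter(lambda g: len(g) >= 2)
```

Among the nice-to-haves, `most_common` is from `collections.Counter`; `describe`, `corr`, and `sort_values` are standard DataFrame methods.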

https://www.kaggle.com/code/jiashenliu/introduction-to-financial-concepts-and-data — operations used:

  1. filter; converting a pandas Series to a NumPy array, processing it with NumPy operations, and writing the result back as a new column
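A minimal sketch of that pattern (the loan data and the monthly-rate computation are illustrative assumptions, not taken from the notebook):

```python
import numpy as np
import pandas as pd

loans = pd.DataFrame({
    "principal": [1000.0, 2000.0, 500.0],
    "rate": [0.05, 0.07, 0.10],
})

# Row filter with a boolean mask
large = loans[loans["principal"] >= 1000]

# Pull a column out as a NumPy array, process it with NumPy,
# and write the result back as a new column
rates = loans["rate"].to_numpy()
loans["monthly_rate"] = np.round(rates / 12, 6)
```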
jason-dai commented 2 years ago

Please summarize, for each example, what additional operations are needed.