pytorch / torcharrow

High performance model preprocessing library on PyTorch
https://pytorch.org/torcharrow/beta/index.html
BSD 3-Clause "New" or "Revised" License
641 stars 78 forks

Support More Operations for Recommendation Systems #494

Open Ash-Zheng opened 1 year ago

Ash-Zheng commented 1 year ago

Hi,

I noticed that some data preprocessing operations used in recommendation systems, such as bucketize, sigridHash, and firstX, are implemented in torcharrow/tree/main/csrc/velox/functions/rec.

I would like to ask whether other preprocessing operations for recommendation systems will be supported in the future. For example, a recent paper from Meta [1] lists 16 common preprocessing operations in Table 11, including bucketize, sigridHash, firstX, Cartesian, IdListTransform, BoxCox, MapId, and NGram. Most of them are not supported in torcharrow at the moment. Are there plans to add them?

[1] Zhao, Mark, et al. "Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product." Proceedings of the 49th Annual International Symposium on Computer Architecture. 2022.
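
To make the request more concrete, here is a rough plain-Python sketch of what four of these operations are commonly understood to compute. It is not torcharrow's API and may not match the exact semantics in [1]: the function names (`bucketize`, `first_x`, `ngram`, `box_cox`), their signatures, and the border convention in `bucketize` are all assumptions made for illustration.

```python
import math
from bisect import bisect_left
from typing import List, Sequence, Tuple


def bucketize(value: float, borders: Sequence[float]) -> int:
    """Return the index of the bucket that `value` falls into.

    `borders` must be sorted in ascending order.
    Assumption: a value equal to a border goes into the lower bucket.
    """
    return bisect_left(borders, value)


def first_x(ids: Sequence[int], x: int) -> List[int]:
    """Keep only the first `x` elements of an id list."""
    return list(ids[:x])


def ngram(ids: Sequence[int], n: int) -> List[Tuple[int, ...]]:
    """Return every run of `n` consecutive ids as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]


def box_cox(value: float, lmbda: float) -> float:
    """Box-Cox transform of a strictly positive value."""
    if value <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lmbda == 0:
        return math.log(value)
    return (value ** lmbda - 1.0) / lmbda
```

For example, with borders `[1.0, 2.0, 3.0]` this sketch maps `2.5` to bucket `2`, and `ngram([1, 2, 3, 4], 2)` yields the three pairs `(1, 2)`, `(2, 3)`, `(3, 4)`. Vectorized, Velox-backed versions of these would be what I am hoping for in torcharrow.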

wenleix commented 1 year ago

cc @YLGH