VECF (Visually Explainable Collaborative Filtering): represents a fashion image by learning an attention model over many pre-segmented regions -> highlights valuable image regions via attention
Challenges: 1) With only binary implicit feedback, there is little information for discovering which parts of an item a user is interested in. 2) It is hard to set a unified segmentation granularity when selecting image segments, since user preferences are diverse and changeable. 3) There is no dataset for evaluating whether the provided visual explanations are reasonable. ---> a multimodal attention network with fixed region proposals for fine-grained visual preference modeling, jointly learning from user reviews (which strongly reflect user opinions and sentiments) and fashion images.
Review information is modeled by a customized LSTM, and visual features are seamlessly infused into the word generation process. Each fashion image is directly divided into many small grids, whose importance is learned via the attention mechanism.
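The grid-attention idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the user embedding, the per-region feature matrix, and the simple dot-product scoring are all assumptions for the sake of the example (VECF's actual scoring involves learned projection parameters).

```python
import numpy as np

def attend_over_regions(user_vec, region_feats):
    """Attention-weighted pooling over image grid regions.

    user_vec:     (d,)   user embedding
    region_feats: (m, d) features for m grid regions (e.g. m = 7*7 = 49)
    Returns the pooled image vector (d,) and the attention weights (m,).
    """
    scores = region_feats @ user_vec                  # (m,) relevance of each region
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over regions
    pooled = weights @ region_feats                   # (d,) user-specific image vector
    return pooled, weights

rng = np.random.default_rng(0)
user = rng.normal(size=16)
regions = rng.normal(size=(49, 16))   # 7x7 grid, 16-dim features per cell
img_vec, attn = attend_over_regions(user, regions)
```

The attention weights `attn` are what make the model explainable: the highest-weighted grid cells can be shown to the user as "why this was recommended".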
Created 500 labeled images (7x7 regions each) for quantitative evaluation of the explanations, based on review data (i.e., which part the user is interested in). (cool!)
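One plausible way to use such region labels for quantitative evaluation is a precision-at-k over the top attended grid cells; this metric and the toy numbers below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def region_precision_at_k(attn_weights, labeled_regions, k=5):
    """Fraction of the top-k attended grid cells that annotators marked as relevant."""
    top_k = np.argsort(attn_weights)[::-1][:k]   # indices of the k largest weights
    return len(set(top_k) & set(labeled_regions)) / k

attn = np.zeros(49)
attn[[3, 10, 20]] = [0.5, 0.3, 0.2]           # model attends mostly to cells 3, 10, 20
labels = {3, 10, 41}                           # annotator-marked regions of interest
p = region_precision_at_k(attn, labels, k=3)   # -> 2/3
```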
Others
not only "this item is recommended to you because ...", but also "this item has a nice pocket and a nice wide neck opening", like a shop staff member would say. <- the merit of visual explanation
Basic Information
Link
https://dl.acm.org/doi/10.1145/3331184.3331254
Overview
Reference (for understanding)