The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on
various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from
8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019.
On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name, and the add-to-cart
and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label,
taking one of the values: Price, Name, Main picture, Add to cart, and Cart.
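As an illustration of how the labels can be read back from a page, here is a minimal sketch using only the Python standard library. The attribute name and label values come from the description above; the helper names (LabelCollector, find_labelled_elements) and the sample markup are hypothetical and are not part of the dataset tooling.

```python
from html.parser import HTMLParser

# Attribute name and label values as described in the dataset documentation.
LABEL_ATTR = "klarna-ai-label"
KNOWN_LABELS = {"Price", "Name", "Main picture", "Add to cart", "Cart"}


class LabelCollector(HTMLParser):
    """Collect (label, tag) pairs for every element carrying the label attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased by the parser.
        for name, value in attrs:
            if name == LABEL_ATTR and value in KNOWN_LABELS:
                self.found.append((value, tag))


def find_labelled_elements(html_text):
    """Return the labelled elements of an HTML string in document order."""
    collector = LabelCollector()
    collector.feed(html_text)
    return collector.found


# Hypothetical markup, only for demonstration.
sample = """
<html><body>
  <h1 klarna-ai-label="Name">Blue mug</h1>
  <span klarna-ai-label="Price">9.99</span>
  <img klarna-ai-label="Main picture" src="mug.jpg">
  <button klarna-ai-label="Add to cart">Buy</button>
</body></html>
"""

for label, tag in find_labelled_elements(sample):
    print(f"{label}: <{tag}>")
```

Since the go-to-cart and add-to-cart buttons are only labelled if found, callers should not assume that all five labels are present on every page.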
The snapshots are available in 3 formats: as MHTML files, as WebTraversalLibrary (WTL) snapshots, and as screenshots. The MHTML format is less lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.
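MHTML files are MIME multipart documents, so the main HTML part can usually be recovered with Python's standard email package. The sketch below assumes the snapshot follows the usual multipart/related layout with a text/html part; the function name extract_html_from_mhtml and the sample bytes are hypothetical, and real snapshots may need extra handling (for example, several HTML parts for frames).

```python
import email
from email import policy


def extract_html_from_mhtml(raw_bytes):
    """Return the first text/html part of an MHTML document as a string, or None."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (e.g. quoted-printable)
            # and decodes the bytes using the declared charset.
            return part.get_content()
    return None


# Hypothetical, hand-written MHTML fragment, only for demonstration.
sample_mhtml = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----demo"\r\n'
    b"\r\n"
    b"------demo\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/product\r\n"
    b"\r\n"
    b'<span klarna-ai-label=3D"Price">9.99</span>\r\n'
    b"------demo--\r\n"
)

html = extract_html_from_mhtml(sample_mhtml)
print(html)
```

In practice the bytes would come from reading a snapshot file in binary mode.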
For more information about the contents of the datasets (statistics etc.) please refer to the following arXiv paper.
If you found this dataset useful in your research, please cite the paper as follows:
@misc{hotti2024klarnaproductpagedataset,
title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
year={2024},
eprint={2111.02168},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2111.02168},
}
UPDATE: The hosting platform of the dataset has been changed, and the WTL, MHTML, and screenshot formats of the dataset can now be downloaded via Zenodo.
The dataset was created for the purpose of benchmarking representation learning algorithms on the task of web element prediction on e-commerce websites. Specifically, the dataset provides algorithms with a large-scale, diverse and realistic corpus of labelled product pages containing both DOM-tree representations and page screenshots. Previous datasets used in evaluations of DOM-element prediction algorithms contained pages generated from very few (only 80) templates. The Klarna Product Page Dataset contains 51,701 product pages belonging to 8,175 websites, allowing algorithms to potentially learn more complex and abstract representations on one vertical: e-commerce product pages.
The dataset was created by the Web Automation Research team within Klarna.
The dataset contains offline copies of publicly available e-commerce product web pages.
The dataset contains 51,701 product pages belonging to 8,175 websites over 8 different markets.
The dataset is a sample of all existing product pages available in 8 markets. Due to limited resources, the dataset is not representative of the wider set of all possible product web pages you might encounter over the internet since:
Each example consists of an MHTML or WebTraversalLibrary (WTL) clone of the loaded page. A screenshot is provided for the examples that render correctly in a mobile browser.
On each page, there are 5 elements of interest: the price of the product, its image, its name, and the add-to-cart
and go-to-cart buttons (if found).
These labels are present in the HTML code as an attribute called klarna-ai-label,
taking one of the values: Price, Name, Main picture, Add to cart, and Cart.
Some pages might be empty due to errors when loading.
Yes, pages belonging to the same market appear in the same directory. Pages belonging to the same merchant also appear in the same directory.
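Because pages are grouped into directories by market and by merchant, snapshot paths can be indexed by those two keys. The sketch below assumes a layout of the form market/merchant/page, which is an assumption about the nesting order and not something the documentation spells out; the function name group_pages and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_pages(relative_paths):
    """Group page paths by (market, merchant).

    Assumes each path looks like <market>/<merchant>/<page>[/...];
    this nesting order is an assumption, so adjust the indices if the
    actual archive is laid out differently.
    """
    groups = defaultdict(list)
    for raw in relative_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3:
            continue  # too shallow to carry market, merchant and page
        market, merchant = parts[0], parts[1]
        groups[(market, merchant)].append(raw)
    return dict(groups)


# Hypothetical paths, only for demonstration.
example_paths = [
    "SE/merchant-a/page-1/source.html",
    "SE/merchant-a/page-2/source.html",
    "US/merchant-b/page-1/source.html",
    "README.txt",
]

for key, pages in group_pages(example_paths).items():
    print(key, len(pages))
```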
Yes, the dataset is presented in a pre-made train/test split (with ratios 0.8, 0.2). The recommended split is done such that no merchant appears in more than one set, so that no webpage template is shared between the two sets. This makes it possible to gauge the algorithms' ability to generalise to unseen merchants.
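The merchant-disjoint property can be checked, or reproduced when building a custom split, with a few lines of Python. The sketch below is not the procedure used to create the published split: it is a hypothetical hash-based assignment that sends every page of a merchant to the same set, with roughly 20% of merchants going to the test set. All names and the example data are illustrative.

```python
import hashlib


def assign_split(merchant_id, test_ratio=0.2):
    """Deterministically map a merchant to 'train' or 'test' by hashing its id."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_ratio else "train"


def split_pages(pages, test_ratio=0.2):
    """Split (merchant_id, page_id) pairs so that each merchant lands in one set only."""
    split = {"train": [], "test": []}
    for merchant_id, page_id in pages:
        split[assign_split(merchant_id, test_ratio)].append((merchant_id, page_id))
    return split


def overlapping_merchants(split):
    """Return merchants that appear in both sets; empty for a valid split."""
    train_merchants = {merchant for merchant, _ in split["train"]}
    test_merchants = {merchant for merchant, _ in split["test"]}
    return train_merchants & test_merchants


# Hypothetical data: 50 merchants with 3 pages each.
pages = [(f"merchant-{i}", f"page-{j}") for i in range(50) for j in range(3)]
split = split_pages(pages)
print(len(split["train"]), len(split["test"]), overlapping_merchants(split))
```

The overlapping_merchants check applies equally to the pre-made split: running it on the provided train and test sets should return an empty set.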
Add to cart button could trigger the same event.
Yes. The dataset is made available via the Registry of Open Data on AWS.
No. The dataset contains only publicly available webpages.
The data was directly observable.
The sampling strategy was deterministic for merchants. The product pages were curated by the analysts labelling the pages, with a bias towards pages corresponding to products that did not require configuring (e.g. one-size-fits-all, default size selection etc.).
The data instances in the dataset were collected between 2018 and 2019.
In the case of the MHTML dataset, no. For the dataset in the WTL format, the element metadata additionally contains the local node text, extracted from the source HTML. The screenshots were manually reviewed, and instances of pages not rendered correctly (e.g., where overlays, menus, or cookie dialogs covered product information) were removed.