klarna / product-page-dataset


The Klarna Product Page Dataset

Product Pages

Description

The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.
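Because the labels live in the markup itself, they can be pulled out with a standard HTML parser. A minimal sketch using Python's standard library (the HTML fragment below is fabricated for illustration, not taken from the dataset):

```python
from html.parser import HTMLParser

class LabelExtractor(HTMLParser):
    """Collects every tag carrying a klarna-ai-label attribute."""
    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "klarna-ai-label":
                self.labels.append((tag, value))

# Fabricated example fragment in the style described above.
snippet = """
<div klarna-ai-label="Price">$19.99</div>
<img klarna-ai-label="Main picture" src="shoe.jpg">
<button klarna-ai-label="Add to cart">Add to cart</button>
"""
parser = LabelExtractor()
parser.feed(snippet)
print(parser.labels)
# [('div', 'Price'), ('img', 'Main picture'), ('button', 'Add to cart')]
```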

The snapshots are available in 3 formats: as MHTML files, as WebTraversalLibrary (WTL) snapshots, and as screenshots. The MHTML format is the least lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; the WTL snapshots provide the HTML of the rendered DOM tree together with additional page and element metadata with rendering information (bounding boxes of elements, font sizes, etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchant in the test set is present in the training set.
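Since the screenshot dataset mirrors the WTL folder structure, the two can be joined by relative path. A hedged sketch, assuming JSON snapshot files and PNG screenshots (the actual file extensions and layout should be checked against the downloaded archives):

```python
from pathlib import Path

def pair_snapshots(wtl_root: str, screenshot_root: str):
    """Match each WTL snapshot with the screenshot at the same
    relative path. Extensions (.json / .png) are illustrative
    assumptions, not the dataset's documented format."""
    wtl_dir, shot_dir = Path(wtl_root), Path(screenshot_root)
    pairs = {}
    for snapshot in wtl_dir.rglob("*.json"):
        rel = snapshot.relative_to(wtl_dir)
        screenshot = shot_dir / rel.with_suffix(".png")
        # None marks snapshots with no corresponding screenshot
        pairs[rel.as_posix()] = screenshot if screenshot.exists() else None
    return pairs
```

The mapping keys each snapshot by its market/merchant-relative path, so missing screenshots (pages that did not render correctly) are easy to detect.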

Corresponding Publication

For more information about the contents of the datasets (statistics, etc.) please refer to the following arXiv paper.

If you found this dataset useful in your research, please cite the paper as follows:

@misc{hotti2024klarnaproductpagedataset,
      title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models}, 
      author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
      year={2024},
      eprint={2111.02168},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2111.02168}, 
}

Download under the Creative Commons BY-NC-SA license:

UPDATE: The hosting platform of the dataset has been changed, and the WTL, MHTML, and screenshot formats of the dataset can now be downloaded here via Zenodo.

Datasheets for Datasets Documentation

Motivation

For what purpose was the dataset created?

The dataset was created for the purpose of benchmarking representation learning algorithms on the task of web element prediction on e-commerce websites. Specifically, the dataset provides algorithms with a large-scale, diverse and realistic corpus of labelled product pages containing both DOM-tree representations and page screenshots. Previous datasets used in evaluations of DOM-element prediction algorithms contained pages generated from very few (only 80) templates. The Klarna Product Page Dataset contains 51,701 product pages belonging to 8,175 websites, allowing algorithms to potentially learn more complex and abstract representations within one vertical: e-commerce product pages.

Who created the dataset and on behalf of which entity?

The dataset was created by the Web Automation Research team within Klarna.

Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

The dataset contains offline copies of publicly available e-commerce product web pages.

How many instances are there in total (of each type, if appropriate)?

The dataset contains 51,701 product pages belonging to 8,175 websites over 8 different markets.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

The dataset is a sample of all existing product pages available in the 8 markets. Due to limited resources, the dataset is not representative of the wider set of all possible product web pages encountered on the internet.

What data does each instance consist of?

Each example consists of an MHTML or WebTraversalLibrary (WTL) clone of the loaded page. A screenshot is provided for the examples that render correctly in a mobile browser.

Is there a label or target associated with each instance?

On each page, there are 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.

Is any information missing from individual instances?

Some pages might be empty due to errors when loading.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

Yes, pages belonging to the same market appear in the same directory. Pages belonging to the same merchant also appear in the same directory.

Are there recommended data splits (e.g., training, development/validation, testing)?

Yes, the dataset is provided with a pre-made train/test split (with ratios 0.8/0.2). The split is made such that no merchant appears in more than one set, so that no webpage template is present in both sets, in order to gauge an algorithm's ability to generalise.
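A merchant-disjoint split of this kind can be reproduced by partitioning merchants rather than pages. A minimal sketch, assuming pages are held as `(merchant_id, page_id)` tuples (an illustrative in-memory shape, not the dataset's on-disk format):

```python
import random

def merchant_disjoint_split(pages, train_ratio=0.8, seed=0):
    """Split page records so that no merchant (and hence no page
    template) appears in both the train and test sets."""
    merchants = sorted({m for m, _ in pages})
    rng = random.Random(seed)
    rng.shuffle(merchants)
    cut = int(len(merchants) * train_ratio)
    train_merchants = set(merchants[:cut])
    train = [p for p in pages if p[0] in train_merchants]
    test = [p for p in pages if p[0] not in train_merchants]
    return train, test
```

Note that the 0.8/0.2 ratio then applies to merchants, so the page-level proportions will only approximate 80/20 when merchants contribute different numbers of pages.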

Are there any errors, sources of noise, or redundancies in the dataset?

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

Yes. The dataset is made available via the Registry of Open Data on AWS.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?

No. The dataset contains only publicly available webpages.

Collection Process

How was the data associated with each instance acquired?

The data was directly observable.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

The sampling strategy was deterministic for merchants: the product pages were curated by the analysts labelling the pages, with a bias towards pages corresponding to products that did not require configuration (e.g., one-size-fits-all items, default size selection, etc.).

Over what timeframe was the data collected?

The data instances in the dataset were collected between 2018 and 2019.

Preprocessing

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

In the case of the MHTML dataset, no. For the WTL dataset, the element metadata additionally contains the local node text, extracted from the source HTML. The screenshots were manually reviewed, and instances of pages that did not render correctly (e.g., where overlays, menus, or cookie dialogs covered product information) were removed.