klarna / product-page-dataset

Price extraction from HTML pages #19

Open smartmimo opened 2 years ago

smartmimo commented 2 years ago

Hello, I'm trying to parse the data from the HTML pages. I can extract everything except the price, which is giving me trouble.

Sometimes the Price element contains more than just the actual price. It may also contain other text, for example the price before a discount, a number such as "50% discount", or per-quantity prices like: 20e / 100ml - 30e / 200ml. In general, other numbers can appear next to the actual price.

An example of this can be found in this entry: ./data/test/AT/www.mainzoo.de/8679/source.mhtml

Not that I think it's relevant, but I'm using Node.js (cheerio) to do the parsing. Here's my price extraction function:

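// Assumed definition: the original snippet uses floatRegex without showing it.
// This pattern matches the first integer or decimal number in the string.
const floatRegex = /\d+(?:\.\d+)?/;

function extractPriceFromString(priceString) {
    if (!priceString) return false;

    // Treat a decimal comma as a decimal point, e.g. "19,99" -> "19.99".
    priceString = priceString.replaceAll(",", ".");
    const match = priceString.match(floatRegex);
    if (!match || !match[0]) return false;

    const price = parseFloat(match[0]);
    if (isNaN(price)) return false;

    return price;
}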

const priceText = $('[klarna-ai-label = "Price"]').text();
const candidates = priceText
    .split(/\s/g)
    .filter(e => !e.startsWith("="))
    .map(extractPriceFromString)
    .filter(e => e);
console.log([...new Set(candidates)]);

It reads the text of the [klarna-ai-label = "Price"] element and splits it on whitespace. It then drops the tokens that start with =, since they are encoded characters left over from the MHTML format (they look like quoted-printable escapes rather than real content), and runs extractPriceFromString on each remaining token. The result is a deduplicated array of all the numbers (possible prices) found in the element.
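One possible refinement of this candidate list, given only as a rough sketch: look at what follows each number and skip the ones that are clearly not prices, such as percentages and quantities. The helper name extractPriceCandidates, the regex, and the unit list below are all hypothetical and chosen for illustration. The sketch does not handle thousands separators (for example "1.299,00"), and it still cannot tell a pre-discount price from the current one.

```javascript
// Hypothetical helper, not part of the dataset or of cheerio.
// Captures a number plus an optional suffix ("%" or a run of letters).
const NUMBER_WITH_SUFFIX = /(\d+(?:[.,]\d+)?)\s*(%|[a-zA-Z]+)?/g;

// Illustrative list of quantity units; extend it for your own data.
const QUANTITY_UNITS = new Set(["ml", "l", "g", "kg", "mm", "cm", "m"]);

function extractPriceCandidates(text) {
  const candidates = [];
  for (const match of text.matchAll(NUMBER_WITH_SUFFIX)) {
    const [, rawNumber, suffix = ""] = match;
    const lowered = suffix.toLowerCase();
    if (lowered === "%") continue;             // discount percentage
    if (QUANTITY_UNITS.has(lowered)) continue; // quantity such as "100ml"
    const value = parseFloat(rawNumber.replace(",", "."));
    if (!Number.isNaN(value)) candidates.push(value);
  }
  // Deduplicate while keeping the order of first appearance.
  return [...new Set(candidates)];
}

console.log(extractPriceCandidates("20e / 100ml - 30e / 200ml"));
console.log(extractPriceCandidates("50% discount 19,99 € 39,99 €"));
```

With the two made-up strings above, the quantities and the percentage are skipped and only 20, 30 and 19.99, 39.99 remain as candidates. Choosing the real price among them is still an open question.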

Normally, one would think a simple const price = extractPriceFromString($('[klarna-ai-label = "Price"]').text()) would suffice, but that's not the case here.
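To illustrate why the single call falls short, here is a small self-contained sketch. It assumes the number pattern matches the first integer or decimal in the string; the function name firstPrice and the sample text are made up for the example.

```javascript
// Assumed pattern: the first integer or decimal number in the string.
const firstNumber = /\d+(?:\.\d+)?/;

// Hypothetical stand-in for calling the extraction function once on the whole text.
function firstPrice(text) {
  const match = text.replaceAll(",", ".").match(firstNumber);
  return match ? parseFloat(match[0]) : null;
}

// Made-up element text: old price, current price, discount.
console.log(firstPrice("39,99 € 19,99 € -50%")); // prints 39.99
```

Only the first number is returned, which in this sample is the pre-discount price and not the one the customer actually pays.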

To sum up, I'm asking whether anybody else has the same problem, and whether you have an idea how to get the real price from the pages that have this inconvenience.

stefanmagureanu commented 2 years ago

Hi! That's a very good question! Unfortunately, reliably extracting the price and currency from the price element's string is a hard problem in itself and goes beyond the scope of our work on this dataset, so I can't really offer any insights. Would love to hear if workable solutions exist for this problem! Best wishes, Stefan