microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License
2.31k stars 163 forks

[RFC] Metascraper for e-commerce #412

Open Kikobeats opened 3 years ago

Kikobeats commented 3 years ago

The idea behind this issue is to determine what kind of data can be extracted and normalized across e-commerce URLs.

Examples of e-commerce sites

(not an exhaustive list; we need a lot more!)

theetrain commented 3 years ago

Thanks for helping to build easier e-commerce data extraction.

Overall, the e-commerce sites I've tested that use ld+json tend to contain brand, product name, and SKU information consistently and in a predictable manner. Sites that opt for structured microdata without ld+json tend to be more inconsistent in how they represent brand information: some use an element with itemprop="name|brand|sku", some nest it inside an itemtype="http://schema.org/Thing" element, and others follow some yet-to-be-discovered pattern.

As of today, the critical e-commerce data I'm seeking includes product name, product brand, and product SKU. In the near future, I may need product pricing, variants, and accessories as defined in https://schema.org/Product.

Some data-gathering strategies I intend to use for products include:

Based on current Microlink features, I am able to extract product data using the prerender and waitForTimeout options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0

Product pages I have tested:

adentranter commented 7 months ago

Has this moved anywhere in the past few years? Or are you using add-ons like https://github.com/samirrayani/metascraper-shopping?

Very keen to know more about this.

adentranter commented 7 months ago

Thanks for helping to build easier e-commerce data extraction.

Overall, the e-commerce sites I've tested that use ld+json tend to contain brand, product name, and SKU information consistently and in a predictable manner. Sites that opt for structured microdata without ld+json tend to be more inconsistent in how they represent brand information: some use an element with itemprop="name|brand|sku", some nest it inside an itemtype="http://schema.org/Thing" element, and others follow some yet-to-be-discovered pattern.

As of today, the critical e-commerce data I'm seeking includes product name, product brand, and product SKU. In the near future, I may need product pricing, variants, and accessories as defined in https://schema.org/Product.

Some data-gathering strategies I intend to use for products include:

* [x]  parse and return data from ld+json objects that use schema.org `@type: 'Product'`

* [ ]  Come up with schema.org microdata parsing and fallback strategies to cover as many e-commerce sites as possible, since some websites do not structure their data consistently

* [ ]  (feature request) conditionally retry page parsing every second, up to 5 seconds, if no products can be found. This is needed because some e-commerce sites that use client-side rendering take a while to display ld+json or microdata

* [ ]  (feature request) have an option to parse page elements and return their [`innerText`](https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText) so that redundant inner HTML gets excluded

* [ ]  parse and return multiple products based on [offers](https://schema.org/offers)

* [ ]  Support [RDFa](https://www.w3.org/MarkUp/2009/rdfa-for-html-authors) parsing. I have yet to come across a site that uses RDFa, so this could be a low priority
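The first strategy in the list (reading ld+json objects with `@type: 'Product'`) could look roughly like the sketch below. It is only an illustration under a few assumptions: the page HTML is already available as a string, a regular expression is good enough to locate the `<script type="application/ld+json">` tags (a real implementation would more likely use an HTML parser), and all function names (`extractJsonLd`, `flattenNodes`, `findProducts`) and the sample data are made up here and are not part of metascraper.

```javascript
// Hypothetical helpers; none of these names come from metascraper.

// Collect every JSON-LD payload embedded in an HTML string.
function extractJsonLd (html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const blocks = []
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]))
    } catch (_) {
      // Skip malformed JSON-LD instead of failing the whole page.
    }
  }
  return blocks
}

// Flatten top-level arrays and `@graph` containers into a flat list of nodes.
function flattenNodes (value) {
  if (Array.isArray(value)) return value.flatMap(flattenNodes)
  if (value === null || typeof value !== 'object') return []
  const nested = value['@graph'] ? flattenNodes(value['@graph']) : []
  return [value, ...nested]
}

// `@type` may be a single string or an array of strings.
function isProduct (node) {
  return [].concat(node['@type'] || []).includes('Product')
}

// `brand` may be a plain string or a nested object with a `name`.
function brandName (brand) {
  if (typeof brand === 'string') return brand
  if (brand && typeof brand === 'object' && typeof brand.name === 'string') return brand.name
  return undefined
}

// Return name, brand, and sku for every Product node found in the page.
function findProducts (html) {
  return extractJsonLd(html)
    .flatMap(flattenNodes)
    .filter(isProduct)
    .map(node => ({ name: node.name, brand: brandName(node.brand), sku: node.sku }))
}

// Made-up sample page, for illustration only.
const sampleHtml = `
<html><head>
<script type="application/ld+json">
{ "@context": "https://schema.org", "@type": "Product", "name": "Example Toner",
  "sku": "TN-1", "brand": { "@type": "Brand", "name": "ExampleCo" } }
</script>
</head></html>`

console.log(findProducts(sampleHtml))
```

Handling both string and object forms of `brand`, and both plain and `@graph`-wrapped documents, is a guess at the kinds of variation a fallback strategy would need to absorb; microdata and RDFa are not covered by this sketch.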

Based on current Microlink features, I am able to extract product data using the prerender and waitForTimeout options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0
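As an alternative to a fixed wait, the retry feature request from the list above (re-check every second, for up to 5 seconds, while no products are found) might be sketched like this. The helper name `retryUntilFound` and its options are hypothetical, and `attempt` stands for whatever caller-supplied function re-reads the rendered page and returns an array of products; nothing here is an existing Microlink or metascraper option.

```javascript
// Hypothetical helper: re-run `attempt` until it yields a non-empty array
// or the time budget runs out. Not an existing Microlink/metascraper API.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function retryUntilFound (attempt, { intervalMs = 1000, timeoutMs = 5000 } = {}) {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const products = await attempt()
    if (products.length > 0) return products
    // Give up when another wait would overshoot the time budget.
    if (Date.now() + intervalMs > deadline) return []
    await sleep(intervalMs)
  }
}
```

Returning an empty array on timeout, rather than throwing, keeps "no products found" an ordinary result that the caller can handle with its own fallback.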

Product pages I have tested:

* https://www.walmart.com/ip/Miracle-Gro-Garden-Soil-Vegetables-and-Herbs-1-5-cu-ft/46928865?athcpid=46928865&athpgid=athenaHomepage&athcgid=dealspage-home-2524396&athznid=ItemCarouselType_BestInDeals&athieid=v1&athstid=CS020&athguid=466001f5-9a18a716-46880cef9f15260d&athancid=null&athena=true

* https://www.garnier.ca/en-ca/about-our-brands/hair-care/fructis/hair-treats/garnier-fructis-nourishing-treat-with-coconut-extract-400-ml

* https://www.kerastase.ca/en/collections/nutritive/3474636721832.html

* https://www.lorealparis.ca/en-ca/excellence-creme/excellence-creme-f-medium-brown

* https://www.staples.ca/products/2735027-en-brother-tn760-black-toner-cartridge-high-yield

* https://thelionchain.com/collections/exclusive-promotions/products/the-gold-edition-trap-set

* https://shop.3dtotal.com/anatomy-figure/3dtotal-anatomy-3-piece-set-of-animal-figures

* https://hellostella.myshopify.com/collections/rustic-stella/products/highland-fingering-posy

* https://www.toysrus.ca/en/Hot-Wheels-Sky-Crash-Tower-Track-Set/242C6973.html

* https://www.homedepot.com/p/RYOBI-18-Volt-ONE-Cordless-AirStrike-18-Gauge-Brad-Nailer-Tool-Only-with-Sample-Nails-P320/203810823?MERCH=REC-_-pnf-_-312306957-_-203810823-_-N&

* https://thewhiteelephantdesigns.com/collections/the-baby-shop/products/chicken-dress

https://github.com/zbicin/metascraper-shopping might have some of the goods that you are looking for.

Kikobeats commented 4 months ago

A lot more: https://nrf.com/research-insights/top-retailers/top-100-retailers/top-100-retailers-2023-list