Open Kikobeats opened 3 years ago
Has this moved anywhere in the past few years? Or are you using add-ons like https://github.com/samirrayani/metascraper-shopping?
Very keen to know more about this.
Thanks for helping to build easier e-commerce data extraction.
Overall, the e-commerce sites I've tested that use ld+json tend to contain brand, product name, and SKU information in a consistent, predictable manner. Sites that opt for structured microdata without ld+json tend to be less consistent in how they represent brand information: some use an element with `itemprop="name|brand|sku"`, some nest it inside an `itemtype="http://schema.org/Thing"` element, and others use some yet-to-be-discovered pattern.

As of today, the critical e-commerce data I'm seeking includes product name, product brand, and product SKU. In the near future, I may need product pricing, variants, and accessories as defined in https://schema.org/Product.
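Since ld+json seems to be the predictable case, here is a minimal TypeScript sketch of how name, brand, and SKU could be pulled out of `Product` objects. Everything named here (`ProductInfo`, `extractProducts`, `collectProducts`, `readBrand`, `asString`) is hypothetical and not part of Microlink or metascraper. The regex-based `<script>` lookup is a simplification that assumes well-formed markup; a real scraper would likely use an HTML parser.

```typescript
// Illustrative shape for the fields discussed in this issue.
interface ProductInfo {
  name: string | undefined;
  brand: string | undefined;
  sku: string | undefined;
}

// Return the value only if it is a string.
function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// Brand may be a plain string or a nested object such as
// { "@type": "Brand", "name": "..." }. Other shapes are ignored in this sketch.
function readBrand(brand: unknown): string | undefined {
  if (brand !== null && typeof brand === "object") {
    return asString((brand as Record<string, unknown>)["name"]);
  }
  return asString(brand);
}

// "@type" may be a single string or an array of strings.
function isProduct(node: Record<string, unknown>): boolean {
  const type = node["@type"];
  return Array.isArray(type) ? type.includes("Product") : type === "Product";
}

// Walk a parsed JSON-LD value, descending into arrays and "@graph" containers.
function collectProducts(value: unknown, out: ProductInfo[]): void {
  if (Array.isArray(value)) {
    for (const item of value) collectProducts(item, out);
    return;
  }
  if (value === null || typeof value !== "object") return;
  const node = value as Record<string, unknown>;
  if (isProduct(node)) {
    out.push({
      name: asString(node["name"]),
      brand: readBrand(node["brand"]),
      sku: asString(node["sku"]),
    });
  }
  collectProducts(node["@graph"], out);
}

// Find every <script type="application/ld+json"> block in raw HTML and parse it.
function extractProducts(html: string): ProductInfo[] {
  const products: ProductInfo[] = [];
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      collectProducts(JSON.parse(match[1] ?? ""), products);
    } catch {
      // Skip blocks that are not valid JSON.
    }
  }
  return products;
}
```

Handling both the string and object forms of `brand`, plus `@graph` wrappers, is a guess at the variations most likely to show up; sites that emit `sku` as a number would need an extra case.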
Some data-gathering strategies I intend to use for products include:
* [x] Parse and return data from ld+json objects that use schema.org `@type: 'Product'`
* [ ] Come up with schema.org microdata parsing and fallback strategies to cover as many e-commerce sites as possible, since some websites do not structure their data consistently
* [ ] (feature request) Conditionally retry page parsing every second, up to 5 seconds, if no products can be found. This is because some e-commerce sites that use client-side rendering take a while to display ld+json or microdata
* [ ] (feature request) Have an option to parse page elements and return their [`innerText`](https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText) so that redundant inner HTML gets excluded
* [ ] Parse and return multiple products based on offers https://schema.org/offers
* [ ] Support [RDFa](https://www.w3.org/MarkUp/2009/rdfa-for-html-authors) parsing, though I have yet to come across a site that uses RDFa, so this could be a low priority
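To make the retry feature request concrete, here is a rough TypeScript sketch of "retry every second, up to 5 seconds, if no products can be found". The names `retryUntilFound` and `sleep` and the `attempt` callback are hypothetical; the callback stands in for whatever actually loads and parses the page, and nothing here is an existing Microlink option.

```typescript
// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Call `attempt` until it returns a non-empty array, waiting `intervalMs`
// between calls and giving up once the next wait would pass `maxWaitMs`.
async function retryUntilFound<T>(
  attempt: () => Promise<T[]>,
  intervalMs: number = 1000,
  maxWaitMs: number = 5000,
): Promise<T[]> {
  const deadline = Date.now() + maxWaitMs;
  for (;;) {
    const found = await attempt();
    if (found.length > 0) return found;
    // Stop if another wait would run past the deadline.
    if (Date.now() + intervalMs > deadline) return [];
    await sleep(intervalMs);
  }
}
```

Compared with a fixed wait, this returns as soon as products appear, so pages that render quickly are not penalized for the slow ones.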
Based on current Microlink features, I am able to extract product data using the `prerender` and `waitForTimeout` options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0

Product pages I have tested:
* https://www.walmart.com/ip/Miracle-Gro-Garden-Soil-Vegetables-and-Herbs-1-5-cu-ft/46928865?athcpid=46928865&athpgid=athenaHomepage&athcgid=dealspage-home-2524396&athznid=ItemCarouselType_BestInDeals&athieid=v1&athstid=CS020&athguid=466001f5-9a18a716-46880cef9f15260d&athancid=null&athena=true
* https://www.garnier.ca/en-ca/about-our-brands/hair-care/fructis/hair-treats/garnier-fructis-nourishing-treat-with-coconut-extract-400-ml
* https://www.kerastase.ca/en/collections/nutritive/3474636721832.html
* https://www.lorealparis.ca/en-ca/excellence-creme/excellence-creme-f-medium-brown
* https://www.staples.ca/products/2735027-en-brother-tn760-black-toner-cartridge-high-yield
* https://thelionchain.com/collections/exclusive-promotions/products/the-gold-edition-trap-set
* https://shop.3dtotal.com/anatomy-figure/3dtotal-anatomy-3-piece-set-of-animal-figures
* https://hellostella.myshopify.com/collections/rustic-stella/products/highland-fingering-posy
* https://www.toysrus.ca/en/Hot-Wheels-Sky-Crash-Tower-Track-Set/242C6973.html
* https://www.homedepot.com/p/RYOBI-18-Volt-ONE-Cordless-AirStrike-18-Gauge-Brad-Nailer-Tool-Only-with-Sample-Nails-P320/203810823?MERCH=REC-_-pnf-_-312306957-_-203810823-_-N&
* https://thewhiteelephantdesigns.com/collections/the-baby-shop/products/chicken-dress
https://github.com/zbicin/metascraper-shopping might have some of the goods that you are looking for.
The idea behind this issue is to determine what kind of data can be extracted and normalized across e-commerce URLs.
Examples of e-commerce sites
(not an exhaustive list; we need a lot more!)