ourcanadian / ocse-core

Core of OurCanadian Search Engine

Implement ProductSpider #3

Open rylancole opened 4 years ago

Path: ocse-core/coast_to_coast/coast_to_coast/spiders/product.py

This spider should take the URL of a product page and scrape all relevant information, including (but not limited to) the title, description, price, colour variants, size variants, and image sources. Ignore site-wide data such as headers and footers.

This spider should be site-agnostic: it should work on any website, whatever the structure of the page. Remember, the spider's only job is to pull as much information as it can, not to sanitize it. The data doesn't need to be well named or organized, just complete.
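
To make the idea concrete, here is a minimal sketch of structure-independent extraction. It uses only Python's standard-library `html.parser` so it runs on its own; it is not the project's actual spider. The names `ProductPageParser`, `extract_raw`, and `SKIP_TAGS` are hypothetical, and treating `<header>`, `<footer>`, and `<nav>` as the site-wide parts of a page is an assumption that will not hold for every site.

```python
from html.parser import HTMLParser

# Tags treated as site-wide chrome or non-content (an assumption; real sites vary).
SKIP_TAGS = {"header", "footer", "nav", "script", "style"}


class ProductPageParser(HTMLParser):
    """Collects raw, unsanitized data from a product page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # greater than 0 while inside a skipped tag
        self.in_title = False  # True while inside <title>
        self.data = {"title": [], "meta": {}, "images": [], "text": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("content"):
            # Keep every named meta tag; deciding which ones matter comes later.
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.data["meta"][key] = attrs["content"]
        elif tag == "img" and attrs.get("src") and self.skip_depth == 0:
            self.data["images"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.data["title"].append(text)
        elif self.skip_depth == 0:
            self.data["text"].append(text)


def extract_raw(html):
    """Parse an HTML string and return everything collected, unsanitized."""
    parser = ProductPageParser()
    parser.feed(html)
    parser.close()
    return parser.data
```

In a Scrapy spider, something like `extract_raw(response.text)` could be called from the `parse` callback, with the resulting dictionary yielded as the item. The output is deliberately loose: lists of text chunks and image sources rather than named fields.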

For example, the price must be in your data, but it doesn't matter whether it is named "Price", "Cost", or "CDN"; that will be dealt with later. It also doesn't matter if you find multiple prices on the page: include them all, and the sanitization process will determine what to do with them.
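
As a rough illustration of "include them all", the sketch below gathers every price-like string from a list of text chunks and keeps the surrounding text as context, without choosing between them. The function name `find_price_candidates`, the regular expression, and the set of currency markers (`$`, `C$`, `CAD`, `CDN`) are assumptions made for the example, not a definitive pattern.

```python
import re

# An amount such as "12", "12.99" or "1,299.00".
AMOUNT = r"\d+(?:,\d{3})*(?:\.\d{2})?"

# Currency marker before the amount ("$12.99", "CAD 15.00") or after it
# ("9.50 CDN"). The markers listed here are illustrative assumptions.
PRICE_RE = re.compile(
    rf"(?:CAD|CDN|C\$|\$)\s?{AMOUNT}|{AMOUNT}\s?(?:CAD|CDN)"
)


def find_price_candidates(text_chunks):
    """Return every price-like string in page order, without deduplicating."""
    candidates = []
    for chunk in text_chunks:
        for match in PRICE_RE.finditer(chunk):
            # Keep the whole chunk as context so sanitization can use labels
            # such as "Price" or "Was" later on.
            candidates.append({"raw": match.group(0), "context": chunk})
    return candidates
```

For instance, `find_price_candidates(["Price: $12.99", "Was CAD 15.00"])` yields two candidates, one per chunk, each carrying its original text so that a later step can tell a current price from a struck-through one.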

When tackling this issue, it may be best to break it up into multiple pull requests. Start simple: have the spider fetch prices from various websites and submit that code for review. Next, go after other items such as titles and descriptions. Keep testing each piece of data across websites with differing structures to make sure our solutions are robust.