mddanishyusuf / mohddanish

Hey 👋, I'm Mohd Danish and Founder of @nocodeapi

Top 5 Open Source Libraries to Scrape Website Data #5

Open mddanishyusuf opened 5 years ago


Scraping website data is like a magic trick that lets you extract web data without having to copy and paste. It can all be done with a few lines of code if you know basic Python syntax.

Large companies use this technology to grow their business. Even Google scrapes websites to analyze their content and rank pages by relevance to your search. Web scraping has many use cases, including research, e-commerce, price comparison, market analysis, and lead collection. Whatever problem you’re trying to solve, these 5 open source libraries will help you scrape website data.

I build a lot of scrapers to get data and I use different Python open source libraries. Based on my personal experiences, I have put together a good collection of open source libraries that I think will help you as you scrape data.

  1. Scrapy
  2. Newspaper
  3. Portia
  4. You-Get
  5. RoboBrowser

Scrapy

Scrapy is an open source framework that you can install locally with pip, the package installer for Python. If you don’t want the hassle of setting up a server, there is also Scrapy Cloud, a hosted service for running your crawlers. Scrapy ships with a command-line tool to build and run scrapers.

Features:

  1. Selecting and extracting data from HTML/XML with XPath and CSS selectors
  2. Feed export in JSON, CSV, and XML
  3. Command-line interface
  4. Encoding support and auto-detection
  5. Cookies and session handling
  6. User agent spoofing
  7. Media Pipeline to automatically download images from content
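As a rough sketch of the command-line and feed export features listed above, the snippet below builds a `scrapy runspider` command from Python. The file names `quotes_spider.py` and `quotes.json` are hypothetical, and the helper functions are my own illustration rather than part of Scrapy. It assumes Scrapy is installed and that the `scrapy` executable is on your PATH.

```python
import subprocess


def runspider_command(spider_file, feed_path):
    """Build a `scrapy runspider` command that exports scraped items.

    With -o, Scrapy infers the feed format (JSON, CSV, XML) from the
    extension of feed_path.
    """
    return ["scrapy", "runspider", spider_file, "-o", feed_path]


def run_spider(spider_file, feed_path):
    """Run the spider and return the exit code (requires Scrapy installed)."""
    return subprocess.run(runspider_command(spider_file, feed_path)).returncode


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command.
    print(" ".join(runspider_command("quotes_spider.py", "quotes.json")))
```

Swapping the output name to `quotes.csv` or `quotes.xml` should be enough to change the export format.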

Newspaper

This library is built specifically to scrape articles from blogs and news websites. It lets you extract an article’s author, publication date, content, and featured images.

Features:

  1. Multi-threaded article download framework
  2. News URL identification
  3. Text extraction from HTML
  4. Top image extraction from HTML
  5. All image extraction from HTML
  6. Keyword extraction from text
  7. Summary extraction from text
  8. Author extraction from text
  9. Google trending terms extraction
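Here is a minimal sketch of how extracting those fields might look. It assumes the `newspaper3k` package is installed, and the URL you pass in is up to you. The helper names `fetch_article` and `article_fields` are my own, not part of the library. The import is done lazily inside the function so the file can still be loaded on a machine without the package.

```python
def fetch_article(url):
    """Download and parse one article (requires `pip install newspaper3k`)."""
    from newspaper import Article  # third-party, imported lazily

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, text, images
    return article


def article_fields(article):
    """Copy the fields Newspaper extracted into a plain dict."""
    return {
        "title": article.title,
        "authors": list(article.authors),
        "published": article.publish_date,
        "top_image": article.top_image,
        "text": article.text,
    }
```

Keywords and summaries are not filled in by `parse()` alone. As far as I know you need to call `article.nlp()` first, which in turn needs NLTK data downloaded.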

Portia

Portia is a no-code scraping tool, which means you can scrape website data visually. If you don’t have any programming knowledge and you want to understand what scraping is, this might be the best option for you. With Portia you annotate a web page to identify the data you wish to extract, and Portia then works out how to scrape data from similar pages using those annotations. You can set it up with a single command using the official Portia Docker image.

Features:

  1. Visual Interface for scraping
  2. Selecting and extracting data from HTML/XML with XPath and CSS selectors
  3. Real time view of extracted data in application
  4. Scraping multiple items from a single page
  5. Crawling paginated listings

You-Get

You-Get is an open source command-line tool that lets you scrape media (video, images, and audio files) from websites like YouTube, SoundCloud, and Tumblr. The project maintains a full list of supported sites. It is a handy tool to install if you want to download files from the internet to your local machine from the command line.
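If you want to drive You-Get from a Python script, one option is to build the command and hand it to `subprocess`. This is only a sketch: the helper names are my own, and it assumes the `you-get` executable is installed and on your PATH.

```python
import subprocess


def you_get_command(url, output_dir=None, info_only=False):
    """Build a you-get command line for the given media URL."""
    cmd = ["you-get"]
    if info_only:
        cmd.append("--info")  # show available formats instead of downloading
    if output_dir:
        cmd += ["--output-dir", output_dir]  # where to save the files
    cmd.append(url)
    return cmd


def download(url, output_dir="."):
    """Run you-get and return its exit code (requires you-get installed)."""
    return subprocess.run(you_get_command(url, output_dir=output_dir)).returncode
```

Running with `info_only=True` first is a cheap way to check whether a URL is supported before downloading anything.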

RoboBrowser

If a website doesn’t have an API and you want to extract data from it without logging in manually, RoboBrowser is the open source project for you.

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. It can fetch a page, click links and buttons, and fill out and submit forms. If you need to interact with web services that don’t have APIs, RoboBrowser can help.
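A scripted login might look roughly like the sketch below. The form field names `username` and `password` are assumptions about the target site, and the helper names are my own. RoboBrowser has not had a release in a long time, so check that it still installs and imports in your environment before relying on it.

```python
def make_browser():
    """Create a browser instance (requires `pip install robobrowser`)."""
    from robobrowser import RoboBrowser  # third-party, imported lazily

    return RoboBrowser(history=True, parser="html.parser")


def log_in(browser, login_url, username, password, form_id=None):
    """Open the login page, fill in the form, and submit it.

    The field names below are hypothetical; inspect the site's HTML
    to find the real ones.
    """
    browser.open(login_url)
    form = browser.get_form(id=form_id)  # first form on the page if id is None
    if form is None:
        raise ValueError("no form found on " + login_url)
    form["username"].value = username
    form["password"].value = password
    browser.submit_form(form)
    return browser
```

After `log_in` returns, the same browser object keeps the session, so later `browser.open(...)` calls can reach pages behind the login.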

If you have any questions or need help collecting data for your business, send us an email! Happy scraping!