Open loditzz opened 7 months ago
How to build your first web scraper from scratch
In this article you are going to learn how to build your first web scraper using Node.js. We are going to use the puppeteer library to navigate through the page and extract the info we need.
The first thing to do is start a new Node.js project. If you are not familiar with Node.js projects and have never made one, I recommend you learn how to install Node from this article before we get started. You will need to set up Node.js on your machine first.
First: start a new Node.js project
npm init -y
Second: install the puppeteer package
npm install puppeteer
Third: let's start our project by creating a new file; let's call it index.js.
We are going to start by importing puppeteer.
const puppeteer = require("puppeteer");
Then we are going to launch the browser. This will open a new browser for puppeteer to work on.
const browser = await puppeteer.launch({
  headless: true, // change to false to *see* the browser
});
We then initialize a new page, which can be seen as a new tab in our browser:
const page = await browser.newPage();
Now we are going to navigate to the page we wish to scrape. In this example we are going to navigate to the Portuguese Wikipedia's page on data scraping.
await page.goto("https://pt.wikipedia.org/wiki/Raspagem_de_dados");
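The goto call above resolves with the page's main HTTP response, which can be checked before scraping. Here is a minimal sketch, assuming you want to stop early on an error status. The helper name openPage is hypothetical; goto, ok and status are the puppeteer calls it relies on.

```javascript
// Hypothetical helper: navigates `page` to `url` and returns the HTTP status.
// Throws if the server answered with a status outside the 200-299 range.
async function openPage(page, url) {
  const response = await page.goto(url);
  // goto can resolve with null (for example on a same-page anchor change),
  // in which case there is no status to inspect.
  if (response === null) {
    return null;
  }
  if (!response.ok()) {
    throw new Error(`Request to ${url} failed with status ${response.status()}`);
  }
  return response.status();
}
```

Inside the scraper this would replace the bare goto call, for example with await openPage(page, "https://pt.wikipedia.org/wiki/Raspagem_de_dados").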
Now we are going to wait until the desired HTML selector is available.
This is important because, when we go to a page, it can take a while until all the JavaScript and HTML on the page is loaded. Waiting until the selector is loaded avoids errors when trying to scrape a page that is not fully loaded yet.
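textSelector is the CSS selector of the element we want to read. The value below is only an assumed example (the first paragraph inside Wikipedia's #mw-content-text container); replace it with the selector for the content you want.
const textSelector = "#mw-content-text p"; // assumed example selector
await page.waitForSelector(textSelector);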
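If the selector never shows up, waitForSelector rejects once its timeout runs out (30 seconds by default). Here is a minimal sketch of handling that case, assuming you would rather get a true/false answer than an exception. The helper name selectorAppeared and the 10 second default are hypothetical choices, and the check assumes the error puppeteer throws on timeout is named TimeoutError.

```javascript
// Hypothetical helper: resolves to true if `selector` appeared in time,
// and to false if puppeteer gave up waiting for it.
async function selectorAppeared(page, selector, timeoutMs = 10000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (error) {
    // Assumption: timeouts surface as an error whose name is "TimeoutError".
    if (error.name === "TimeoutError") {
      return false;
    }
    // Anything else (closed page, invalid selector, ...) is a real failure.
    throw error;
  }
}
```

This lets the scraper log a clear message and skip the page instead of crashing when the expected element is missing.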
We are now going to use a puppeteer feature called evaluate. It lets us run code in the page, exactly as you could in the browser's console:
const pageText = await page.evaluate((selector) => {
  // return the text from the article
  return document.querySelector(selector).textContent.trim();
}, textSelector);
If you have never played around with the browser's console and/or are not familiar with HTML and JavaScript for the frontend, I suggest you read this and start learning about HTML selectors, JavaScript functions and how to use them in the browser's console.
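One thing to watch for with evaluate: document.querySelector returns null when nothing matches, and reading textContent from null throws inside the page. Here is a minimal sketch of a guarded version. The helper name readText is hypothetical, and returning null for "not found" is just one possible convention.

```javascript
// Hypothetical helper: returns the trimmed text of the first element that
// matches `selector`, or null when the page has no such element.
async function readText(page, selector) {
  return page.evaluate((sel) => {
    // This callback runs inside the page, so `document` is the page's DOM.
    const element = document.querySelector(sel);
    return element ? element.textContent.trim() : null;
  }, selector);
}
```

The selector is passed as the second argument to evaluate because the callback runs in the browser and cannot see variables from the Node.js side.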
Now our pageText variable contains the text. We can now log it and return at the end of our file:
console.log({ pageText });
But before returning, you should close the browser to correctly finish your function.
await browser.close();
return "All done"
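The snippets above use await and return, so they need to live inside an async function. Here is a minimal sketch of how the pieces might fit together. The names scrapeText and puppeteerLib are hypothetical. The library is passed in as a parameter only so the sketch can run on its own, and the try/finally is an addition so the browser is closed even when a step fails.

```javascript
// Hypothetical wrapper: `puppeteerLib` is whatever require("puppeteer") returns.
async function scrapeText(puppeteerLib, url, selector) {
  const browser = await puppeteerLib.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector(selector);
    // Runs in the page and sends the trimmed text back to Node.js.
    return await page.evaluate(
      (sel) => document.querySelector(sel).textContent.trim(),
      selector
    );
  } finally {
    // Runs on success and on failure, so the browser never stays open.
    await browser.close();
  }
}
```

In index.js this could be called as scrapeText(require("puppeteer"), "https://pt.wikipedia.org/wiki/Raspagem_de_dados", yourSelector).then(console.log), where yourSelector is whichever selector you picked.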
What the article needs to have: