parkeragee / content-scraper


🛑 HELP 🛑 #1

Open · beingabstrac opened 4 years ago

beingabstrac commented 4 years ago

Hey @parkeragee

I got stuck. How do I do the following:

  1. Open a webpage (you see a list of items)
  2. Click an item to go to its detail page
  3. Extract data from the detail page (name, description, images, etc.)
  4. Store all of this data as a markdown file (.md) with the item name as the file name
  5. Repeat the same for 300+ items

P.S. The class names are complex (weird names and numbers). How do I target them with XPath?

Thanks in advance!

parkeragee commented 4 years ago

@beingabstrac it sounds like your scraping needs are a little more complex. You can reference the generateMarkdown() method in this repo to help with the markdown generation. You'll just need to construct a JSON object with your data.
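
For example (a minimal sketch; the pluginData shape here is made up, but json2md's API really does take an array of objects keyed by element type):

const json2md = require('json2md');
const fs = require('fs');

// Hypothetical shape of the data you'd scrape for one plugin
const pluginData = {
    name: 'Unsplash',
    description: 'Insert beautiful images from Unsplash.',
    images: ['https://example.com/cover.png'],
};

// Each key (h1, p, img) maps to a markdown element
const markdown = json2md([
    { h1: pluginData.name },
    { p: pluginData.description },
    { img: pluginData.images.map(source => ({ title: pluginData.name, source })) },
]);

// Use the plugin name as the file name, like you described
fs.writeFileSync(`${pluginData.name}.md`, markdown);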

If you need to physically emulate clicks, then you should try using something like Puppeteer. There's also a library called x-ray that might help with scraping a separate page for each item. It handles pagination really well.
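
To give you a starting point, here's a rough Puppeteer sketch of the pattern I'd use: collect all of the detail page links from the listing page first, then visit each link directly instead of clicking in and navigating back. The selector is a placeholder you'd swap for the real one; since your class names are auto-generated, matching on a stable attribute like href (or using XPath via page.$x) tends to hold up better:

const puppeteer = require('puppeteer');

async function getItemLinks(listUrl) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(listUrl, { waitUntil: 'networkidle2' });

    // Placeholder selector: match on a stable attribute instead of
    // the auto-generated class names
    const links = await page.$$eval('a[href*="/plugin/"]', anchors =>
        anchors.map(a => a.href)
    );

    await browser.close();
    return links;
}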

beingabstrac commented 4 years ago

@parkeragee I did try Puppeteer. I was able to generate one page, but I wasn't able to go back to the root and repeat the process for n items. Here's the gist.

Will give x-ray a try.

So, this is the site I'm trying to scrape. There's a list of items; I click one item, go in, scrape the data, and save it as an .md file with the item name as the file name. Then do the same for n items.

parkeragee commented 4 years ago

I haven't checked to see if this works, so it might require some tweaking. But here's how I would try approaching it.

/**
 * The initial scrape of the directory page to get all
 * the plugins in one list
 * @return {Array} The array of plugins with their data
 */
async function scrapePluginDirectory() {
    const pluginList = [];

    // Do the scraping here and push each plugin into the array
    // as an object with the data you need. Example of what we would return:
    // [{ name: 'Unsplash', link: 'https://www.figma.com/community/plugin/738454987945972471' }]

    return pluginList;
}

/**
 * Takes our individual plugin link and scrapes it
 * @param {String} link The plugin page link
 * @return {Object} Your plugin data needed for the markdown page
 */
async function scrapePluginHtml(link) {
    // Scrape your individual plugin page here
    // and return your data needed for the markdown file.
}

/**
 * Takes our individual plugin data that we scraped and generates a markdown file
 * @param {Object} pluginData The plugin data
 * @return {void}
 */
async function createMarkDownFile(pluginData) {
    // Take our data, generate a markdown file with `json2md`
}

async function getPluginData(plugin) {
    const pluginData = await scrapePluginHtml(plugin.link);
    await createMarkDownFile(pluginData);
}

/**
 * Takes the plugin list and loops over it
 * to scrape each item and generate the markdown file
 * @param {Array} pluginList The plugin list we scraped one step before
 * @return {Promise<void>}
 */
async function scrapeAndMakeMarkdown(pluginList) {
    // .map() with an async function returns an array of promises,
    // so wrap it in Promise.all() to wait for all of them
    return Promise.all(pluginList.map(getPluginData));
}

async function go() {
    const pluginList = await scrapePluginDirectory();
    const result = await scrapeAndMakeMarkdown(pluginList);
    return result;
}

go();
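
One caveat on the sketch above: Promise.all kicks off every plugin scrape at once, which could hammer the site with 300+ items. If that becomes a problem, a for...of loop (a small variation, not part of this repo) runs them one at a time:

async function scrapeAndMakeMarkdown(pluginList) {
    // Sequential version: wait for each plugin to finish
    // before starting the next one
    for (const plugin of pluginList) {
        await getPluginData(plugin);
    }
}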