Multiple starting URLs - GitHub Issues

philippta commented 11 months ago

Multiple starting URLs should be supported by passing an array as the urls (plural) config option. Providing a single URL as a string in the url (singular) config option should continue to work for visual consistency.

If both url and urls are provided, all URLs should be added to the scraping queue.

// multiple urls
export const config = {
  urls: [
    "http://foo.com",
    "http://bar.com",
  ],
};

// single url
export const config = {
  url: "http://foo.com",
};

// both
export const config = {
  url: "http://foo.com",
  urls: [
    "http://bar.com",
    "http://baz.com",
  ],
};

Ref:

flyscrape/modules/starturl/starturl.go

rafiramadhana commented 11 months ago

@philippta this seems interesting

when we support multiple urls, does it mean we will scrape all of them? which means... the output in Example script will be something like

$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  },
  {
    "url": "https://www.hackerrank.com/",
    "data": {
      "title": "Hacker Rank",
      "posts": [
        ...
      ]
    }
  }
]

rafiramadhana commented 11 months ago

CMIIW, to resolve this, changes will happen in several places (at least):

In how we parse the js file (from url to []url)
In how we scrape the web (from scraping a url to scraping many urls)

philippta commented 11 months ago

I updated the proposal slightly to support both config options url and urls, instead of merging them into a single option.

@rafiramadhana

> when we support multiple urls, does it mean we will scrape all of them? which means... the output in Example script will be something like

Yes, this is correct.

> In how we parse the js file (from url to []url)

When Flyscrape initializes, it transforms the config object into a plain JSON string. Modules (in this case the starturl module) can then define what values to extract from this JSON. https://github.com/philippta/flyscrape/blob/190056ee8d6a4eca61d92a79cc25aad645e69d4a/modules/starturl/starturl.go#L15-L17

So to parse the new urls config option, you can define a new field with the correct JSON struct tag.
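A minimal sketch of what that could look like, assuming the config arrives as a JSON string. The type name startURLConfig and the helper parseConfig are hypothetical and only for illustration; the actual field layout in starturl.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// startURLConfig is a hypothetical stand-in for the fields the starturl
// module extracts from the JSON form of the config object.
type startURLConfig struct {
	URL  string   `json:"url"`  // existing singular option
	URLs []string `json:"urls"` // new plural option
}

// parseConfig decodes the JSON string into the struct. Keys that are
// absent from the JSON leave the corresponding field at its zero value.
func parseConfig(raw string) (startURLConfig, error) {
	var cfg startURLConfig
	err := json.Unmarshal([]byte(raw), &cfg)
	return cfg, err
}

func main() {
	raw := `{"url":"http://foo.com","urls":["http://bar.com","http://baz.com"]}`
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.URL, cfg.URLs) // prints: http://foo.com [http://bar.com http://baz.com]
}
```

Because missing keys simply leave a field empty, a script that sets only url or only urls would decode without any special handling.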

> In how we scrape the web (from scraping a url to scraping many urls)

Flyscrape can follow links and therefore scrape many urls, so it needs no change in the underlying scraping functionality. During initialization, the starturl module adds urls to the scraping queue using the ctx.Visit function. This can be extended to loop over the new URLs field and add those as well. https://github.com/philippta/flyscrape/blob/190056ee8d6a4eca61d92a79cc25aad645e69d4a/modules/starturl/starturl.go#L26-L31
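A rough sketch of that loop, under stated assumptions: the visitor interface, the startURLs type, and the recorder are hypothetical stand-ins so the example runs on its own, and the real context type and its Visit signature in Flyscrape may differ.

```go
package main

import "fmt"

// visitor is a hypothetical stand-in for the context that exposes Visit.
type visitor interface {
	Visit(url string)
}

// startURLs holds both config options (hypothetical type name).
type startURLs struct {
	URL  string
	URLs []string
}

// enqueue adds the single url first (if set), then every entry of urls,
// so that providing both options queues all of them.
func (m startURLs) enqueue(ctx visitor) {
	if m.URL != "" {
		ctx.Visit(m.URL)
	}
	for _, u := range m.URLs {
		ctx.Visit(u)
	}
}

// recorder collects visited URLs in order, standing in for the scraping queue.
type recorder struct {
	queue []string
}

func (r *recorder) Visit(url string) {
	r.queue = append(r.queue, url)
}

func main() {
	r := &recorder{}
	m := startURLs{URL: "http://foo.com", URLs: []string{"http://bar.com", "http://baz.com"}}
	m.enqueue(r)
	fmt.Println(r.queue) // prints: [http://foo.com http://bar.com http://baz.com]
}
```

This sketch does not deduplicate, so a URL listed in both url and urls would be passed to Visit twice; whether that matters depends on how the queue treats repeated URLs.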

rafiramadhana commented 11 months ago

> I updated the proposal slightly to support both config options url and urls, instead of merging them into a single option.

ok

> Flyscrape can follow links and therefore scrape many urls, so it needs no change in the underlying scraping functionality.

i see

> This can be extended to loop over the new URLs field and add those as well.

agree, i was thinking similar to this when looking at the code

rafiramadhana commented 11 months ago

@philippta i think i can work on this

first, im going to spend 1-2 days familiarizing myself with flyscrape

then, if all is well, i will submit a PR

wdyt?

philippta commented 11 months ago

@rafiramadhana Sounds great. Please go ahead!

philippta / flyscrape

Multiple starting URLs #8