Closed philippta closed 11 months ago
@philippta this seems interesting
when we support multiple urls, does it mean we will scrape all of them? which means... the output in Example script will be something like
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/",
    "data": {
      "title": "Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - A standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  },
  {
    "url": "https://www.hackerrank.com/",
    "data": {
      "title": "Hacker Rank",
      ...
    }
  }
]
CMIIW, to resolve this, changes will happen in several places (at least):
- In how we parse the js file (from `url` to `[]url`)
- In how we scrape the web (from scraping a url to scraping many urls)
@rafiramadhana
> when we support multiple urls, does it mean we will scrape all of them? which means... the output in Example script will be something like
Yes, this is correct.
> - In how we parse the js file (from `url` to `[]url`)
When Flyscrape initializes, it transforms the config object into a plain JSON string. Modules (in this case the `starturl` module) can then define what values to extract from this JSON.
https://github.com/philippta/flyscrape/blob/190056ee8d6a4eca61d92a79cc25aad645e69d4a/modules/starturl/starturl.go#L15-L17
So to parse the new `urls` config option, you can define a new field with the correct JSON struct tag.
> - In how we scrape the web (from scraping a url to scraping many urls)
Flyscrape can follow links and therefore scrape many urls, so it needs no change in the underlying scraping functionality.
During initialization, the `starturl` module adds urls to the scraping queue using the `ctx.Visit` function. This can be extended to loop over the new URLs field and add those as well.
https://github.com/philippta/flyscrape/blob/190056ee8d6a4eca61d92a79cc25aad645e69d4a/modules/starturl/starturl.go#L26-L31
I updated the proposal slightly to support both config options url and urls, instead of merging them into a single option.
ok
> Flyscrape can follow links and therefore scrape many urls, so it needs no change in the underlying scraping functionality.
i see
> This can be extended to loop over the new URLs field and add those as well.
agree, i was thinking similar to this when looking at the code
@philippta i think i can work on this
first, i'm going to spend 1-2 days familiarizing myself with flyscrape
then, if all is well, i will submit a PR
wdyt?
@rafiramadhana Sounds great. Please go ahead!
Multiple starting URLs should be supported by passing an array as the `urls` (plural) config option. Providing a single URL as a string in the `url` (singular) config option should continue to work for visual consistency. If both `url` and `urls` are provided, all URLs should be added to the scraping queue.

Ref: