philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0

URL pagination #38

Closed dynabler closed 7 months ago

dynabler commented 7 months ago

I have a suggestion for enhancement: URL pagination.

Most websites have list-pages with a layout as follows:

https://example.com/products/shoes/page/1
https://example.com/products/shoes/page/2
https://example.com/products/shoes/page/3

Or

https://example.com/products/shoes/?searchquery=color&page=1
https://example.com/products/shoes/?searchquery=color&page=2
https://example.com/products/shoes/?searchquery=color&page=3

Or

https://example.com/products/shoes/page1.html
https://example.com/products/shoes/page2.html
https://example.com/products/shoes/page3.html

My suggestion is to allow making the numbers a counter like this:

https://example.com/products/shoes/page/[1-3]
https://example.com/products/shoes/?searchquery=color&page=[1-3]
https://example.com/products/shoes/page[1-3].html*

I haven't seen the last one with .html as a counter, so not sure if it can be done.
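
To make the idea concrete, here is a minimal sketch of how such a counter could be expanded into a list of URLs. `expandCounter` is a hypothetical name used only for illustration; it is not part of flyscrape, and it only handles the first `[from-to]` counter in a URL.

```javascript
// Hypothetical helper (not a flyscrape API): expands the first "[from-to]"
// counter found in a URL into one URL per number in that range.
function expandCounter(url) {
  const match = url.match(/\[(\d+)-(\d+)\]/);
  if (!match) return [url]; // no counter present: return the URL unchanged

  const from = Number(match[1]);
  const to = Number(match[2]);
  const urls = [];
  for (let n = from; n <= to; n++) {
    // match[0] is the literal "[from-to]" text; swap it for the current number
    urls.push(url.replace(match[0], String(n)));
  }
  return urls;
}

console.log(expandCounter("https://example.com/products/shoes/page[1-3].html"));
```

Because the counter is matched anywhere in the string, the `.html` variant works the same way as the path and query-string variants.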

Caveat: page 1 is rarely (read: never) shown as such in the URL, because it is the first page of a section and is served as if it were an index.html. A page 1 URL often does exist, though, if you type it into the address bar.

If page 1 really is an index.html, we can always fall back on flyscrape's multiple start URLs:

urls: [
"https://example.com/products/shoes/page1/",
"https://example.com/products/shoes/page[2-3]",
],

Reasons for this enhancement:

from:
querystring = {"currentPage": "1"}  # other API JSON fields omitted

to:
for x in range(1, 12):
    querystring = {"currentPage": f"{x}"}  # other API JSON fields omitted

or not touching anything but the URL, since that one can be defined on its own:

url: "https://example.com/api/?page=[1-3]",

Sometimes an API lets you define how many items a single request should return, so you can calculate the number of requests. For example, if an API has 1300 items, you can request 50 or 100 items per request; the number of requests is then 26 or 13:

url: "https://example.com/api/?items=50&page=[1-26]",
url: "https://example.com/api/?items=100&page=[1-13]",
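
The arithmetic is a ceiling division of the item count by the page size. A minimal sketch, assuming the `items` and `page` query parameters from the example URLs; `requestCount` and `apiPageUrls` are hypothetical names, not flyscrape functions:

```javascript
// Hypothetical helper: how many requests are needed to fetch every item.
// Math.ceil covers a final, partially filled page.
function requestCount(totalItems, itemsPerRequest) {
  return Math.ceil(totalItems / itemsPerRequest);
}

// Hypothetical helper: builds one API URL per page, starting at page 1.
// The base URL and parameter names are taken from the example above.
function apiPageUrls(totalItems, itemsPerRequest) {
  const pages = requestCount(totalItems, itemsPerRequest);
  return Array.from(
    { length: pages },
    (_, i) => `https://example.com/api/?items=${itemsPerRequest}&page=${i + 1}`
  );
}

console.log(requestCount(1300, 50));  // 26
console.log(requestCount(1300, 100)); // 13
```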

In combination with filters (if the API has them), you limit the number of requests to only what you need, saving lots of time.

https://example.com/api/?items=50&filter=color&page=[1-5]

philippta commented 7 months ago

Thanks for the detailed suggestion, I can see the use-case is very valid.

Aside from all the technical challenges that parsing and interpreting these URLs would bring, I don't see this as a core feature for flyscrape, as you can already achieve this today with a small JavaScript function.

export const config = {
  urls: [
    "https://example.com/products/shoes/",
    ...range("https://example.com/products/shoes/page{}.html", 1, 3),

    "https://example.com/products/hats/",
    ...range("https://example.com/products/hats/page{}.html", 1, 3),
  ],
};

// range function turns urls like: http://example.com/{} into a list like so:
// - http://example.com/1
// - http://example.com/2
// - http://example.com/3
// - ...
function range(url, from, to) {
  return Array.from({length: to - from + 1}).map((_, i) => url.replace("{}", i + from));
}

export default function({ doc, absoluteURL }) {
  // ...
}
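
As a usage sketch, the same approach should also cover the API-style URLs from the original request. The helper is repeated here under the hypothetical name `pageRange` so the snippet stands on its own; it behaves like the `range` function above. Note that `replace` with a string pattern only substitutes the first `{}` in the template.

```javascript
// Same idea as the range helper above, renamed (hypothetically) to pageRange
// so this snippet is self-contained.
function pageRange(url, from, to) {
  return Array.from({ length: to - from + 1 }, (_, i) =>
    url.replace("{}", String(i + from)) // only the first "{}" is replaced
  );
}

// 1300 items at 100 items per request: pages 1 through 13
const apiPages = pageRange("https://example.com/api/?items=100&page={}", 1, 13);

console.log(apiPages.length); // 13
console.log(apiPages[0]);     // https://example.com/api/?items=100&page=1
```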