philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0

suggestions #31

Closed: dynabler closed this issue 7 months ago

dynabler commented 7 months ago

Just started using this scraper and have some suggestions. I'm mostly familiar with point-and-click web scrapers, but I'm hitting limitations, hence moving on to scriptable scrapers. My suggestions are in bold, in case of TL;DR.

flyscrape run hackernews.js > hackernews.json

Regards,

philippta commented 7 months ago

Hey @dynabler! Thanks for taking the time to reach out and provide such extensive feedback. I highly appreciate it.

In the doc files, there's no mention of collected data being saved or where it's stored. Could this be added?

Yes, definitely. The docs have been lagging behind a bit, but I am aware of how important they are.

I only managed to get a JSON file by using this on the command line:

For a while now it has been possible to specify the output filename and output format, either in the script itself or via command-line arguments.

# Set the output format to ndjson.
$ flyscrape run example.js --output.format ndjson

# Write the output to a file.
$ flyscrape run example.js --output.file results.json

The equivalent configuration would be:

export const config = {
    output: {
        file: "results.json",
        format: "ndjson",
    },
};

Currently there are only two supported output formats: json and ndjson.
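To illustrate the difference (the records here are made up), json wraps all records in one array:

[
  {"title": "Example post", "url": "https://example.com/1"},
  {"title": "Another post", "url": "https://example.com/2"}
]

while ndjson writes one object per line:

{"title": "Example post", "url": "https://example.com/1"}
{"title": "Another post", "url": "https://example.com/2"}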

I've considered and tried to implement other options like CSV but couldn't get the details right about handling nested data. Perhaps this will come in the future when things are more clear, but I can't say when.

The website seems to be targeted at developers only, which is fine. However, there's also a large group of DIY coders that just need to get things working and are beginners. [...] Perhaps having a website footer with various links (about us, contact, suggestions etc.) would be helpful.

Thanks for bringing this up, this is a great idea. When working on the docs, I will try to make them more beginner-friendly and add some contact options. Possibly my email address would be a good start.

It's a good idea to also have rotating headers (besides the already existing rotating proxies). Some websites don't have anti-scrape measures, but the webmaster does check the logs for suspicious same-IP, same-header activity.

I can't quite follow what you mean by rotating header. Do you mean the user-agent (i.e. the browser identification)? Couldn't webmasters just block your IP address if they wanted?

On a side note: I have always wondered what practical examples of anti-scraping measures look like. Do you know of any server-software or websites that have these in place?

Do the rotating proxies ditch the blocked ones? (I think crawlee has such a feature.) If not, consider this a suggestion.

At the moment, no. Good suggestion, I will put it on the list.
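Roughly, something like this could work (just a sketch, not an existing flyscrape feature; the class name and the failure threshold are made up):

// Sketch: a proxy pool that ditches a proxy after repeated failures.
class ProxyPool {
    constructor(proxyUrls, maxFailures = 3) {
        this.proxies = proxyUrls.map((url) => ({ url, failures: 0 }));
        this.maxFailures = maxFailures;
    }

    // Pick a random proxy that has not been ditched yet.
    next() {
        if (this.proxies.length === 0) {
            throw new Error("no proxies left in the pool");
        }
        return this.proxies[Math.floor(Math.random() * this.proxies.length)];
    }

    // Report a failed request; drop the proxy once it hits the threshold.
    reportFailure(proxy) {
        proxy.failures++;
        if (proxy.failures >= this.maxFailures) {
            this.proxies = this.proxies.filter((p) => p !== proxy);
        }
    }
}

const pool = new ProxyPool(["http://proxy-1:8080", "http://proxy-2:8080"]);
const proxy = pool.next();
// ...make the request through proxy.url; on a block or error:
// pool.reportFailure(proxy);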

In the same manner as rotating headers and proxies, maybe a rotating rate limit is also an excellent idea, even if it's only a 0.001 sec difference. Some websites only use the request rate for (temporary) blocking.

This is interesting as I have never heard of this before. I think a more generic approach for this would be adding "jitter" to the request delay.
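Something like this is what I have in mind, as a sketch (the numbers and the function are illustrative, not an existing config option):

// Sketch: add random jitter to a fixed delay between requests.
function jitteredDelay(baseDelayMs, jitterMs) {
    // Random offset in the range [-jitterMs, +jitterMs].
    const offset = (Math.random() * 2 - 1) * jitterMs;
    return Math.max(0, baseDelayMs + offset);
}

// e.g. wait roughly 1000 ms between requests, +/- up to 250 ms:
// await new Promise((resolve) => setTimeout(resolve, jitteredDelay(1000, 250)));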

You can always offer a hosted proxy service when it's done, or when looking for beta testers. Currently, this is my last challenge.

Perfect timing, as I have just finished building the Flyscrape hosted proxy service. It's still in an early stage and missing a lot of marketing material, but if you like, you can be beta tester no. 1.

Here are some features:

Head over to https://app.flyscrape.com/login and sign in using GitHub. You will see a dashboard with all instructions on how to use the proxies.

I'm looking forward to hearing back from you.

(Screenshot of the dashboard. For anyone seeing this: the API key in the screenshot has been burned.)

dynabler commented 7 months ago

Hey @dynabler! Thanks for taking the time to reach out and provide such extensive feedback. I highly appreciate it.

Thank you for your reply and this software. Please keep in mind that I'm more familiar with point-and-click web scrapers, so the code and suggestions below are not from an experienced coder.

I've considered and tried to implement other options like CSV but couldn't get the details right about handling nested data. Perhaps this will come in the future when things are more clear, but I can't say when.

Forgive me if I misunderstand what nested data means. With nested data, you can group fields together. Example:

Name: John Doe
Address: Hollywood Blvd.
Phone: 012345678

would become in a CSV cell: [{"Name":"John Doe"},{"Address":"Hollywood Blvd."},{"Phone":"012345678"}]

A user can create a dictionary from that one cell, and output it in any way. Example:

{{ $data := input.csv }}
{{ $nap := dict "name" 1 "address" 2 "phone" 3 }}
{{ range $data }}
    name: {{ index . (index $nap "name") }}
    {{/* rest of the code */}}
{{ end }}

You can also do Name:John Doe-Address:Hollywood Blvd.-Phone:012345678. In spreadsheet software (OpenOffice/Excel), a user can then do a Text to Columns conversion, which splits one column into multiple columns based on a separator, in this case a dash. After that, a user can remove Name, Address, and Phone and add them to the header row.

In any case, if it's in a cell, users can do whatever is needed to get it working. webscraper.io uses the first example, a JSON blob inside a CSV cell.
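As a rough sketch of that idea (the function and the sample record below are made up, not how flyscrape currently works), nested values could simply be JSON-encoded into a single CSV cell:

// Sketch: flatten a record into one CSV row, JSON-encoding any nested values.
function toCsvRow(record) {
    return Object.values(record)
        .map((value) =>
            typeof value === "object" && value !== null
                ? JSON.stringify(value) // nested data becomes a JSON string in one cell
                : String(value)
        )
        .map((cell) => `"${cell.replace(/"/g, '""')}"`) // escape quotes for CSV
        .join(",");
}

const record = {
    title: "Example",
    contact: { name: "John Doe", address: "Hollywood Blvd.", phone: "012345678" },
};

console.log(toCsvRow(record));
// -> "Example","{""name"":""John Doe"",""address"":""Hollywood Blvd."",""phone"":""012345678""}"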

It's a good idea to also have rotating headers (besides the already existing rotating proxies). Some websites don't have anti-scrape measures, but the webmaster does check the logs for suspicious same-IP, same-header activity.

I can't quite follow what you mean by rotating header. Do you mean the user-agent (i.e. the browser identification)? Couldn't webmasters just block your IP address if they wanted?

Yes, I mean browser identification, or fingerprinting. Sure, they can block an IP address, but do you know any webmaster that makes that decision lightly? ;-) What if they're mistaken? Webmasters are not network experts. However, they can block a very specific header. I can still access a website, but cannot download the images (could be totally unrelated). I was just thinking that if everything a scraper can be identified by gets rotated, it should be cheaper to overcome low-level anti-scraping methods than using proxies.
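Something along these lines is what I mean, as a minimal sketch (the user-agent list and the helper are just illustrative, not a flyscrape feature):

// Sketch: rotate the User-Agent header on every request.
const userAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/125.0.0.0 Safari/537.36",
];

function randomUserAgent() {
    return userAgents[Math.floor(Math.random() * userAgents.length)];
}

async function fetchWithRotatedUserAgent(url) {
    return fetch(url, { headers: { "User-Agent": randomUserAgent() } });
}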

On a side note: I have always wondered what practical examples of anti-scraping measures look like. Do you know of any server-software or websites that have these in place?

I will take “practical” in a broad sense of the word. Just off the top of my head:

- Detecting the rhythm of reloads/requests/clicks. If a user/system requests every 5 seconds, consistently, it shows a CAPTCHA, for example.
- Changing the CSS selectors once in a while.
- One very aggressive anti-scraping measure I have come across by accident is an invisible div laid over the whole page; this div is fetched dynamically and pretty much prevents anything. Checkout pages use this, mostly on online subscription websites.
- Cloudflare uses Server Side Excludes (SSE). A webmaster can use <!--sse--><!--/sse--> to protect parts of their website from bots. Cloudflare Scrapeshield
- Disabling sitemap.xml, so a user doesn't have an overview of all URLs. A second method is that the server only serves the sitemap to allowed bots, by checking whether an honest bot is making the request. If not, a user sees a 404. link
- Removing list pages. A website consists mostly of single pages (about-us, contact, product info) and list pages (one page listing all the blog posts). By removing the list page, the webmaster prevents quick discovery of the links. A large website without a list page for its biggest section is one such anti-scraping method. This method is very effective; once, it took me more than a year to find a loophole.
- Having a noarchive meta tag, which prevents archive.org from listing all the URLs, or having the URLs removed from archive.org.

One method won't work on its own. It's usually a combination of multiple anti-scraping methods. I haven't seen any that uses AI, but I'm sure it's out there. All can be defeated with proxies.

Anti-Scraping Methods: there are multiple levels of anti-scraping, with rate limiting being the easiest for webmasters to implement (hence my jitter suggestion below). Then comes fingerprinting, which has levels of its own (hence my suggestion for browser identification rotation).

In the same manner as rotating headers and proxies, maybe a rotating rate limit is also an excellent idea, even if it's only a 0.001 sec difference. Some websites only use the request rate for (temporary) blocking.

This is interesting as I have never heard of this before. I think a more generic approach for this would be adding "jitter" to the request delay.

You read my mind.

philippta commented 7 months ago

A number of improvements have been created as individual tickets.

dynabler commented 6 months ago

One method won't work on its own. It's usually a combination of multiple anti-scraping methods. I haven't seen any that uses AI, but I'm sure it's out there. All can be defeated with proxies.

I finally found two that use AI: Google Captcha and a far more sophisticated one, hCaptcha.

Here is also a list of all the CAPTCHAs available:

Normal CAPTCHAs take about 2 seconds to kick in, which also means they're mostly used for forms and logins. They're not used often for regular website visits, because of how disruptive they are for the user experience. Anything with CAPTCHA in the name (except hCaptcha) takes about 28 seconds to kick in, and hCaptcha & others take about 18 seconds to kick in.

Twitter, Facebook, Amazon etc. can have their own custom captcha or scraper detection, considering the vast amount of resources they have available. Facebook takes 8 hours to kick in, with a 3-day ban?

On a side note, I have come across a website that used ReCaptcha and hCAPTCHA at the same time. It kicked in just shy of 3 min. I guess the moral of the story is, slow down scraping? In that case, flyscrape is still not slow enough! ;-)