tusharojha / web_scraper

A very basic web scraper implementation to scrap html elements from a web page.
https://pub.dev/packages/web_scraper
Apache License 2.0

SocketException: Connection Failed, on Flutter Desktop and Flutter Web #59

Open Hecsall opened 3 years ago

Hecsall commented 3 years ago

Hi @tusharojha, I was trying to add desktop and web support to my existing Flutter project when I noticed that the web_scraper package throws an error when calling the loadWebPage method.

The specific error is: SocketException: Connection failed (OS Error: Operation not permitted, errno = 1), address = m*******.com, port = 443

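The same exception happens even with basic web_scraper usage like:

import 'package:web_scraper/web_scraper.dart';

Future<void> main() async {
  final webScraper = WebScraper('https://www.google.com');
  // loadWebPage returns true when the page was loaded successfully.
  if (await webScraper.loadWebPage('/search/')) {
    print('test');
  }
}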

I'm pretty sure it's not related to the website I'm scraping, since the same code works fine on Android and iOS. Maybe it's not fully implemented for desktop and web usage? Or am I missing something?

Thanks!
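For anyone trying to reproduce this, here is a minimal sketch that surfaces the fields of the exception instead of only its string form. It talks to dart:io's HttpClient directly rather than going through web_scraper, and describeSocketException is a hypothetical helper name, not part of any package. Because it imports dart:io, it is only meant for mobile and desktop targets.

```dart
import 'dart:io';

/// Hypothetical helper: summarises the fields carried by a SocketException.
String describeSocketException(SocketException e) {
  final os = e.osError;
  final osPart =
      os == null ? 'no OS error' : 'OS error ${os.errorCode}: ${os.message}';
  return '${e.message} ($osPart), port = ${e.port}';
}

Future<void> main() async {
  final client = HttpClient();
  try {
    final request =
        await client.getUrl(Uri.parse('https://www.google.com/search/'));
    final response = await request.close();
    print('Status: ${response.statusCode}');
    // Consume the body so the connection can be released.
    await response.drain<void>();
  } on SocketException catch (e) {
    // On an affected platform this branch is expected to run.
    print(describeSocketException(e));
  } finally {
    client.close();
  }
}
```

If the OS error code is 1 (operation not permitted), the request was refused locally before reaching the network, which points at the platform rather than the website.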

Hecsall commented 3 years ago

I've got some updates regarding this issue. Today I had time to look at the code, and I noticed the problem actually comes from the http library, specifically from the usage of Client(). I tried the basic http library example, which calls http.get() directly, and it works:

import 'package:http/http.dart' as http;

var url = Uri.parse('https://m*****.com/search/');
var response = await http.get(url);
print('Response body: ${response.body}');

But when I try the Client() version, which is also what web_scraper uses internally, it throws the exception:

var client = http.Client();
try {
  var url = Uri.parse('https://m*****.com/search/');
  var response = await client.get(url);
  print(response.body);
} finally {
  client.close();
}

Funnily enough, everything works on Windows, but not on macOS or Flutter Web.

I'll close this issue since there's not much you can do from your side. I'll reopen it in the http library repo.

Hecsall commented 3 years ago

Sorry to reopen this, but I did some investigation, and here's a really short version of what I managed to discover:

I tested a simple request using the HttpClient built into dart:io directly and, as I expected, it fails the same way web_scraper fails.

import 'dart:io';

var url = Uri.parse("https://www.google.com/");
HttpClient client = new HttpClient();

client.getUrl(url).then((HttpClientRequest request) {
  return request.close();
}).then((HttpClientResponse response) {
  print(response);
});

As noted in my previous message, this client somehow manages to work on Flutter for Windows, Android, and iOS, but not on macOS and Web. In fact, I checked the dart:io documentation, and at the top it states this:

Important: Browser-based apps can't use this library. Only the following can import and use the dart:io library: servers, command-line scripts, Flutter mobile apps, Flutter desktop apps.

So I guess it's wrong to state that web_scraper works for Flutter Web, since it uses http, which uses dart:io. I'll check with the folks at the dart:io repo why macOS isn't working.

Another possible issue, separate from this one (I can move it to a new issue if necessary): I looked at the code inside web_scraper and noticed that the http Client() is used but never closed, contrary to what the http documentation recommends. Quotes below.

http readme: If you're making multiple requests to the same server, you can keep open a persistent connection by using a Client rather than making one-off requests. If you do this, make sure to close the client when you're done.

http documentation: close() → void. Closes the client and cleans up any resources associated with it. [...]

So the question is: isn't it a problem that client.close() is never called?
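To make the question concrete, here is a rough sketch of a scraper-like class that owns its Client and gives callers a way to release it. PageFetcher and its members are hypothetical names used only for illustration; this is not how web_scraper is actually written.

```dart
import 'package:http/http.dart' as http;

/// Hypothetical wrapper, not the actual web_scraper implementation.
class PageFetcher {
  /// A client can be injected (for example in tests); otherwise one is created.
  PageFetcher(this.baseUrl, {http.Client? client})
      : _client = client ?? http.Client();

  final String baseUrl;
  final http.Client _client;

  /// Fetches [route] relative to [baseUrl], reusing the same client.
  Future<String> fetch(String route) async {
    final response = await _client.get(Uri.parse(baseUrl).resolve(route));
    return response.body;
  }

  /// Releases the underlying client; call this when scraping is finished.
  void close() => _client.close();
}

Future<void> main() async {
  final fetcher = PageFetcher('https://www.google.com');
  try {
    final body = await fetcher.fetch('/search/');
    print('Fetched ${body.length} characters');
  } finally {
    fetcher.close();
  }
}
```

The try/finally in main mirrors the pattern from the http readme: the client is closed even if the request throws.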

Hope this is somewhat useful!

tusharojha commented 3 years ago

Hi @Hecsall, thanks for filing the issue, and I appreciate your research. I am busy for a few days, but I will surely look into this. I'm not sure whether the missing client.close() is the reason for the failure; it would be great if you could fork the repo and help me confirm it.

Another solution could be migrating the project from the http client to the dio library.

I will get back on this next week for sure.

Hecsall commented 3 years ago

Hi @tusharojha, thanks for the response! The client.close() problem is not related to the http library issue; I was just pointing out something I noticed while doing my research. I think it could cause performance issues (not a Dart expert, don't quote me on that 😂), but it has nothing to do with the main issue in the title. Sorry for the confusion.

I opened an issue in the Dart SDK repo for the dart:io HttpClient, and then discovered that it's a permission issue: a permission that has to be enabled inside the Xcode project files. So the macOS issue is fixed 🎉, while web compatibility is still an issue. But yes, if the goal of web_scraper is to be efficient and usable on every "type" of Flutter app (Mobile, Desktop, Web), switching to another, more "compatible" library is a possibility. I also found out about universal_io, an alternative to dart:io that should work everywhere, but I have yet to test it.

Hecsall commented 3 years ago

After further investigation, I learned that 90% of the issues with web_scraper on Flutter Web are related to CORS: a Flutter Web app runs inside the browser, so its requests to other websites are subject to the browser's same-origin policy, and the responses are blocked unless the target site explicitly allows them. You can tell whether an error is a CORS error by opening the Developer Tools in the Google Chrome window, where there will be errors like

Access to XMLHttpRequest at 'http://somewebsite.com' from origin 'http://localhost:64564' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
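As a rough illustration of when the browser applies this check, the sketch below compares the origin of the page (scheme, host, and port) with the origin of the target URL, using the two URLs from the error message above. isCrossOrigin is a hypothetical helper written for this comment. It only shows when a request counts as cross-origin; whether the request then succeeds still depends on the headers the target server sends.

```dart
/// Hypothetical helper: two URLs share an origin only when scheme, host
/// and port all match.
bool isCrossOrigin(Uri page, Uri target) =>
    page.scheme != target.scheme ||
    page.host != target.host ||
    page.port != target.port;

void main() {
  final page = Uri.parse('http://localhost:64564');
  final target = Uri.parse('http://somewebsite.com');

  // Different host and port, so the browser treats this as cross-origin.
  print(isCrossOrigin(page, target)); // true

  // Same scheme, host and port: not cross-origin, no CORS check involved.
  print(isCrossOrigin(page, Uri.parse('http://localhost:64564/other'))); // false
}
```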

The second issue I noticed is that webScraper.loadWebPage() does NOT support ports specified in the URL. I opened a separate issue for that here: https://github.com/tusharojha/web_scraper/issues/63
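For reference, Dart's own Uri class keeps an explicit port when a route is resolved against a base URL. The sketch below only demonstrates that behaviour; buildPageUri is a hypothetical helper and the localhost URL is an assumed example, and it says nothing about how web_scraper builds its URLs internally.

```dart
/// Hypothetical helper, not part of web_scraper: joins a base URL and a
/// route using Uri.resolve, which preserves the base URL's port.
Uri buildPageUri(String baseUrl, String route) =>
    Uri.parse(baseUrl).resolve(route);

void main() {
  final withPort = buildPageUri('http://localhost:8080', '/search/');
  print(withPort); // http://localhost:8080/search/
  print(withPort.port); // 8080

  // Without an explicit port, Uri reports the scheme's default port.
  final withoutPort = buildPageUri('https://www.google.com', '/search/');
  print(withoutPort.hasPort); // false
  print(withoutPort.port); // 443
}
```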

d-apps commented 2 years ago

Has anyone tested this with the dio or universal_io libraries?