spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License
1.16k stars 101 forks

Disable OpenSSL completely #209

Closed DimitriTimoz closed 2 months ago

DimitriTimoz commented 2 months ago

Despite having the reqwest_rustls_tls feature enabled and default-features set to false, OpenSSL was still used by reqwest because hyper always uses it.

To disable OpenSSL, you need to disable reqwest's default features (its default-tls feature is on by default) and enable the reqwest_rustls_tls feature.

In my case, OpenSSL was the cause of numerous segmentation faults, double frees, or corruption issues.

j-mendez commented 2 months ago

Hi, thank you for the PR!

Enabling HTTP/2 and the two SSL features by default isn't ideal, as these are outside the scope of this crate. It appears you may be scraping content and running into memory issues on the server, which can lead to various problems. I recommend using "crawl" with subscriptions and performing the operations in smaller chunks instead.

Thank you for understanding!

DimitriTimoz commented 2 months ago

Yes, I understand, but I lose the utility of the smart feature and it's more expensive. And the segfaults could happen again.

j-mendez commented 2 months ago

> Yes, I understand, but I lose the utility of the smart feature and it's more expensive.

> And the segfaults could happen again.

There are methods for smart crawling: website.crawl_smart.

DimitriTimoz commented 2 months ago

> There are methods for smart crawling: website.crawl_smart.

Yes, but I also need to retrieve the HTML content which is sometimes retrieved by an API and then inserted into the HTML.

j-mendez commented 2 months ago

> > There are methods for smart crawling: website.crawl_smart.

> Yes, but I also need to retrieve the HTML content which is sometimes retrieved by an API and then inserted into the HTML.

Scraping doesn't change that. You still get the same HTML in the subscription.

DimitriTimoz commented 2 months ago

> Scraping doesn't change that. You still get the same HTML in the subscription.

Okay, too bad.

DimitriTimoz commented 2 months ago

Maybe you should warn other developers in the docs about the segfaults when OpenSSL is used.

j-mendez commented 2 months ago

> Maybe you should warn other developers in the docs about the segfaults when OpenSSL is used.

It has nothing to do with OpenSSL. You are scraping and holding onto content in a controlled memory environment. When memory runs out or runs low, the failure shows up in OpenSSL because it sits near the lowest layer of the OS stack.

Use the crawl methods with subscriptions instead of the scrape calls. The scrape and crawl methods can do exactly the same things.

If someone really wants to hold onto all the content in memory try adding some swap.

The subscription example uses crawl and you get the same page object back.
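
The difference between holding every page in memory and handling pages as they arrive can be sketched with the standard library alone. This is a rough analogy, not spider's API: `Page` and `stream_pages` are hypothetical names, and a bounded `std::sync::mpsc` channel stands in for the subscription.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a crawled page; not spider's `Page` type.
struct Page {
    url: String,
    html: String,
}

/// Streams `total` fake pages through a bounded channel and returns
/// (pages processed, total HTML bytes seen). At most `capacity` pages
/// are queued at any time, so the full result set is never held at once.
fn stream_pages(total: usize, capacity: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<Page>(capacity);

    // Producer: plays the role of the crawler emitting pages as they arrive.
    let producer = thread::spawn(move || {
        for i in 0..total {
            let page = Page {
                url: format!("https://example.com/{i}"),
                html: format!("<html><body>page {i}</body></html>"),
            };
            // `send` blocks while the channel is full, which bounds memory use.
            if tx.send(page).is_err() {
                break;
            }
        }
    });

    // Consumer: handles each page and drops it before taking the next one.
    let mut count = 0;
    let mut bytes = 0;
    for page in rx {
        println!("processed {} ({} bytes)", page.url, page.html.len());
        bytes += page.html.len();
        count += 1;
    }

    producer.join().expect("producer thread panicked");
    (count, bytes)
}

fn main() {
    let (count, bytes) = stream_pages(5, 2);
    println!("{count} pages, {bytes} bytes in total");
}
```

The consumer loop is where per-page work (parsing, writing to disk or a database) would go; whether this pattern fixes the crashes discussed here is a separate question that the thread leaves open.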

DimitriTimoz commented 2 months ago

> with OpenSSL. You are scraping and holding onto content in a controlled memory

It's not a lack of memory. I have 50 GB and my program doesn't use more than 1.5 GB before the segfault. And I don't understand why, but since I disabled OpenSSL by using rustls, no more segfaults occur.

j-mendez commented 2 months ago

> > with OpenSSL. You are scraping and holding onto content in a controlled memory

> It's not a lack of memory. I have 50 GB and my program doesn't use more than 1.5 GB before the segfault.

> And I don't understand why, but since I disabled OpenSSL by using rustls, no more segfaults occur.

You can also pin a custom version of reqwest and switch away from OpenSSL from your own crate.

DimitriTimoz commented 2 months ago

> > > with OpenSSL. You are scraping and holding onto content in a controlled memory

> > It's not a lack of memory. I have 50 GB and my program doesn't use more than 1.5 GB before the segfault. And I don't understand why, but since I disabled OpenSSL by using rustls, no more segfaults occur.

> You can also pin a custom version of reqwest and switch away from OpenSSL from your own crate.

Yes, that's what I did, but try running `cargo tree -i openssl` and you will still see openssl being used by reqwest despite your reqwest_rustls_tls feature.

DimitriTimoz commented 2 months ago

> Enabling HTTP/2 and the two SSL features by default isn't ideal,

Sorry, I forgot to mention this: http2 is enabled by default in reqwest and you are already using it. I had to disable reqwest's default features in order to use rustls, because default-tls is enabled by default, so I re-enabled http2 explicitly. Without disabling the defaults, your reqwest_rustls_tls feature doesn't work.
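
The point about default features can be illustrated with a small model of how Cargo unifies features: a shared crate is built once with the union of everything its dependents request, so defaults stay on unless every declaration opts out. The sketch below is a simplified model, not Cargo's resolver. `DepRequest` and `unified_features` are hypothetical names, and the default set and the `rustls-tls` spelling are assumptions based on the features named in this thread.

```rust
use std::collections::BTreeSet;

/// One dependency declaration on a shared crate (hypothetical model type).
struct DepRequest<'a> {
    default_features: bool,
    features: &'a [&'a str],
}

/// Models feature unification: the result is the union of every feature
/// requested by any dependent, plus the defaults if anyone left them on.
fn unified_features(requests: &[DepRequest], defaults: &[&str]) -> BTreeSet<String> {
    let mut enabled = BTreeSet::new();
    for req in requests {
        if req.default_features {
            enabled.extend(defaults.iter().map(|f| f.to_string()));
        }
        enabled.extend(req.features.iter().map(|f| f.to_string()));
    }
    enabled
}

fn main() {
    // Assumed default set, for illustration only.
    let reqwest_defaults = ["default-tls", "http2"];

    // Defaults left on: adding rustls does not remove default-tls.
    let defaults_on = [DepRequest { default_features: true, features: &["rustls-tls"] }];

    // Defaults off: http2 has to be requested again explicitly.
    let defaults_off = [DepRequest { default_features: false, features: &["rustls-tls", "http2"] }];

    // Mixed: one dependent opting out is not enough if another keeps defaults.
    let mixed = [
        DepRequest { default_features: false, features: &["rustls-tls"] },
        DepRequest { default_features: true, features: &[] },
    ];

    println!("defaults on:  {:?}", unified_features(&defaults_on, &reqwest_defaults));
    println!("defaults off: {:?}", unified_features(&defaults_off, &reqwest_defaults));
    println!("mixed:        {:?}", unified_features(&mixed, &reqwest_defaults));
}
```

Under this model, only the second case ends up without `default-tls`, which matches the observation that enabling a rustls feature alone leaves OpenSSL in the dependency tree.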

j-mendez commented 2 months ago

> > Enabling HTTP/2 and the two SSL features by default isn't ideal,

> Sorry, I forgot to mention this: http2 is enabled by default in reqwest and you are already using it. I had to disable reqwest's default features in order to use rustls, because default-tls is enabled by default, so I re-enabled http2 explicitly. Without disabling the defaults, your reqwest_rustls_tls feature doesn't work.

I'm not keeping up with the flags on reqwest; they have a handful and they change between versions. When you use HTTP/2 in the configuration or set it by default, it will not work for every website when using `website.with_http2_prior_knowledge(true)`. I'm not aware whether that exact flag enables it by default for everyone at the latest version.

When you pin the version of the deps from your own crate, the way spider is set up, the build will override the crates.