tosdr / edit.tosdr.org

👍👎 A new web app to rate services
https://edit.tosdr.org
GNU Affero General Public License v3.0

Crawl pages built dynamically by JavaScript #946

Status: Closed (augusto-herrmann closed 3 years ago)

augusto-herrmann commented 3 years ago

I've come across a site whose terms of service and privacy policy can't be crawled because they're built dynamically with JavaScript.

Should edit.tosdr.org employ some technique to crawl those sites (maybe Selenium?), or will dynamically loaded pages simply not be supported?

Examples:

michielbdejong commented 3 years ago

Hi! We're in the process of switching to https://github.com/tosdr/tosback-crawler so then we'll be able to use https://github.com/ambanum/CGUs/blob/656654545b95781d2b736dd96f516e6fd52f1275/scripts/validation/service.schema.js#L120 for this! :)

Phabbits commented 3 years ago

Any update on this issue? In the related "Crawling Errors & Crawling Update" forum topic, it was mentioned that the internal server error issue was fixed yesterday.

tosdrbot commented 3 years ago

This issue has been mentioned on ToS;DR Forum. There might be relevant details there:

https://forum.tosdr.org/t/recrawling-and-deleting-documents/800/1

JustinBack commented 3 years ago

> Any update on this issue? In the related "Crawling Errors & Crawling Update" forum topic, it was mentioned that the internal server error issue was fixed yesterday.

I will try to implement a hacky solution with the current Poltergeist-based crawler so that documents rendered by JavaScript can be crawled.

JustinBack commented 3 years ago

This has been implemented in d69d71a6b585f9a53581d6baf2a3b6abca77d31b
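
For readers wondering what a headless-browser fallback of the kind discussed in this thread might look like, here is a minimal sketch. It is not the code from the commit above, and it uses Selenium (suggested in the original question) rather than Poltergeist. The function names `visible_text`, `needs_js_rendering` and `fetch_rendered`, the 200-character threshold, and the use of headless Chrome are all assumptions made for illustration. The idea: if the static HTML contains almost no readable text, assume the page is built client-side and load it in a real browser instead.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text a reader would see, skipping script/style/noscript content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the readable text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_js_rendering(html, min_chars=200):
    """Heuristic (assumed threshold): very little visible text in the
    static HTML suggests the document is built by JavaScript."""
    return len(visible_text(html)) < min_chars


def fetch_rendered(url, timeout=15):
    """Load `url` in headless Chrome and return the DOM after scripts ran.

    Requires the `selenium` package and a local Chrome install; raises
    selenium's TimeoutException if the page never gains enough text.
    """
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until the rendered DOM contains a meaningful amount of text.
        WebDriverWait(driver, timeout).until(
            lambda d: not needs_js_rendering(d.page_source)
        )
        return driver.page_source
    finally:
        driver.quit()
```

A crawler could call `needs_js_rendering` on the plain HTTP response first and only pay the cost of `fetch_rendered` when the check returns `True`.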