skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[BUG] BrowserFetcher is still not working on Android #182

Closed: kazemcodes closed this issue 1 year ago

kazemcodes commented 2 years ago

Here is the error I get when using BrowserFetcher. I think the error is caused by htmlunit-android:

2022-04-12 21:07:05.566 5395-5451/ir.kazemcodes.infinityreader E/AndroidRuntime: FATAL EXCEPTION: DefaultDispatcher-worker-2
    Process: ir.kazemcodes.infinityreader, PID: 5395
    java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/datatransfer/ClipboardOwner;
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303)
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0)
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812)
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140)
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131)
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298)
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413)
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19)
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
     Caused by: java.lang.ClassNotFoundException: Didn't find class "java.awt.datatransfer.ClipboardOwner" on path: DexPathList[[dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes4.dex", dex file "/data/data/ir.kazemcodes.infinityreader/code_cache/.overlay/base.apk/classes11.dex", zip file "/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk"],nativeLibraryDirectories=[/data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/lib/arm64, /data/app/~~frwX1pOecaUkVEBUDn-uGQ==/ir.kazemcodes.infinityreader-nayBeOZyEhA8jqrDIdfXeQ==/base.apk!/lib/arm64-v8a, /system/lib64, /system/system_ext/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:207)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.handleCharacters(HtmlUnitNekoDOMBuilder.java:593) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:303) 
        at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source:146) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.startElement(HtmlUnitNekoDOMBuilder.java:289) 
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source:0) 
        at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.startElement(HTMLTagBalancer.java:812) 
        at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.startElement(DefaultFilter.java:140) 
        at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.startElement(NamespaceBinder.java:278) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2811) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2131) 
        at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443) 
        at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394) 
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source:5) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758) 
        at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:204) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:298) 
        at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:218) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:686) 
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:588) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:506) 
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413) 
        at it.skrape.fetcher.BrowserFetcher.fetch(BrowserFetcher.kt:19) 
        at org.ireader.presentation.feature_library.presentation.LibraryScreenKt$LibraryScreen$3$2$1$1$1$2$1.invokeSuspend(LibraryScreen.kt:157) 
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) 
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678) 
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665) 

kazemcodes commented 2 years ago

I think the issue is because of this

https://github.com/HtmlUnit/htmlunit/issues/448

kazemcodes commented 2 years ago

Can skrapeit bypass Cloudflare protection?

kazemcodes commented 2 years ago

I think the issue is because of this

HtmlUnit/htmlunit#448

The issue is already fixed in html-unit-2.59.0-SNAPSHOT, but there is a new problem in that build: not all sites throw this exception, but some complex URLs do, for example https://pstbn.top/?c865fde3461094d1#2hAyyUKtXm72BHLzyzhq7UBug9YMP1FCgnJccA8YyQ2n

christian-draeger commented 2 years ago

OK, I see. Thx for finding this. I will check whether we have everything that @rbri suggests; if not, I will add it and make a new release.

rbri commented 2 years ago

Will have a look at https://pstbn.top/?c865fde3461094d1#2hAyyUKtXm72BHLzyzhq7UBug9YMP1FCgnJccA8YyQ2n.

rbri commented 2 years ago

I got here

ScriptException: missing ; before statement (https://pstbn.top/js/zlib-1.2.11.js#6)

Do you see the same?

kazemcodes commented 2 years ago

I got here

ScriptException: missing ; before statement (https://pstbn.top/js/zlib-1.2.11.js#6)

Do you see the same?

I only got this exception: java.lang.NoClassDefFoundError: Failed resolution of: Ljava/awt/datatransfer/ClipboardOwner

I have a question regarding htmlunit: is there any way to make the headless browser hold off until a certain HTML tag is present or some other criterion is fulfilled before the HTML is fetched?

Actually, skrapeit is using htmlunit 2.59.0, which throws this exception. Higher versions require a higher API level (Android O); I haven't tested that version.
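
On the question above about waiting until a tag or some other criterion is present: one generic approach is to poll with a timeout. The sketch below is not taken from skrape.it or HtmlUnit; the class name `WaitUntil` and the method `poll` are hypothetical, and the "page" in `main` is a stand-in string supplier. With HtmlUnit, the probe would presumably re-read the page on each attempt. HtmlUnit also offers `WebClient.waitForBackgroundJavaScript(timeoutMillis)`, which waits for pending script jobs rather than for a specific element.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: polls a probe until a condition holds or a timeout elapses.
final class WaitUntil {

    /**
     * Calls {@code probe} repeatedly, sleeping {@code interval} between attempts,
     * and returns the first non-null result accepted by {@code ready}.
     * Returns an empty Optional on timeout or if the thread is interrupted.
     */
    static <T> Optional<T> poll(Supplier<T> probe, Predicate<T> ready,
                                Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null && ready.test(value)) {
                return Optional.of(value);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // timed out without a match
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page whose markup only appears after a few attempts.
        AtomicInteger attempts = new AtomicInteger();
        Supplier<String> fakePage = () ->
                attempts.incrementAndGet() >= 3 ? "<div id=\"content\">ready</div>" : "<div></div>";

        Optional<String> html = poll(fakePage,
                markup -> markup.contains("id=\"content\""),
                Duration.ofSeconds(2), Duration.ofMillis(10));

        System.out.println(html.isPresent() ? "found after " + attempts.get() + " attempts" : "timed out");
    }
}
```

The timeout keeps a page that never reaches the expected state from blocking the calling coroutine or thread indefinitely.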

rbri commented 2 years ago

I guess this API requirement has something to do with changes in Rhino.

yusufceylan commented 2 years ago

Any news about this issue? I got this error while using BrowserFetcher:

java.lang.NoSuchFieldError: No static field INSTANCE of type Lorg/apache/http/conn/ssl/AllowAllHostnameVerifier; in class Lorg/apache/http/conn/ssl/AllowAllHostnameVerifier; or its superclasses (declaration of 'org.apache.http.conn.ssl.AllowAllHostnameVerifier' appears in /system/framework/framework.jar!classes3.dex)

christian-draeger commented 2 years ago

Are you using the latest version of skrapeit (1.2.1)?

It recently fixed the issue for other people, e.g. here https://github.com/skrapeit/skrape.it/issues/185#issuecomment-1145827982

yusufceylan commented 2 years ago

Are you using the latest version of skrapeit (1.2.1)? It recently fixed the issue for other people, e.g. here #185 (comment)

I just tried with 1.2.1 and got this error:

Execution failed for task ':app:mergeDebugJavaResource'.
> A failure occurred while executing com.android.build.gradle.internal.tasks.MergeJavaResWorkAction
   > 2 files found with path 'mozilla/public-suffix-list.txt' from inputs:
      - /Users/yusuf/.gradle/caches/transforms-3/f245c43f9945c78889e5173b03033420/transformed/jetified-htmlunit-android-2.58.0.jar
      - /Users/yusuf/.gradle/caches/transforms-3/96071c01f90e37e991c04a7f8de1ffc4/transformed/jetified-httpclient-4.5.6.jar
     Adding a packagingOptions block may help, please refer to
     https://google.github.io/android-gradle-dsl/current/com.android.build.gradle.internal.dsl.PackagingOptions.html
     for more information

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

Adding packagingOptions and running Invalidate Caches / Restart did not help.

There is also a Stack Overflow question about this, but it has no answer yet.

Adding the public-suffix file to packagingOptions solved the compilation error, but this time I got the same error as @kazemcodes:

E/AndroidRuntime:     at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:749)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)
        Suppressed: kotlinx.coroutines.DiagnosticCoroutineContextException: [StandaloneCoroutine{Cancelling}@7073866, Dispatchers.Main.immediate]
    Caused by: java.lang.ClassNotFoundException: Didn't find class "java.awt.datatransfer.ClipboardOwner" on path: DexPathList[[zip file "/data/app/com.project.skrapeplayground-1r7tHui0F1lYklkZgTKUlQ==/base.apk"],nativeLibraryDirectories=[/data/app/com.project.skrapeplayground-1r7tHui0F1lYklkZgTKUlQ==/lib/arm64, /system/lib64, /hw_product/lib64, /system/product/lib64]]
        at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:209)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:379)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:312)
            ... 49 more
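
The `packagingOptions` entry mentioned above is not shown in the thread. A minimal sketch of what it might look like in a Groovy `build.gradle`, assuming Android Gradle Plugin 7.x and that keeping the first copy of the duplicated file is acceptable (excluding the file instead is the other common choice):

```groovy
android {
    packagingOptions {
        resources {
            // Assumption: both jars ship an equivalent list, so keep whichever copy is found first.
            pickFirsts += 'mozilla/public-suffix-list.txt'
        }
    }
}
```

This only addresses the duplicate-resource build failure. As reported above, the ClipboardOwner runtime error is a separate problem in the HtmlUnit dependency.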

kazemcodes commented 2 years ago

Recently, htmlunit-android released a new snapshot that fixes this problem, but right now it requires at least Android O as the minimum API level.

kazemcodes commented 2 years ago

Please update htmlunit to the latest snapshot; this problem is fixed there:

net.sourceforge.htmlunit:htmlunit-android:2.63.0-SNAPSHOT

rbri commented 1 year ago

htmlunit-android:2.63.0 was released some days ago (https://twitter.com/htmlunit)

christian-draeger commented 1 year ago

Big thx for the great work @rbri. I will bump the version in BrowserFetcher and release a new version of skrape.it.

rbri commented 1 year ago

it's a pleasure

christian-draeger commented 1 year ago

skrapeit patch version 1.2.2 has just been published to maven central