gildas-lormeau / single-file-core


Too many requests, Inconsistent Html generated #2

Closed anilabhadatta closed 4 months ago

anilabhadatta commented 7 months ago

@gildas-lormeau

Describe the bug: When saving a page using the extension or via the console, too many requests are sent, and the generated HTML is inconsistent.

To Reproduce: I have created a scraper for educative.io and integrated your single-file library, following https://github.com/gildas-lormeau/SingleFile/wiki/How-to-integrate-SingleFile-library-code-in-%22custom%22-environments%3F Some pages contain a very large number of images. I have attached a zip (Downloads.zip) containing the HTML saved using the extension and via the console.

I have noticed that, while scraping, SingleFile makes too many requests even though the page has already loaded properly. This triggers captchas, causes the scraper to fail, and produces inconsistent HTML: images that display properly in the browser are missing from the saved HTML. I have tried both DEFAULT_CONFIG and the config from your wiki; both produce the same problem.

Please look into this. If possible, the HTML should be saved from the browser without making any further requests, and the images should be saved properly as well.

gildas-lormeau commented 7 months ago

I guess the server is not happy with all the simultaneous requests. You could circumvent this issue by providing your own fetch implementation.

For this, you could include for example the code below just before calling singlefile.getPageData().

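function getSequentialFetch(delay = 0) {
    // Promises of the requests issued so far, in call order.
    let pendingRequests = [];
    return async (...args) => {
        pendingRequests.push((async () => {
            // This body runs synchronously up to the first await, i.e. before
            // push() is applied, so the last entry is the previous request.
            if (pendingRequests.length) {
                try {
                    await pendingRequests[pendingRequests.length - 1];
                } catch (error) {
                    // A failed previous request must not fail this one.
                }
            }
            // Wait `delay` milliseconds before issuing the next request.
            await new Promise(resolve => setTimeout(resolve, delay));
            return fetch(...args);
        })());
        return pendingRequests[pendingRequests.length - 1];
    };
}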

Then you need to pass the custom fetch implementation to singlefile.getPageData() by replacing this code with the code below.

const { content, title, filename } = await singlefile.getPageData({
    removeImports: true,
    removeScripts: true,
    removeAudioSrc: true,
    removeVideoSrc: true,
    removeHiddenElements: true,
    removeUnusedStyles: true,
    removeUnusedFonts: true,
    compressHTML: true,
    blockVideos: true,
    blockScripts: true,
    networkTimeout: 60000
}, { fetch: getSequentialFetch(1000) });

anilabhadatta commented 7 months ago

@gildas-lormeau I will test this. I am currently implementing ucdriver (almost done).

anilabhadatta commented 7 months ago

@gildas-lormeau I tested your solution, but it didn't work. The process I followed in DevTools:

  1. Loaded the URL in browser
  2. Opened up all the slides using this js code
            svgs.forEach(svg => {
                var button = svg.parentNode;
                if (button.disabled === false) {
                    button.click();
                    button.disabled = true;
                }
            });
  3. Injected the single-file.js and other files as stated in your wiki, ref: https://github.com/anilabhadatta/educative.io_scraper/blob/e0e9d988d1a768b862e1c187837b0628d1d1a51b/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py#L54
  4. Executed the getSequentialFetch() block
  5. Initiated the getPageData execution.

This process stopped the extra requests that were happening previously. Also, the generated HTML was the same in both cases. The page displays properly in the browser, but SingleFile was not able to capture the images correctly.

anilabhadatta commented 7 months ago

@gildas-lormeau Tested again: the requests are happening, but they take 1 second each. However, this causes another issue: SingleFile generates the content before all the requests are complete, which again leads to an inconsistent HTML page. Video reference (edit: I changed the delay to 100 ms): DevTools - www.educative.io_courses_grokking-dynamic-programming-a-deep-dive-using-python_longest-common-subsequence_showContent=true 2024-02-25 16-42-04.mp4.zip

My question is: if the page is already loaded properly in the browser, why does SingleFile need to make the requests again? Is it not possible to get the image data from the browser cache?

gildas-lormeau commented 7 months ago

SingleFile needs to make requests because there's no alternative as far as I know. Note that if the website and the browser are properly configured, these external resources should come from the cache.

anilabhadatta commented 7 months ago

@gildas-lormeau I retested my scraper using the default settings and an older commit from July 23. The functionality is the same, but I made some changes to the scraper itself: I now set the browser height to the full page length and scroll the page 2-3 times with a 0.5-second pause. This made the scraper slower, but the scraping now works correctly to some extent. I hit an issue once, but reliability did increase and the image errors haven't recurred so far (I still need to check all the files, but I think the pictures were loaded properly in some HTML pages). I have also implemented ucdriver, which bypasses Cloudflare to some extent. This doesn't mean that SingleFile has stopped making requests, however: they still happen, but because of the long delay the servers tolerate them to some degree.
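
The resize-and-scroll routine described above could look roughly like the sketch below. It assumes `driver` is a Selenium WebDriver; the function name `expand_and_scroll`, the default pass count, and the pause length are illustrative choices, not code taken from the scraper.

```python
import time


def expand_and_scroll(driver, passes=3, pause=0.5, sleep=time.sleep):
    """Grow the window to the full page height, then scroll down and up.

    `driver` is assumed to be a Selenium WebDriver; only execute_script,
    get_window_size and set_window_size are used.
    """
    # Full height of the rendered document, in CSS pixels.
    height = driver.execute_script(
        "return Math.max(document.body.scrollHeight,"
        " document.documentElement.scrollHeight);"
    )
    width = driver.get_window_size()["width"]
    driver.set_window_size(width, height)

    for _ in range(passes):
        # Scroll to the bottom so lazy-loaded images start loading...
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)
        # ...then back to the top before the next pass.
        driver.execute_script("window.scrollTo(0, 0);")
        sleep(pause)
    return height
```

The `sleep` parameter exists only so the pauses can be replaced in a dry run; in a real scraper the default `time.sleep` would be used.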

anilabhadatta commented 7 months ago

@gildas-lormeau I think there is, however, a workaround for images only.

// Assuming you have an element with the id "yourImageId"
var imgElement = document.getElementById('yourImageId');

// Create a canvas element
var canvas = document.createElement('canvas');
var context = canvas.getContext('2d');

// Set the canvas size to match the image size
canvas.width = imgElement.width;
canvas.height = imgElement.height;

// Draw the image onto the canvas
context.drawImage(imgElement, 0, 0, imgElement.width, imgElement.height);

// Get the base64-encoded data from the canvas
var imageData = canvas.toDataURL('image/png');

// Log or use the base64 data as needed
console.log(imageData);

Using this approach, base64 data can be generated for the various kinds of image elements (img, image, etc.). It might add some processing time, but that is better than making continuous requests. I also think that, even when the browser loads elements from the cache, SingleFile still makes an HTTP request, probably to convert the file into base64, so the image-data method above might help a bit. Can you confirm whether my intuition is correct, and whether this solution would produce accurate results? If so, would it be possible to implement it as a secondary method of saving images?

anilabhadatta commented 5 months ago

@gildas-lormeau Hi, I have found an issue, although I am not sure whether I am making a mistake. As explained in the wiki page below, I inject the JS into the respective locations and try to get the HTML. For some reason, the content inside iframes does not show up in the generated HTML. It used to be generated, but now it is missing, whereas the extension still saves it correctly. https://github.com/gildas-lormeau/SingleFile/wiki/How-to-integrate-SingleFile-library-code-in-%22custom%22-environments%3F

https://github.com/anilabhadatta/educative.io_scraper/blob/v3-master/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py line 54 and 141

Educative_ Interactive Courses for Software Developers (4_11_2024 12_53_14 AM).zip: HTML saved using the extension

file (1).zip: HTML saved using the JS command

gildas-lormeau commented 5 months ago

@anilabhadatta I'm sorry I didn't get back to you sooner; it wasn't intentional. I think the problem is that iframe contents are blocked via the Same Origin Policy. To get around this, the most reliable solution is to use the Chrome DevTools Protocol API, see https://www.selenium.dev/documentation/webdriver/bidirectional/chrome_devtools/cdp_api/ and https://chromedevtools.github.io/devtools-protocol/. For the record, you're already using it here: https://github.com/anilabhadatta/educative.io_scraper/blob/9e246b342cf3460cdbbc4ec8520c48228ba55eef/src/Utility/BrowserUtility.py#L130-L136 in your project (but directly with the websocket API instead of using Selenium). You'd need to call Page.addScriptToEvaluateOnNewDocument to inject the SingleFile scripts into all frames and the main window, then use Runtime.evaluate to evaluate the script and retrieve the result. This is what is done here in the current version of single-file-cli: https://github.com/gildas-lormeau/single-file-cli/blob/bd7c228373660394c99a24da26bb10489229a6b3/lib/cdp-client.js#L96-L113.
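
As a rough illustration of the two CDP calls named above, here is a minimal Python sketch. It is not the single-file-cli implementation: `capture_with_cdp`, `send_command`, and `navigate` are hypothetical names, and `bundle_source` is assumed to be JavaScript that defines a global `singlefile` object. The script is registered before navigation because Page.addScriptToEvaluateOnNewDocument applies to documents created afterwards.

```python
import json


def capture_with_cdp(send_command, navigate, bundle_source, url, options=None):
    """Register the SingleFile bundle via CDP, load `url`, return the response.

    send_command(method, params) is a hypothetical callable that sends one
    CDP command and returns its result dict (for example a thin wrapper
    around Selenium's driver.execute_cdp_cmd). navigate(url) loads the page.
    """
    send_command("Page.enable", {})
    # Registered scripts are evaluated in every frame created afterwards,
    # so this has to happen before the page is requested.
    send_command("Page.addScriptToEvaluateOnNewDocument", {"source": bundle_source})
    navigate(url)
    # json.dumps turns the Python options dict into a JavaScript object literal.
    expression = "singlefile.getPageData(%s)" % json.dumps(options or {})
    # awaitPromise waits for the promise returned by getPageData();
    # returnByValue serializes the resolved object into the response.
    return send_command(
        "Runtime.evaluate",
        {"expression": expression, "awaitPromise": True, "returnByValue": True},
    )
```

With Selenium and a Chromium-based driver, one plausible call would be `capture_with_cdp(driver.execute_cdp_cmd, driver.get, source, url, {"removeScripts": True})`.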

anilabhadatta commented 5 months ago

@gildas-lormeau I need to go through it first and will get back to you tomorrow. I am not sure that I am being blocked by CORS, though: when I inject the scripts, I don't get any errors and the script tags are created correctly inside the iframes. I have also disabled the security policies when launching the browser. I also noticed that there is an iframe within an iframe here; I injected the files into all of the iframes but still didn't get any result.

I will check your solution and see if that can be fixed using CDP

gildas-lormeau commented 5 months ago

Indeed, if you see the scripts then it means the SOP is not blocking you. Maybe calling injectScriptToHTML(scriptTag, location) recursively on the document of iframes (by calling it after line 64 in https://github.com/anilabhadatta/educative.io_scraper/blob/v3-master/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py) could suffice.
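
The suggestion above is about recursing inside the injected JavaScript. An alternative way to get the same effect, sketched here under assumptions, is to drive the recursion from Selenium by switching into each frame in turn. `inject_into_all_frames` is a hypothetical helper, not part of the scraper or of SingleFile, and `driver` is assumed to be a Selenium WebDriver.

```python
def inject_into_all_frames(driver, script):
    """Run `script` in the current document and, recursively, in each frame.

    Returns the number of documents the script was executed in.
    """
    driver.execute_script(script)
    count = 1
    # "css selector" is the locator strategy behind Selenium's By.CSS_SELECTOR.
    for frame in driver.find_elements("css selector", "iframe, frame"):
        driver.switch_to.frame(frame)
        try:
            # Nested frames are handled by the recursive call.
            count += inject_into_all_frames(driver, script)
        finally:
            # Step back out so sibling frames are resolved in the parent.
            driver.switch_to.parent_frame()
    return count
```

Frames that are added to the page after this call are not covered, so it would need to run once the page has finished building its frame tree.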

anilabhadatta commented 5 months ago

@gildas-lormeau https://github.com/anilabhadatta/educative.io_scraper/blob/9e246b342cf3460cdbbc4ec8520c48228ba55eef/src/ScraperType/CourseTopicScraper/ScraperModules/ScreenshotUtility.py

I have used CDP to get screenshots. I will look into the CDP solution, but what should the source URL be there?

anilabhadatta commented 5 months ago

Quoting gildas-lormeau: Indeed, if you see the scripts then it means the SOP is not blocking you. Maybe calling injectScriptToHTML(scriptTag, location) recursively (by calling it after line 64 on the document of iframes in https://github.com/anilabhadatta/educative.io_scraper/blob/v3-master/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py) could suffice.

I kind of did that and now it is working😂


So basically Educative changed its iframe logic, which caused those failures😢 I would like to know: if I implement the CDP solution, will it add the scripts automatically and recursively to any sub-iframes?

gildas-lormeau commented 5 months ago

This is great news! I can confirm that by using the CDP APIs I mentioned, this will inject the script in all the frames. The advantage is that you don't have to code anything for this.

anilabhadatta commented 5 months ago

@gildas-lormeau Hi, I was trying to implement the CDP solution but have not been able to. Can you please tell me if I am missing something?


            self.browser.get("https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
            time.sleep(10)
            print("loaded page")

            self.seleniumBasicUtils = SeleniumBasicUtility(configJson)
            self.seleniumBasicUtils.browser = self.browser
            with open(constants.singleFileBundlePath, "r") as file:
                jsCode = file.read()
            self.browser.execute_script(jsCode)
            script = self.browser.execute_script(f"{jsCode} return script;")
            hookScript = self.browser.execute_script(f"{jsCode} return hookScript;")
            print(script == hookScript, len(script), len(hookScript))
            self.seleniumBasicUtils.sendCommand('Page.enable', {})
            params1 = {"enabled": True}
            params2 = {"ignore": True}
            params3 = {
                "source": hookScript,
                "runImmediately": True,
            }
            params4 = {
                "source": script,
                "runImmediately": True,
                "worldName": "SINGLE_FILE_WORLD_NAME"
            }
            params5 = {
                "expression": """singlefile.getPageData({
                                removeImports: true,
                                removeScripts: true,
                                removeAudioSrc: true,
                                removeVideoSrc: true,
                                removeHiddenElements: true,
                                removeUnusedStyles: true,
                                removeUnusedFonts: true,
                                compressHTML: true,
                                blockVideos: true,
                                blockScripts: true,
                                networkTimeout: 60000
                            })""",
                "awaitPromise": True,
                "returnByValue": True
            }

            res1 = self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            res2 = self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            res3 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            res4 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
            res5 = self.seleniumBasicUtils.sendCommand("Runtime.evaluate", params5)
            print("completed")
            print(res1, res2, res3, res4, res5)

output--------

loaded page
False 854257 9459
completed
{}
{}
{'identifier': '2'}
{'identifier': '3'}
{'exceptionDetails': {'columnNumber': 0, 'exception': {'className': 'ReferenceError', 'description': 'ReferenceError: singlefile is not defined\n    at <anonymous>:1:1', 'objectId': '8274231661729392625.1.46', 'subtype': 'error', 'type': 'object'}, 'exceptionId': 42, 'lineNumber': 0, 'scriptId': '230', 'stackTrace': {'callFrames': [{'columnNumber': 0, 'functionName': '', 'lineNumber': 0, 'scriptId': '230', 'url': ''}]}, 'text': 'Uncaught'}, 'result': {'className': 'ReferenceError', 'description': 'ReferenceError: singlefile is not defined\n    at <anonymous>:1:1', 'objectId': '8274231661729392625.1.45', 'subtype': 'error', 'type': 'object'}}
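
The output above shows that a failed Runtime.evaluate call still returns normally, with the JavaScript error reported under `exceptionDetails`. A small helper along the following lines could surface such failures as Python exceptions. `EvaluateError` and `unwrap_evaluate_result` are hypothetical names introduced for this sketch.

```python
class EvaluateError(RuntimeError):
    """Raised when a Runtime.evaluate response reports a JavaScript exception."""


def unwrap_evaluate_result(response):
    """Return the evaluated value, or raise if the script threw."""
    details = response.get("exceptionDetails")
    if details is not None:
        exception = details.get("exception", {})
        # Prefer the JS error description, fall back to the short text.
        message = exception.get("description") or details.get("text", "unknown error")
        raise EvaluateError(message)
    # With returnByValue the serialized value is stored under result.value.
    return response.get("result", {}).get("value")
```

Applied to the response printed above, this would raise an `EvaluateError` whose message starts with `ReferenceError: singlefile is not defined`.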

anilabhadatta commented 5 months ago

import asyncio
import time

from src.Logging.Logger import Logger
from src.ScraperType.CourseTopicScraper.ScraperModules.SeleniumBasicUtility import SeleniumBasicUtility
from src.Utility.BrowserUtility import BrowserUtility
from src.Common.Constants import constants

class LoginAccount:
    def __init__(self, configJson=None):
        self.browserUtil = None
        self.logger = None
        if configJson:
            self.logger = Logger(configJson, "LoginAccount").logger
        self.configJson = configJson
        self.browser = None

    def start(self, configJson):
        self.configJson = configJson
        self.configJson['headless'] = False
        self.browserUtil = BrowserUtility(self.configJson)
        self.logger = Logger(self.configJson, "LoginAccount").logger
        self.logger.info("""LoginAccount initiated...
                            Login your account in the browser...
                            To Terminate, Click on Logout Button
                         """)
        try:
            self.browser = self.browserUtil.loadBrowser()
            self.browser.get("about:blank")
            time.sleep(2)

            self.seleniumBasicUtils = SeleniumBasicUtility(configJson)
            self.seleniumBasicUtils.browser = self.browser
            with open(constants.singleFileBundlePath, "r") as file:
                jsCode = file.read()
            self.browser.execute_script(jsCode)
            script = self.browser.execute_script(f"{jsCode} return script;")
            hookScript = self.browser.execute_script(f"{jsCode} return hookScript;")
            print(script == hookScript, len(script), len(hookScript))
            self.seleniumBasicUtils.sendCommand('Page.enable', {})
            params1 = {"enabled": True}
            params2 = {"ignore": True}
            params3 = {
                "source": hookScript,
                "runImmediately": True,
            }
            params4 = {
                "source": script,
                "runImmediately": True,
                "worldName": "SINGLE_FILE_WORLD_NAME"
            }
            params5 = {
                "expression": """singlefile.getPageData({
                                removeImports: true,
                                removeScripts: true,
                                removeAudioSrc: true,
                                removeVideoSrc: true,
                                removeHiddenElements: true,
                                removeUnusedStyles: true,
                                removeUnusedFonts: true,
                                compressHTML: true,
                                blockVideos: true,
                                blockScripts: true,
                                networkTimeout: 60000
                            })""",
                "awaitPromise": True,
                "returnByValue": True
            }

            injectImportantScriptsJsScript = """
                        function injectScriptToHTML(scriptTag, doc = document) {
                            var targetElement = doc.body || doc.documentElement;
                            targetElement.appendChild(scriptTag.cloneNode(true));
                            var frames = doc.querySelectorAll("frame, iframe");
                            frames.forEach(frame => {
                                var frameDocument = frame.contentDocument || frame.contentWindow.document;
                                if (frameDocument) {
                                    injectScriptToHTML(scriptTag, frameDocument);
                                }
                            });
                        }

                        function createScriptTagFromURL(url) {
                            return fetch(url)
                                .then(response => response.text())
                                .then(data => {
                                    var scriptElement = document.createElement('script');
                                    scriptElement.type = 'text/javascript';
                                    scriptElement.textContent = data;
                                    return scriptElement;
                                })
                                .catch(error => {
                                    console.error('Error loading script:', error);
                                    return null;
                                });
                        }
                        window.__define = window.define;
                        window.__require = window.require;
                        window.define = undefined;
                        window.require = undefined;
                        var baseurl = 'https://anilabhadatta.github.io/SingleFile/';
                        var urls = [
                        'lib/single-file-bootstrap.js',
                        'lib/single-file-hooks-frames.js',
                        'lib/single-file-frames.js',
                        'lib/single-file.js'
                        ];
                        var fullUrls = urls.map(url => baseurl + url);

                        for(let i=0; i< fullUrls.length; i++){
                            createScriptTagFromURL(fullUrls[i])
                                .then(scriptTag => {
                                    if (scriptTag) {
                                        injectScriptToHTML(scriptTag);
                                    }
                                });
                        }
                        """

            paramstest = {
                "source": injectImportantScriptsJsScript,
                "runImmediately": True
            }

            res1 = self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            res2 = self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            res3 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            res4 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
            self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", paramstest)
            self.browser.get(
                "https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
            time.sleep(100)
            print("loaded page")
            res5 = self.seleniumBasicUtils.sendCommand("Runtime.evaluate", params5)
            print("completed")
            print(res1, res2, res3, res4, res5)
            time.sleep(100)
            # self.browser.get(
            #     "https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
            # self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            # self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            # self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
            # self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            print("done")
            while True:
                pass
        except KeyboardInterrupt:
            self.logger.error("Keyboard Interrupt")
        except Exception as e:
            lineNumber = e.__traceback__.tb_lineno
            self.logger.error(f"start: {lineNumber}: {e}")
        finally:
            self.logger.debug("Exiting...")
            asyncio.get_event_loop().run_until_complete(self.browserUtil.shutdownChromeViaWebsocket())

    def checkIfLoggedIn(self):
        self.logger = Logger(self.configJson, "LoginAccount").logger
        self.logger.info("Checking if logged in...")
        isLoggedIn = bool(self.browser.execute_script(
            '''return document.cookie.includes('logged_in')'''))
        if not isLoggedIn:
            raise Exception("Login to your account in the browser...")

I kind of mixed up the code. It seems to be working, but I need to refactor it. What I understood is that I need to set up CDP before requesting a page, so that the SingleFile scripts are injected automatically.

anilabhadatta commented 5 months ago

@gildas-lormeau I could not implement the CDP injection correctly using the above code. There are some issues I can't understand. The iframes probably load a few seconds later, so injecting into the iframes with the above solution does not work correctly when the page is requested multiple times.

I tried to understand the code in single-file-cli project. Tried to implement it as well but I am not sure if I am doing it correctly.

            script = self.browser.execute_script(f"{singleFileJs} return script;")
            hookScript = self.browser.execute_script(f"{singleFileJs} return hookScript;")
            params1 = {"enabled": True}
            params2 = {"ignore": True}
            params3 = {
                "source": hookScript,
                "runImmediately": True
            }
            script += """(function initSingleFile() { singlefile.init({ fetch: (url, options) => { return new Promise(function (resolve, reject) { const xhrRequest = new XMLHttpRequest(); xhrRequest.withCredentials = true; xhrRequest.responseType = "arraybuffer"; xhrRequest.onerror = event => reject(new Error(event.detail)); xhrRequest.onabort = () => reject(new Error("aborted")); xhrRequest.onreadystatechange = () => { if (xhrRequest.readyState == XMLHttpRequest.DONE) { resolve({ arrayBuffer: async () => xhrRequest.response || new ArrayBuffer(), headers: { get: headerName => xhrRequest.getResponseHeader(headerName) }, status: xhrRequest.status }); } }; xhrRequest.open("GET", url, true); if (options.headers) { for (const entry of Object.entries(options.headers)) { xhrRequest.setRequestHeader(entry[0], entry[1]); } } xhrRequest.send(); }); } }); })();"""
            params4 = {
                "source": script,
                "runImmediately": True,
                "worldName": "SINGLE_FILE_WORLD_NAME"
            }
            self.seleniumBasicUtils.sendCommand('Page.enable', {})
            self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)

When I try to get the page data, it fails: singlefile is not defined.

gildas-lormeau commented 5 months ago

It's probably due to the fact that you are trying to run the script in an isolated world. Removing "worldName": "SINGLE_FILE_WORLD_NAME" might fix the issue.

anilabhadatta commented 5 months ago

@gildas-lormeau What I saw is that the Selenium browser loads the iframes the first time a URL is loaded in a tab, but after requesting the same URL again, the iframes do not load properly, which causes the issue. So the workaround was to create a new tab and request the URL there, which removes the iframe issue.
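
The new-tab workaround could be wrapped roughly as follows. This is a sketch, assuming a Selenium 4 WebDriver (`switch_to.new_window` was introduced in Selenium 4); `capture_in_new_tab` and the `capture` callback are hypothetical names, with the callback standing in for whatever registers the scripts, loads the URL, and returns the page data.

```python
def capture_in_new_tab(driver, url, capture):
    """Open a fresh tab, run `capture(driver, url)` in it, then close the tab.

    `driver` is assumed to be a Selenium 4 WebDriver. `capture` is a
    hypothetical callback that registers the scripts, loads `url` and
    returns the page data.
    """
    original = driver.current_window_handle
    # Selenium 4: opens a new tab and switches the driver to it.
    driver.switch_to.new_window("tab")
    try:
        return capture(driver, url)
    finally:
        # Close the temporary tab and return to the original one.
        driver.close()
        driver.switch_to.window(original)
```

Because the tab is created before `capture` runs, any script registration done inside the callback happens in the new tab, which matches the order described in the comment above.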

anilabhadatta commented 5 months ago

@gildas-lormeau Hi, can you please explain what this does? Is it important to implement it in my scraper as well?

source += await readScriptFiles(options && options.browserScripts ? options.browserScripts : []);
if (options.browserStylesheets && options.browserStylesheets.length) {
    source += "addEventListener(\"load\",()=>{const styleElement=document.createElement(\"style\");styleElement.textContent=" + JSON.stringify(await readScriptFiles(options.browserStylesheets)) + ";document.body.appendChild(styleElement);});";
}

3.7.4: https://github.com/anilabhadatta/educative.io_scraper/commit/c77089b0bbf4a340a084c6ea44ac8448398196af 3.7.3: https://github.com/anilabhadatta/educative.io_scraper/commit/c0a102dfb2b5f154799b9e66043473d3a122b42a

I have recently committed code that basically uses the logic from my previous comment. It generates the SingleFile data correctly. The only issue I was facing is now resolved by creating new tabs before injecting the scripts through CDP. But I would like to know whether the code above is also required, and what I should do about it.

Thanks in advance

gildas-lormeau commented 4 months ago

Sorry for the late reply. The code you cited allows the user to inject custom scripts and/or stylesheets. This code is not required.

anilabhadatta commented 4 months ago

@gildas-lormeau ohh okay, thanks for your support