Closed anilabhadatta closed 4 months ago
I guess the server is not happy with all the simultaneous requests. You could circumvent this issue by providing your own fetch implementation.
For this, you could include, for example, the code below just before calling singlefile.getPageData().
function getSequentialFetch(delay = 0) {
    let pendingRequests = [];
    return async (...args) => {
        pendingRequests.push((async () => {
            if (pendingRequests.length) {
                await pendingRequests[pendingRequests.length - 1];
            }
            await new Promise(resolve => setTimeout(resolve, delay));
            return fetch(...args);
        })());
        return pendingRequests[pendingRequests.length - 1];
    };
}
Then you need to pass the custom fetch implementation to singlefile.getPageData() by replacing this code with the code below.
const { content, title, filename } = await singlefile.getPageData({
    removeImports: true,
    removeScripts: true,
    removeAudioSrc: true,
    removeVideoSrc: true,
    removeHiddenElements: true,
    removeUnusedStyles: true,
    removeUnusedFonts: true,
    compressHTML: true,
    blockVideos: true,
    blockScripts: true,
    networkTimeout: 60000
}, { fetch: getSequentialFetch(1000) });
@gildas-lormeau I will test this, currently implementing the ucdriver(almost done).
@gildas-lormeau I tested your solution and it didn't work. The process I followed in DevTools:
svgs.forEach(svg => {
    var button = svg.parentNode;
    if (button.disabled === false) {
        button.click();
        button.disabled = true;
    }
});
@gildas-lormeau Tested again: the requests are now happening one at a time, 1 second apart. However, this causes another issue: SingleFile has already generated the content before all the requests complete, which again leads to an inconsistent HTML page. Video ref (edit: I changed the delay to 100 ms): DevTools - www.educative.io_courses_grokking-dynamic-programming-a-deep-dive-using-python_longest-common-subsequence_showContent=true 2024-02-25 16-42-04.mp4.zip
My question is: if the page is already loaded properly in the browser, why does SingleFile need to make these requests again? Is it not possible to get the image data from the browser cache?
SingleFile needs to make requests because there is no alternative, as far as I know. Note that if the website and the browser are properly configured, these external resources should come from the cache.
@gildas-lormeau I retested my scraper using the default settings and an older commit from July 23. The functionality is the same, but I made some changes to the scraper itself: I now set the browser height to the full page length and scroll the page 2-3 times with a 0.5-second pause. This makes the scraper slower, but the scraping is working correctly to some extent. I hit an issue once, but reliability has improved and the image errors haven't occurred so far (I still need to check all the files, but the pictures appear to have loaded properly in the HTML pages I looked at). I have also implemented ucdriver, which bypasses Cloudflare to some extent. This doesn't mean SingleFile has stopped making requests, however. They still happen, but because of the large delay the servers accept most of them.
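The resize-and-scroll step described above could be sketched roughly as follows with Selenium. The function name `scroll_through_page`, the default width, the number of passes and the pause are illustrative assumptions, not taken from the scraper; only `execute_script` and `set_window_size` are standard WebDriver calls.

```python
import time


def scroll_through_page(driver, passes=3, pause=0.5, width=1920):
    """Resize the window to the page height, then scroll down and back up a few times.

    `driver` is assumed to be a Selenium WebDriver. The name and defaults are hypothetical.
    """
    # Ask the page for its full height so lazy-loaded content has room to render.
    height = driver.execute_script(
        "return Math.max(document.body.scrollHeight, document.documentElement.scrollHeight);"
    )
    driver.set_window_size(width, height)
    for _ in range(passes):
        # Scroll to the bottom, wait, then return to the top and wait again.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        driver.execute_script("window.scrollTo(0, 0);")
        time.sleep(pause)
    return height
```

Calling this before `singlefile.getPageData()` trades speed for a better chance that images have been requested by the page itself.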
@gildas-lormeau I think there is however a workaround only for images.
// Assuming you have an element with the id "yourImageId"
var imgElement = document.getElementById('yourImageId');
// Create a canvas element
var canvas = document.createElement('canvas');
var context = canvas.getContext('2d');
// Set the canvas size to match the image size
canvas.width = imgElement.width;
canvas.height = imgElement.height;
// Draw the image onto the canvas
context.drawImage(imgElement, 0, 0, imgElement.width, imgElement.height);
// Get the base64-encoded data from the canvas
var imageData = canvas.toDataURL('image/png');
// Log or use the base64 data as needed
console.log(imageData);
Using this, the base64 data can be generated for various kinds of image elements (img, image, etc.). It might add some processing time, but that seems better than making continuous requests. I also think that even if the browser is loading cached elements, SingleFile will still make an HTTP request, probably to convert the file into base64, so the image data creation method above might help a bit. Can you confirm whether my intuition is correct, and whether this solution would generate accurate results? If so, is it possible to implement a secondary method of saving images?
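For reference, the string returned by `canvas.toDataURL('image/png')` is a base64 data URI. A minimal Python sketch of the same string format, built from raw image bytes, is shown below. The helper name `to_data_uri` is hypothetical, and this only illustrates the encoding, not how SingleFile embeds images.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

One caveat with the canvas route: in the browser, `toDataURL` throws a security error when the canvas has been tainted by a cross-origin image loaded without CORS approval, so it may not cover every image on a page.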
@gildas-lormeau Hi, I have found an issue, although I am not sure whether I am making a mistake. As explained here, I inject the JS into the respective locations and try to get the HTML. But for some reason the data inside iframes does not show up in the generated HTML. It used to be generated previously but no longer is, whereas saving with the extension works correctly. https://github.com/gildas-lormeau/SingleFile/wiki/How-to-integrate-SingleFile-library-code-in-%22custom%22-environments%3F
https://github.com/anilabhadatta/educative.io_scraper/blob/v3-master/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py line 54 and 141
Educative_ Interactive Courses for Software Developers (4_11_2024 12_53_14 AM).zip HTML saved using extension
file (1).zip Html saved using js command
@anilabhadatta Sorry I didn't get back to you sooner; it wasn't intentional.
I think the problem is that iframe contents are blocked via the Same Origin Policy. To get around this, the most reliable solution is to use the Chrome DevTools Protocol API, see https://www.selenium.dev/documentation/webdriver/bidirectional/chrome_devtools/cdp_api/ and https://chromedevtools.github.io/devtools-protocol/. For the record, you're already using it here: https://github.com/anilabhadatta/educative.io_scraper/blob/9e246b342cf3460cdbbc4ec8520c48228ba55eef/src/Utility/BrowserUtility.py#L130-L136 in your project (but directly with the websocket API instead of using Selenium).
You'd need to call Page.addScriptToEvaluateOnNewDocument to inject SingleFile scripts into all frames and the main window. Then use Runtime.evaluate to evaluate the script retrieving the result. This is what is done in the current version of single-file-cli: https://github.com/gildas-lormeau/single-file-cli/blob/bd7c228373660394c99a24da26bb10489229a6b3/lib/cdp-client.js#L96-L113.
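The two-step flow described above could look roughly like this from Python. This is a sketch that assumes a Selenium Chromium driver, whose `execute_cdp_cmd` method sends raw CDP commands. The function name `capture_with_cdp` and its parameters are hypothetical; `hook_script` and `main_script` stand for the SingleFile script sources.

```python
def capture_with_cdp(driver, url, hook_script, main_script, expression):
    """Register scripts for every new document, load the page, then evaluate.

    `driver` is assumed to be a Selenium Chromium WebDriver (hypothetical helper).
    """
    driver.execute_cdp_cmd("Page.enable", {})
    # Scripts registered here run in each frame as it is created,
    # so they must be registered BEFORE the page is requested.
    for source in (hook_script, main_script):
        driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument", {"source": source}
        )
    driver.get(url)
    # awaitPromise waits for the async expression to settle;
    # returnByValue returns the result as plain JSON.
    return driver.execute_cdp_cmd(
        "Runtime.evaluate",
        {"expression": expression, "awaitPromise": True, "returnByValue": True},
    )
```

The key design point is the ordering: registration happens first, navigation second, evaluation last.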
@gildas-lormeau I need to go through it first and will get back to you tomorrow. However, I am not sure I am being blocked by CORS, because when I inject the scripts I don't get any errors and the script tags are created correctly inside the iframes. I have also disabled the security policies when loading the browser. I also noticed that there is an iframe within an iframe here; I injected the files into all of the iframes but still didn't get any result.
I will check your solution and see whether this can be fixed using CDP.
Indeed, if you see the scripts then it means the SOP is not blocking you. Maybe calling injectScriptToHTML(scriptTag, location) recursively on the document of iframes (by calling it after line 64 in https://github.com/anilabhadatta/educative.io_scraper/blob/v3-master/src/ScraperType/CourseTopicScraper/ScraperModules/SingleFileUtility.py) could suffice.
I have used CDP to get screenshots. I will check the CDP solution, but what should the source URL be there?
I kind of did that and now it is working😂
So basically Educative changed its iframe logic, which caused those failures😢 I would like to know: if I implement the CDP solution, will it recurse into any nested iframes and add the scripts automatically?
This is great news! I can confirm that by using the CDP APIs I mentioned, this will inject the script in all the frames. The advantage is that you don't have to code anything for this.
@gildas-lormeau Hi, I was trying to implement the CDP solution but have not been able to. Can you please tell me if I am missing something?
```python
self.browser.get("https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
time.sleep(10)
print("loaded page")
self.seleniumBasicUtils = SeleniumBasicUtility(configJson)
self.seleniumBasicUtils.browser = self.browser
with open(constants.singleFileBundlePath, "r") as file:
    jsCode = file.read()
self.browser.execute_script(jsCode)
script = self.browser.execute_script(f"{jsCode} return script;")
hookScript = self.browser.execute_script(f"{jsCode} return hookScript;")
print(script == hookScript, len(script), len(hookScript))
self.seleniumBasicUtils.sendCommand('Page.enable', {})
params1 = {"enabled": True}
params2 = {"ignore": True}
params3 = {
    "source": hookScript,
    "runImmediately": True,
}
params4 = {
    "source": script,
    "runImmediately": True,
    "worldName": "SINGLE_FILE_WORLD_NAME"
}
params5 = {
    "expression": """singlefile.getPageData({
        removeImports: true,
        removeScripts: true,
        removeAudioSrc: true,
        removeVideoSrc: true,
        removeHiddenElements: true,
        removeUnusedStyles: true,
        removeUnusedFonts: true,
        compressHTML: true,
        blockVideos: true,
        blockScripts: true,
        networkTimeout: 60000
    })""",
    "awaitPromise": True,
    "returnByValue": True
}
res1 = self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
res2 = self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
res3 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
res4 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
res5 = self.seleniumBasicUtils.sendCommand("Runtime.evaluate", params5)
print("completed")
print(res1, res2, res3, res4, res5)
```
Output:
loaded page
False 854257 9459
completed
{}
{}
{'identifier': '2'}
{'identifier': '3'}
{'exceptionDetails': {'columnNumber': 0, 'exception': {'className': 'ReferenceError', 'description': 'ReferenceError: singlefile is not defined\n at <anonymous>:1:1', 'objectId': '8274231661729392625.1.46', 'subtype': 'error', 'type': 'object'}, 'exceptionId': 42, 'lineNumber': 0, 'scriptId': '230', 'stackTrace': {'callFrames': [{'columnNumber': 0, 'functionName': '', 'lineNumber': 0, 'scriptId': '230', 'url': ''}]}, 'text': 'Uncaught'}, 'result': {'className': 'ReferenceError', 'description': 'ReferenceError: singlefile is not defined\n at <anonymous>:1:1', 'objectId': '8274231661729392625.1.45', 'subtype': 'error', 'type': 'object'}}
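A response like the one above can be checked in code before it is used. The sketch below relies only on the `exceptionDetails` and `result` keys visible in that output and on `returnByValue` placing the value under `result.value`. The helper name `unwrap_evaluate_result` is hypothetical.

```python
def unwrap_evaluate_result(response):
    """Return the evaluated value, or raise if Runtime.evaluate reported an exception.

    Hypothetical helper; `response` is the dict returned for a Runtime.evaluate command.
    """
    details = response.get("exceptionDetails")
    if details:
        exception = details.get("exception", {})
        # Prefer the full description, fall back to the short text ("Uncaught").
        message = exception.get("description") or details.get("text", "evaluation failed")
        raise RuntimeError(message)
    return response.get("result", {}).get("value")
```

With this in place, a `ReferenceError: singlefile is not defined` surfaces as a Python exception instead of being printed and ignored.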
import asyncio  # needed for the shutdown call in the finally block
import time

from src.Logging.Logger import Logger
from src.ScraperType.CourseTopicScraper.ScraperModules.SeleniumBasicUtility import SeleniumBasicUtility
from src.Utility.BrowserUtility import BrowserUtility
from src.Common.Constants import constants


class LoginAccount:
    def __init__(self, configJson=None):
        self.browserUtil = None
        self.logger = None
        if configJson:
            self.logger = Logger(configJson, "LoginAccount").logger
        self.configJson = configJson
        self.browser = None

    def start(self, configJson):
        self.configJson = configJson
        self.configJson['headless'] = False
        self.browserUtil = BrowserUtility(self.configJson)
        self.logger = Logger(self.configJson, "LoginAccount").logger
        self.logger.info("""LoginAccount initiated...
            Login your account in the browser...
            To Terminate, Click on Logout Button
            """)
        try:
            self.browser = self.browserUtil.loadBrowser()
            self.browser.get("about:blank")
            time.sleep(2)
            self.seleniumBasicUtils = SeleniumBasicUtility(configJson)
            self.seleniumBasicUtils.browser = self.browser
            with open(constants.singleFileBundlePath, "r") as file:
                jsCode = file.read()
            self.browser.execute_script(jsCode)
            script = self.browser.execute_script(f"{jsCode} return script;")
            hookScript = self.browser.execute_script(f"{jsCode} return hookScript;")
            print(script == hookScript, len(script), len(hookScript))
            self.seleniumBasicUtils.sendCommand('Page.enable', {})
            params1 = {"enabled": True}
            params2 = {"ignore": True}
            params3 = {
                "source": hookScript,
                "runImmediately": True,
            }
            params4 = {
                "source": script,
                "runImmediately": True,
                "worldName": "SINGLE_FILE_WORLD_NAME"
            }
            params5 = {
                "expression": """singlefile.getPageData({
                    removeImports: true,
                    removeScripts: true,
                    removeAudioSrc: true,
                    removeVideoSrc: true,
                    removeHiddenElements: true,
                    removeUnusedStyles: true,
                    removeUnusedFonts: true,
                    compressHTML: true,
                    blockVideos: true,
                    blockScripts: true,
                    networkTimeout: 60000
                })""",
                "awaitPromise": True,
                "returnByValue": True
            }
            injectImportantScriptsJsScript = """
            function injectScriptToHTML(scriptTag, doc = document) {
                var targetElement = doc.body || doc.documentElement;
                targetElement.appendChild(scriptTag.cloneNode(true));
                var frames = doc.querySelectorAll("frame, iframe");
                frames.forEach(frame => {
                    var frameDocument = frame.contentDocument || frame.contentWindow.document;
                    if (frameDocument) {
                        injectScriptToHTML(scriptTag, frameDocument);
                    }
                });
            }
            function createScriptTagFromURL(url) {
                return fetch(url)
                    .then(response => response.text())
                    .then(data => {
                        var scriptElement = document.createElement('script');
                        scriptElement.type = 'text/javascript';
                        scriptElement.textContent = data;
                        return scriptElement;
                    })
                    .catch(error => {
                        console.error('Error loading script:', error);
                        return null;
                    });
            }
            window.__define = window.define;
            window.__require = window.require;
            window.define = undefined;
            window.require = undefined;
            var baseurl = 'https://anilabhadatta.github.io/SingleFile/';
            var urls = [
                'lib/single-file-bootstrap.js',
                'lib/single-file-hooks-frames.js',
                'lib/single-file-frames.js',
                'lib/single-file.js'
            ];
            var fullUrls = urls.map(url => baseurl + url);
            for(let i=0; i< fullUrls.length; i++){
                createScriptTagFromURL(fullUrls[i])
                    .then(scriptTag => {
                        if (scriptTag) {
                            injectScriptToHTML(scriptTag);
                        }
                    });
            }
            """
            paramstest = {
                "source": injectImportantScriptsJsScript,
                "runImmediately": True
            }
            # Register everything before requesting the page, then evaluate after it has loaded.
            res1 = self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            res2 = self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            res3 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            res4 = self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
            self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", paramstest)
            self.browser.get(
                "https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
            time.sleep(100)
            print("loaded page")
            res5 = self.seleniumBasicUtils.sendCommand("Runtime.evaluate", params5)
            print("completed")
            print(res1, res2, res3, res4, res5)
            time.sleep(100)
            # self.browser.get(
            #     "https://www.educative.io/module/page/MjprXLCkmQNnQGAvK/10370001/5890664872148992/5175126441197568")
            # self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
            # self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
            # self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
            # self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
            print("done")
            while True:
                pass
        except KeyboardInterrupt:
            self.logger.error("Keyboard Interrupt")
        except Exception as e:
            lineNumber = e.__traceback__.tb_lineno
            self.logger.error(f"start: {lineNumber}: {e}")
        finally:
            self.logger.debug("Exiting...")
            asyncio.get_event_loop().run_until_complete(self.browserUtil.shutdownChromeViaWebsocket())

    def checkIfLoggedIn(self):
        self.logger = Logger(self.configJson, "LoginAccount").logger
        self.logger.info("Checking if logged in...")
        isLoggedIn = bool(self.browser.execute_script(
            '''return document.cookie.includes('logged_in')'''))
        if not isLoggedIn:
            raise Exception("Login to your account in the browser...")
I kind of mixed up the code. It seems to be working, but I need to refactor it. What I understood is that before requesting a page, I need to set up CDP so that the SingleFile scripts are injected automatically.
@gildas-lormeau I could not implement the CDP injection correctly using the above code. There are some issues I can't understand. The iframes are probably loaded a few seconds later, so the above approach to injecting into iframes does not work correctly if the page is requested multiple times.
I tried to understand the code in the single-file-cli project and tried to implement it as well, but I am not sure whether I am doing it correctly.
script = self.browser.execute_script(f"{singleFileJs} return script;")
hookScript = self.browser.execute_script(f"{singleFileJs} return hookScript;")
params1 = {"enabled": True}
params2 = {"ignore": True}
params3 = {
    "source": hookScript,
    "runImmediately": True
}
script += """(function initSingleFile() {
    singlefile.init({
        fetch: (url, options) => {
            return new Promise(function (resolve, reject) {
                const xhrRequest = new XMLHttpRequest();
                xhrRequest.withCredentials = true;
                xhrRequest.responseType = "arraybuffer";
                xhrRequest.onerror = event => reject(new Error(event.detail));
                xhrRequest.onabort = () => reject(new Error("aborted"));
                xhrRequest.onreadystatechange = () => {
                    if (xhrRequest.readyState == XMLHttpRequest.DONE) {
                        resolve({
                            arrayBuffer: async () => xhrRequest.response || new ArrayBuffer(),
                            headers: { get: headerName => xhrRequest.getResponseHeader(headerName) },
                            status: xhrRequest.status
                        });
                    }
                };
                xhrRequest.open("GET", url, true);
                if (options.headers) {
                    for (const entry of Object.entries(options.headers)) {
                        xhrRequest.setRequestHeader(entry[0], entry[1]);
                    }
                }
                xhrRequest.send();
            });
        }
    });
})();"""
params4 = {
    "source": script,
    "runImmediately": True,
    "worldName": "SINGLE_FILE_WORLD_NAME"
}
self.seleniumBasicUtils.sendCommand('Page.enable', {})
self.seleniumBasicUtils.sendCommand("Page.setBypassCSP", params1)
self.seleniumBasicUtils.sendCommand("Security.setIgnoreCertificateErrors", params2)
self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params3)
self.seleniumBasicUtils.sendCommand("Page.addScriptToEvaluateOnNewDocument", params4)
When I try to get the page data, it fails: singlefile is not defined.
It's probably due to the fact that you are trying to run the script in an isolated world. Removing "worldName": "SINGLE_FILE_WORLD_NAME" might fix the issue.
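Some background on why this matters: a script registered with a `worldName` is evaluated in a separate isolated world, while a `Runtime.evaluate` call without a `contextId` runs in the page's default context, where globals from the isolated world are not visible. A small sketch of building the registration parameters either way is shown below. The helper name `new_document_script_params` is hypothetical, and `runImmediately` is simply carried over from the snippets in this thread.

```python
def new_document_script_params(source, world_name=None, run_immediately=True):
    """Build params for Page.addScriptToEvaluateOnNewDocument (hypothetical helper).

    Leaving `world_name` unset registers the script in the main world, where a
    plain Runtime.evaluate call can see the globals it defines.
    """
    params = {"source": source, "runImmediately": run_immediately}
    if world_name:
        # Only request an isolated world when one is explicitly named.
        params["worldName"] = world_name
    return params
```

The alternative is to keep the isolated world and point `Runtime.evaluate` at it through its `contextId` parameter, which requires tracking the execution context created for that world.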
@gildas-lormeau What I saw is that the Selenium browser loads the iframes correctly the first time a URL is opened in a tab, but after requesting the same URL again the iframes do not load properly, which causes the issue. The workaround was to create a new tab and request the URL there, which avoids the iframe issue.
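The new-tab workaround could be sketched as follows, assuming Selenium 4, where `driver.switch_to.new_window("tab")` opens and focuses a new tab. The helper name `switch_to_fresh_tab` is hypothetical and is not taken from the scraper.

```python
def switch_to_fresh_tab(driver):
    """Open a new tab, close the previous one, and leave the driver on the new tab.

    `driver` is assumed to be a Selenium 4 WebDriver (hypothetical helper).
    """
    previous = driver.current_window_handle
    driver.switch_to.new_window("tab")
    fresh = driver.current_window_handle
    # Close the old tab so tabs do not accumulate across pages.
    driver.switch_to.window(previous)
    driver.close()
    driver.switch_to.window(fresh)
    return fresh
```

After switching, the CDP script registration would be repeated for the new tab before calling `driver.get(url)`.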
@gildas-lormeau Hi, can you please explain what this does? Is it important to implement it in my scraper as well?
source += await readScriptFiles(options && options.browserScripts ? options.browserScripts : []);
if (options.browserStylesheets && options.browserStylesheets.length) {
    source += "addEventListener(\"load\",()=>{const styleElement=document.createElement(\"style\");styleElement.textContent=" + JSON.stringify(await readScriptFiles(options.browserStylesheets)) + ";document.body.appendChild(styleElement);});";
}
3.7.4: https://github.com/anilabhadatta/educative.io_scraper/commit/c77089b0bbf4a340a084c6ea44ac8448398196af
3.7.3: https://github.com/anilabhadatta/educative.io_scraper/commit/c0a102dfb2b5f154799b9e66043473d3a122b42a
I recently committed code that uses the logic from my previous comment. It generates the SingleFile data correctly. The only issue I was facing is now resolved by creating a new tab before injecting the scripts through CDP. But I want to know whether the code above is also required and, if so, how I should approach it.
Thanks in advance
Sorry for the late reply, the code you cited allows the user to inject custom scripts and/or stylesheets. This code is not required.
@gildas-lormeau ohh okay, thanks for your support
@gildas-lormeau
Describe the bug: When saving a page using the extension or via the console, too many requests are sent and HTML creation is inconsistent.
To Reproduce: I have created a scraper that scrapes educative.io and integrated your single-file library into it (reference: https://github.com/gildas-lormeau/SingleFile/wiki/How-to-integrate-SingleFile-library-code-in-%22custom%22-environments%3F). Some pages have a huge number of images, e.g. Downloads.zip. I have attached a zip containing the HTML scraped using the extension and via the console.
I have noticed that while scraping, too many requests are made by single-file even though the page is loaded properly. This triggers captchas, makes the scraper fail, and also produces inconsistent HTML: the images show properly in the browser but not in the saved HTML. I have used the DEFAULT_CONFIG and the config from your wiki; both produce the same problem.
Screenshots
Environment
Please check the following: if possible, kindly save the HTML from the browser without making any further requests, and make sure the images are also saved properly.