gildas-lormeau / SingleFile

Web Extension for saving a faithful copy of a complete web page in a single HTML file
GNU Affero General Public License v3.0
15.78k stars 1.02k forks

Seeking help, guidance, suggestions from the wise, experienced, and powerful. =] #231

Closed david-littlefield closed 5 years ago

david-littlefield commented 5 years ago

Hi @gildas-lormeau ,

Update: I forked SingleFile, but I couldn't figure out where to modify it. So, I built an html file downloader with user login based on SingleFile using Apple's WebKit framework. It worked great for Twitter and Facebook. But, there's a layout issue on sites like airbnb.com. The css seems to have loaded, but the layout is broken - repeated images, large height and width gaps, and missing characters (square box).

Also, I recently found Apple's JavaScript Core framework, which can load and use JavaScript libraries in Swift. So, I'm exploring if there's a way to use SingleFile with JavaScript Core.

Question: SingleFile downloads airbnb.com perfectly, so I was wondering if you knew offhand what could be causing the problem?

Any help, guidance, suggestions would be very much appreciated. =]

gildas-lormeau commented 5 years ago

I'm not sure Swift is the best choice. If I were you, I would write a Puppeteer or WebDriver/Selenium script.

gildas-lormeau commented 5 years ago

@captaindavepdx Any comment? :p

david-littlefield commented 5 years ago

Hi @gildas-lormeau!

Sorry for the late reply, I've been immersed in JavaScript Core, and I lost track of the time, haha.

Comments: Thanks for the feedback, but I'm kind of stuck with Swift. Luckily, JavaScript Core can run pure JavaScript within Swift. It has limitations, but I've been working through them one at a time. If JavaScript Core didn't work, then I would've gone the Puppeteer route. =p

Update: Made some progress since then, but I'm still trying to resolve the broken layout on websites like airbnb.com - not sure what the actual problem is...

If you have additional suggestions, feedback, or direction, it'd be gladly welcomed! =]

Progress:

gildas-lormeau commented 5 years ago

@captaindavepdx. Did you try to inject this list of scripts in JavaScript Core? https://github.com/gildas-lormeau/SingleFile/blob/862d72073d6f35f8ffb60ed8ccd815cf12f0384e/cli/back-ends/puppeteer.js#L31-L51

And then execute this code ? https://github.com/gildas-lormeau/SingleFile/blob/862d72073d6f35f8ffb60ed8ccd815cf12f0384e/cli/back-ends/puppeteer.js#L104-L122

david-littlefield commented 5 years ago

Hi @gildas-lormeau!

Thanks for the suggestion, it helped me realize I needed to add a base url to the links in the css files.

I feel pretty good using JavaScript with Swift in a single js file, but I'm still figuring out how to do it with multiple files. I think I have to specify each function I want to use with a special Swift export function. There's not much documentation available on it, so I've been building the website downloader from scratch within a single js file.

By adding a base URL to the href and src links in the CSS file, I fixed a lot of missing font and icon issues. While it works great on websites like Instagram and Twitter, it doesn't do well on airbnb.com. It seems like there are many height and width values that are percentages, which cause oversized images.
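A minimal JavaScript sketch of the base-URL fix described above: each `url(...)` reference in a stylesheet is resolved against the URL of the stylesheet itself. The helper name `absolutizeCssUrls` and the example.com addresses are hypothetical, and this is not how SingleFile itself is implemented.

```javascript
// Hypothetical helper (not part of SingleFile): rewrite relative url(...)
// references in CSS text so they resolve against the stylesheet's URL.
function absolutizeCssUrls(cssText, stylesheetUrl) {
    return cssText.replace(/url\(\s*(['"]?)([^'")]+)\1\s*\)/g, (match, quote, ref) => {
        // Inline data: URIs need no rewriting.
        if (ref.trim().startsWith("data:")) {
            return match;
        }
        try {
            return `url(${quote}${new URL(ref.trim(), stylesheetUrl).href}${quote})`;
        } catch (error) {
            // Leave references that cannot be parsed untouched.
            return match;
        }
    });
}

// Made-up stylesheet text and URL, for illustration only.
const css = '@font-face { src: url("../fonts/icons.woff2"); } .logo { background: url(img/logo.png); }';
console.log(absolutizeCssUrls(css, "https://example.com/static/css/site.css"));
```

Resolving against the stylesheet URL rather than the page URL matters because relative references in CSS are relative to the file that contains them.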

I wrote a script that replaced percentages with "auto," but then new issues appeared. And, the layout is still broken on the airbnb website.

Is this something similar to anything you had to work through?

Instagram and Airbnb Html

gildas-lormeau commented 5 years ago

If I understand you correctly, you are re-implementing from scratch SingleFile. As you can imagine, this is a lot of work (approx. 10,000 lines of code) and I'm not sure this is the simplest thing to do. Unfortunately, I won't have the time to fix issues in your implementation.

david-littlefield commented 5 years ago

Hi @gildas-lormeau!

Wow, I didn't realize there was that much code. I started building that project because it didn't look like SingleFile could be used in Swift. But, that process has helped me learn a lot about JavaScript. And, based on your latest suggestion, combined with my improved understanding of JavaScript, integrating SingleFile into Swift seems promising!

I've been immersed in another project for the past several days, but that should finish very soon. I'll be jumping right back into SingleFile and JavaScript Core afterward!

Thanks for the feedback @gildas-lormeau!

david-littlefield commented 5 years ago

Hi @gildas-lormeau!

I just finished the project I was working on - it took a lot longer than I anticipated. But, I'm back on the SingleFile and JavaScript Core quest! I'm going to try your recent suggestion, and I'll let you know how it goes. Thanks again!

gildas-lormeau commented 5 years ago

You're welcome @captaindavepdx . Feel free to ask questions here if necessary ;)

david-littlefield commented 5 years ago

Awesome! Ok, it seems like Swift loads the js files as if it pasted the contents of each file into the browser using developer tools console.

Currently, I'm getting the following error: Promise {status: "rejected", result: ReferenceError: Can't find variable: singlefile} = $1

From exploring potential causes, I'm wondering:

Right now, I'm researching how to do that.

Does that sound about right, or would you suggest something else?

gildas-lormeau commented 5 years ago

I think you may need to replace all this.singlefile.. occurrences with window.singlefile.... This is what I need to do when I run the code via WebDriver in gecko-based browsers. Here is the code that does the replace: https://github.com/gildas-lormeau/SingleFile/blob/862d72073d6f35f8ffb60ed8ccd815cf12f0384e/cli/back-ends/webdriver-gecko.js#L100.

This is the interesting part: .replace(/\n(this)\.([^ ]+) = (this)\.([^ ]+) \|\|/g, "\nwindow.$2 = window.$4 ||")
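To make the effect of that regular expression concrete, here is a small JavaScript sketch that applies it to a made-up two-line sample. The sample text and the function name `patchNamespace` are assumptions; only the regular expression and the replacement string come from the comment above.

```javascript
// Rewrites "this.X = this.Y ||" at the start of a line into "window.X = window.Y ||".
// The regular expression and replacement string are the ones quoted above.
function patchNamespace(source) {
    return source.replace(/\n(this)\.([^ ]+) = (this)\.([^ ]+) \|\|/g, "\nwindow.$2 = window.$4 ||");
}

// Made-up sample imitating namespace declarations at the top of a module.
const sample = "\nthis.singlefile = this.singlefile || {};\nthis.singlefile.lib = this.singlefile.lib || {};";
console.log(patchNamespace(sample));
```

Because the pattern starts with `\n`, a declaration on the very first line of a file is not rewritten, and uses of `this.singlefile` that do not match the `this.X = this.Y ||` shape are left alone.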

david-littlefield commented 5 years ago

Woah, that's cool! Ok, so I replaced all occurrences of "this.singlefile" with "window.singlefile"

It still displays an error: Promise {status: "rejected", result: ReferenceError: Can't find variable: singlefile} = $2

I was browsing through the js files, I'm pretty new to javascript, and I couldn't tell where "singlefile" was initially declared.

gildas-lormeau commented 5 years ago

The singlefile object is created in the index.js file in the root folder.

https://github.com/gildas-lormeau/SingleFile/blob/1de14bdbb08c4e0d5fdd72201bfade4e45c1b167/index.js#L24-L54

gildas-lormeau commented 5 years ago

You can also try to replace in this file this.singlefile = this.singlefile || { with var singlefile = {

david-littlefield commented 5 years ago

Ok, did that. It makes sense that it should work now, but it still has the same error. For debugging, I started incrementally loading each script into the browser. That's when I noticed there were local folder references in the js files:

scriptElement.src = browser.runtime.getURL("/lib/hooks/content/content-hooks-frames-web.js");

The Swift app loads the contents of the js files, but it's not referencing the actual js files. Do I need to include the full local folder references in the js files?

scriptElement.src = browser.runtime.getURL("/Users/davidlittlefield/Desktop/SingleFile/lib/hooks/content/content-hooks-frames-web.js");

Or load those js files into the Swift app as well?

If possible, it'd be awesome to save all the needed js files in the app, so the app can run SingleFile without needing to download anything from npm.

gildas-lormeau commented 5 years ago

Indeed, I forgot to mention you have to implement the singlefile.lib.getFileContent method to circumvent your issue. This method is called twice and should return the content of the /lib/hooks/content/content-hooks-web.js and /lib/hooks/content/content-hooks-frames-web.js files. It means that if you run the following code after injecting SingleFile code, it should work (dump the code of the 2 files in the returned string).

singlefile.lib.getFileContent = function() {
    return `
        // dump here the content of /lib/hooks/content/content-hooks-web.js
        // dump here the content of /lib/hooks/content/content-hooks-frames-web.js
    `;
};

Edit: Maybe this procedure is not the simplest one. Does browser.runtime.getURL exist in your environment?

gildas-lormeau commented 5 years ago

FYI, here is how I've implemented this function to run SingleFile with Puppeteer

https://github.com/gildas-lormeau/SingleFile/blob/862d72073d6f35f8ffb60ed8ccd815cf12f0384e/cli/back-ends/puppeteer.js#L90-L94

david-littlefield commented 5 years ago

I'm using the latest version of Chrome. I didn't see that as an option in the console. Chrome docs reference it as chrome.runtime.getURL(), but it doesn't work when I type it into the console.

https://developer.chrome.com/extensions/runtime#method-getURL

In your puppeteer file, is that a dictionary where the keys are the file path, and the values are the contents of the js file?

I ended up manually putting all the js files into a single js file, typing the whole path for both of those functions:

if (this.browser && browser.runtime && browser.runtime.getURL) {
    scriptElement.src = browser.runtime.getURL("/Users/davidlittlefield/Desktop/SingleFile" + "/lib/hooks/content/content-hooks-frames-web.js");
    scriptElement.async = false;
} else if (this.singlefile.lib.getFileContent) {
    scriptElement.textContent = this.singlefile.lib.getFileContent("/Users/davidlittlefield/Desktop/SingleFile" + "/lib/hooks/content/content-hooks-frames-web.js");
}

Now, it can find the singlefile variable. And a new error appears: Unhandled Promise Rejection: ReferenceError: Can't find variable: options

Am I doing those functions wrong? Some of your code is more advanced than I've worked with, so I only kind of understand what's happening.

gildas-lormeau commented 5 years ago

You're making progress but the code you posted won't work as you expect. The problem is that I suppose the condition this.browser && browser.runtime && browser.runtime.getURL is never true in your environment so the code in the first block of the if will never be executed.
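The branch selection described here can be sketched as a pure function. This is an illustration only: `chooseHookLoader` and the `env` parameter (standing in for the global object) are hypothetical names, and the conditions mirror the if/else-if quoted in the previous comment.

```javascript
// Hypothetical helper mirroring the quoted if / else-if.
// `env` stands in for the global object (window).
function chooseHookLoader(env, path) {
    if (env.browser && env.browser.runtime && env.browser.runtime.getURL) {
        // Only available inside a browser extension.
        return { via: "src", value: env.browser.runtime.getURL(path) };
    } else if (env.singlefile && env.singlefile.lib && env.singlefile.lib.getFileContent) {
        // Fallback: the host application supplies the script text itself.
        return { via: "textContent", value: env.singlefile.lib.getFileContent(path) };
    }
    // Neither mechanism exists, so no hook script would be injected.
    return { via: "none", value: undefined };
}

// A plain web view: no extension API, but getFileContent has been defined.
const webViewEnv = { singlefile: { lib: { getFileContent: path => "/* code for " + path + " */" } } };
console.log(chooseHookLoader(webViewEnv, "/lib/hooks/content/content-hooks-web.js"));
```

In an environment without extension APIs the first branch is skipped regardless of the path passed to `getURL`, and if `getFileContent` is also undefined the element is appended with no content at all.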

singlefile.lib.getContent is called here in the code of the SingleFile:

singlefile.lib.getContent should return the code (as a string) that corresponds to the path given as parameter. That's why I use a dictionary in the puppeteer implementation. However, you can also simply concatenate the 2 scripts (/lib/hooks/content/content-hooks-web.js and /lib/hooks/content/content-hooks-frames-web.js) and return the whole string without taking the path parameter into account, as I suggested. This should also work because there won't be any clash between the 2 scripts.
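Both options can be sketched in a few lines of JavaScript. The sketch uses the name `getFileContent`, as it appears in the code quoted earlier in the thread; the `singlefile` stand-in object, the `hookScripts` dictionary, `getAllHooks`, and the placeholder strings are assumptions for illustration.

```javascript
// Stand-in for the object created by SingleFile's index.js (assumption for this sketch).
const singlefile = { lib: {} };

// Placeholder strings; in practice each value would be the full text of the file.
const hookScripts = {
    "/lib/hooks/content/content-hooks-web.js": "/* content of content-hooks-web.js */",
    "/lib/hooks/content/content-hooks-frames-web.js": "/* content of content-hooks-frames-web.js */"
};

// Option 1: dictionary lookup keyed by the path parameter.
singlefile.lib.getFileContent = path => hookScripts[path];

// Option 2: ignore the path and return both scripts concatenated.
const getAllHooks = () => Object.values(hookScripts).join("\n");

console.log(singlefile.lib.getFileContent("/lib/hooks/content/content-hooks-web.js"));
console.log(getAllHooks());
```

Option 1 returns `undefined` for a path that is not in the dictionary, which makes a wrong path easy to spot; option 2 never fails but always injects both scripts.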

Regarding the options error, you have indeed to declare it and assign it to an object (i.e. const options = {}) before running the code I pasted here https://github.com/gildas-lormeau/SingleFile/issues/231#issuecomment-506988406.

david-littlefield commented 5 years ago

Ok, I tried to use singlefile.lib.getContent but it doesn't appear to be an option? Maybe I forgot to load one of the js files? Which js file was getContent from?

Screen Shot 2019-07-25 at 2 54 50 PM

gildas-lormeau commented 5 years ago

The method singlefile.lib.getContent does not exist by default, you have to define it and inject it.

david-littlefield commented 5 years ago

Ok, cool.

gildas-lormeau commented 5 years ago

I never wrote swift code in my life. I guess it would look like this.

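import Foundation

// Folder that holds the two hook scripts in the local SingleFile checkout
let hooksDir = URL(fileURLWithPath: "/Users/davidlittlefield/Desktop/SingleFile/lib/hooks/content")
let hooksFrameURL = hooksDir.appendingPathComponent("content-hooks-frames-web.js")
let hooksURL = hooksDir.appendingPathComponent("content-hooks-web.js")
var script = ""
do {
    let textHooksFrame = try String(contentsOf: hooksFrameURL, encoding: .utf8)
    let textHooks = try String(contentsOf: hooksURL, encoding: .utf8)
    // JSON-encode the concatenated code so that it is a valid JavaScript string literal
    // (needs a Foundation version that supports top-level JSON fragments)
    let literal = try JSONEncoder().encode(textHooksFrame + "\n" + textHooks)
    script = "singlefile.lib.getFileContent = () => " + String(decoding: literal, as: UTF8.self)
} catch {
    /* error handling here */
}
// inject the script into JavaScript core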

I used this code as an example: https://stackoverflow.com/questions/24097826/read-and-write-a-string-from-text-file.

david-littlefield commented 5 years ago

Haha, that's pretty good Swift! I was looking into how to load the contents of js files in JavaScript, which several StackOverflow posts said wasn't possible in the browser for security reasons.

All of the SingleFile js files are in one js file that is stored in my app. That js file is separate from the Swift files. So, I can load the entire contents of that file into the webView at launch, and then inject js into the webView from Swift afterward.

But, the challenge seems to be injecting the contents of those files from Swift at launch, assuming that SingleFile needs it at launch. Which I'm pretty sure is doable. I'd need to store the contents of that js file in the swift file as a string. Then, I could inject what we need using string interpolation.

When does SingleFile need the contents of those files?

gildas-lormeau commented 5 years ago

Let me suggest an acceptable alternative to the injection of the singlefile.lib.getContent method.

I think this is the simplest way to solve this issue.

david-littlefield commented 5 years ago

Awesome! I didn't realize I could literally paste the contents of the file as a multiline string in the js file, haha. That approach no longer needs the file path concatenated to it, right?

gildas-lormeau commented 5 years ago

This is the main feature of the ` (backquote) delimiter and it's quite useful indeed.

With this approach, the issues related to singlefile.lib.getFileContent will be fixed because SingleFile won't call it anymore. You will just have to define the options object (i.e. const options = {};) before launching SingleFile.
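One caveat when pasting a script between backquotes: backslashes, backquotes, and `${` sequences inside the pasted code are interpreted by the template literal, so the embedded text can silently differ from the original file. A hedged sketch of an escaping step follows; the helper name `toTemplateLiteral` and the sample script text are hypothetical.

```javascript
// Hypothetical helper: wrap source code in backquotes, escaping the three
// sequences a template literal would otherwise interpret: \, ` and ${.
function toTemplateLiteral(source) {
    return "`" + source.replace(/[\\`]|\$\{/g, match => "\\" + match) + "`";
}

// Made-up script text containing all three sequences.
const hookSource = "const re = /\\d+/; const msg = `count: ${n}`;";
const literal = toTemplateLiteral(hookSource);
console.log(literal);

// Evaluating the literal gives back the original text unchanged.
console.log(eval(literal) === hookSource);
```

Running such a step from a build script, rather than pasting by hand, also makes it easier to refresh the embedded code when SingleFile is updated.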

gildas-lormeau commented 5 years ago

I don't know if you're using the following API, but I recommend injecting scripts as in the example below, so that the SingleFile code is injected into all frames and as early as possible.

...
let scriptToInject = "..."
let contentController = WKUserContentController()
let userScript = WKUserScript(source: scriptToInject, injectionTime: WKUserScriptInjectionTime.atDocumentStart, forMainFrameOnly: false)
contentController.addUserScript(userScript)
...

I used this post as an example: https://medium.com/@DrawandCode/how-to-communicate-with-iframes-inside-webview-2c9c86436edb

david-littlefield commented 5 years ago

Cool, I've been trying both ways. Right now, I'm still trying to make the changes from your last post. I know we added content-hooks-frames.js and hard-coded content-hooks-frames-web.js, but I don't remember loading content-hooks.js. And I didn't see it referenced in the other js files when I searched for it.

Do I need to add that with the list of js files? Also, does it matter if the script is injected at the document start or end?

gildas-lormeau commented 5 years ago

Let me recap.

david-littlefield commented 5 years ago

I think I've done everything except the second part of #231 (Comment).

I didn't see any reference to content-hooks.js in the concatenated js files, so I'm not sure how to replace the getFileContents in the content-hooks.js portion of the concatenated file.

Sorry, if I'm missing something obvious.

gildas-lormeau commented 5 years ago

Okay, I did not understand that /lib/hooks/content/content-hooks.js was missing in the list... You should inject it, it's a bug in my implementation(s). I'll fix that (https://github.com/gildas-lormeau/SingleFile/issues/247). I updated the previous post accordingly.

gildas-lormeau commented 5 years ago

I updated again the "Recap" post https://github.com/gildas-lormeau/SingleFile/issues/231#issuecomment-515243538 to take into account the frames of the page (cf forMainFrameOnly).

david-littlefield commented 5 years ago

Cool, doing that now. The Swift function requires an injection time, atDocumentStart or atDocumentEnd, does that matter?

gildas-lormeau commented 5 years ago

Yep, to get fonts on https://www.theverge.com/ for example.

Use atDocumentStart for the list of files and atDocumentEnd for the SingleFile script. I updated the post.

david-littlefield commented 5 years ago

I didn't realize I needed to split those into two separate scripts - list of files and singlefile. Doing that now.

gildas-lormeau commented 5 years ago

It's my fault, I was not clear enough. I highly recommend automating all these steps with a Swift program, so that you'll be able to update the code easily in the future.

david-littlefield commented 5 years ago

Good idea! Ok, so it started doing stuff, it went from 0 to 871, but crashed on type error: Unhandled Promise Rejection: TypeError: Type error

Screen Shot 2019-07-25 at 4 38 16 PM Screen Shot 2019-07-25 at 4 39 27 PM

gildas-lormeau commented 5 years ago

What is this "871"? I would also need more details about the type error. Can't you have a stacktrace or a line number, or maybe attach a debugger to the webview?
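One possible way to get more detail than the bare message, sketched in JavaScript: wrap the call in try/catch and log the error's name, message, and stack. The names `describeError` and `runWithDiagnostics` and the failing sample task are hypothetical; the `stack` property is non-standard but widely available, and `line` and `column` are JavaScriptCore-specific extras that may be absent elsewhere.

```javascript
// Hypothetical helper: collect what an Error carries into one printable string.
function describeError(error) {
    const parts = [`${error.name}: ${error.message}`];
    // line and column are JavaScriptCore extras; other engines do not set them.
    if (error.line !== undefined) {
        parts.push(`at line ${error.line}, column ${error.column}`);
    }
    if (error.stack) {
        parts.push(String(error.stack));
    }
    return parts.join("\n");
}

// Hypothetical wrapper: run an async task and log details if it rejects.
async function runWithDiagnostics(task) {
    try {
        return await task();
    } catch (error) {
        console.log(describeError(error));
        throw error; // keep the original failure visible to the caller
    }
}

// Sample task that rejects with a TypeError (an invalid base URL).
runWithDiagnostics(async () => new URL("/style.css", "not a base URL")).catch(() => {});
```

Wrapping the entry point this way avoids relying on the console's "Unhandled Promise Rejection" summary, which drops most of the detail.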

david-littlefield commented 5 years ago

"871" seemed like it was making progress. It started at 0 then increased incrementally to 465, and then incrementally to 871, then it stopped, and it displayed the error message.

It had a line number, but it pointed to that line number in the SingleFileScript I just made.

Looking into debug tools. I'm not as familiar with them because it's JavaScript running out of the Safari app, instead of the normal Swift Xcode editor.

gildas-lormeau commented 5 years ago

Okay, please let me know when you have more info. I updated the procedure again: https://github.com/gildas-lormeau/SingleFile/issues/231#issuecomment-515235314. There are now blocks of lines to replace instead of single lines. Maybe this will help to fix your issue. Make sure you use backquotes to delimit the dumped scripts.

david-littlefield commented 5 years ago

Will do, thanks a lot @gildas-lormeau! You've been so incredibly helpful!

gildas-lormeau commented 5 years ago

You're welcome. If it works and if you can automate this with a swift program, please consider open-sourcing your code. It will be very helpful for everyone too! :)

david-littlefield commented 5 years ago

Absolutely!

david-littlefield commented 5 years ago

I forgot to ask, the browser.runtime.getURL didn't matter anymore, right? Because it returns false, so the else statement gets called, which is where we added the contents of js file as a multiline string, right?

Doesn't matter? scriptElement.src = browser.runtime.getURL("/lib/hooks/content/content-hooks-web.js");

Because we added? scriptElement.textContent = "contents of the js file..."

window.singlefile.lib.hooks.content.main = window.singlefile.lib.hooks.content.main || (() => {

    if (document instanceof HTMLDocument) {
        const scriptElement = document.createElement("script");
        scriptElement.async = false;
        if (this.browser && browser.runtime && browser.runtime.getURL) {
            scriptElement.src = browser.runtime.getURL("/lib/hooks/content/content-hooks-web.js");
            scriptElement.async = false;
        } else if (this.singlefile.lib.getFileContent) {
            scriptElement.textContent = contentHooksWeb;
        }
        (document.documentElement || document).appendChild(scriptElement);
        scriptElement.remove();
    }
    return {};

})();

david-littlefield commented 5 years ago

Ok, I redid the scripts from scratch using your instructions. The same error occurs, but I might have found the stack trace: stack: "[native code]↵parseURL@user-script:1:895:20↵user-script:1:10936:29↵asyncFunctionResume@[native code]↵user-script:1:10913:6…"

Screen Shot 2019-07-25 at 10 20 52 PM

gildas-lormeau commented 5 years ago

What is the URL of the page you're using to test your program? You have also to make sure you execute SingleFile (last step) after navigating to this page.

david-littlefield commented 5 years ago

"www.apple.com" I have it set to run from clicking a button after the page has loaded.

Screen Shot 2019-07-26 at 7 22 00 AM

I also added console.log to the resourceUrl and baseURI.

Screen Shot 2019-07-26 at 7 33 15 AM

Here's the console.log for just the resourceURL

Screen Shot 2019-07-26 at 7 41 03 AM

Console.txt

gildas-lormeau commented 5 years ago

That's weird... Could you please add console.log(document) and console.log(document.baseURI) for example at the top of the script you're running when you click on the button?
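Why the base URI is worth checking can be shown with a short JavaScript sketch. The stack trace posted earlier in the thread ends in a function named parseURL, which suggests (without confirming) that a URL could not be resolved. Assuming parseURL relies on the standard `URL` constructor, an invalid or missing base makes the constructor throw a TypeError. The helper name `tryResolve` and the sample values are hypothetical.

```javascript
// Hypothetical helper: resolve a resource URL against a base URI and report
// the error name instead of throwing when resolution fails.
function tryResolve(resourceUrl, baseURI) {
    try {
        return new URL(resourceUrl, baseURI).href;
    } catch (error) {
        return `${error.name} for base ${JSON.stringify(baseURI)}`;
    }
}

// A base with a scheme resolves; a bare host name is not a valid base.
console.log(tryResolve("/style.css", "https://www.apple.com/"));
console.log(tryResolve("/style.css", "www.apple.com"));
```

If the logged `document.baseURI` turns out not to be an absolute URL with a scheme, that would be consistent with this failure mode, though the thread does not establish the actual cause.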

david-littlefield commented 5 years ago

Screen Shot 2019-07-26 at 10 28 23 AM