go-shiori / obelisk

Go package and CLI tool for saving a web page as a single HTML file
MIT License

Allow the option to archive with a headless browser #14

Open hellodword opened 2 years ago

hellodword commented 2 years ago

Something like ArchiveBox. I think ArchiveBox is very nice, but it has two issues:

  1. it is slow, though that is not a big deal;
  2. it lacks custom automation for special pages (lazy loading, for example); this issue is working on it.

I also found a great Go library, rod. How about adding a mode that uses headless (or headful, depending on the case) Chromium?

waybackarchiver commented 2 years ago

That is a fantastic idea. Given the original requirement, we implemented similar features in screenshot, but it is still not what you expected.

Perhaps we can take things further and develop a piecemeal approach here.

hellodword commented 2 years ago

The biggest challenge for me is developing or choosing a script and its interpreter.

I have no prior experience with this, but rod has a good API. :smile:

I will try to implement this. I really prefer this mode; dealing with all the elements is too hard.

fmartingr commented 2 years ago

If this happens, I'd like the feature to be optional if it requires a lot of (or complex) external dependencies. I'm trying to keep shiori as simple as possible, and we just got rid of CGO by switching the SQLite driver. Since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :)

hellodword commented 2 years ago

> If this happens, I'd like the feature to be optional if it requires a lot of (or complex) external dependencies. I'm trying to keep shiori as simple as possible, and we just got rid of CGO by switching the SQLite driver. Since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :)

I'm also a big fan of staying CGo-free with fewer dependencies. chromedp and rod are both based on the Chrome DevTools Protocol, without CGo or tons of dependencies. :smiley:

waybackarchiver commented 2 years ago

@fmartingr Please don't worry about complex external dependencies. Perhaps we can look forward to the proposed work.

Anyway, PRs are welcome.

hellodword commented 2 years ago

Hey, I created a simple demo.

https://github.com/hellodword/web-archiving-with-headless-chromium-demo

env rod=show,bin=/path/to/chrome go run .

It's very simple, but it provides a custom post.js for hooking and a pre.js for scroll/click/...

It uses SingleFile for saving.
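As a rough illustration of the script-hook idea described above, here is a minimal Go sketch. It does not reproduce the demo's code: the `Page` interface, `Archive`, and `fakePage` are hypothetical names, and the order in which the demo runs pre.js and post.js is not stated in this thread, so the sketch simply runs user scripts in the order given before capturing the HTML. A real `Page` would wrap a Chrome DevTools Protocol client; the fake one only records calls so the flow can be run without a browser.

```go
package main

import "fmt"

// Page is a hypothetical, minimal view of a browser tab.
// A real implementation would wrap a Chrome DevTools Protocol client.
type Page interface {
	Navigate(url string) error
	Eval(js string) error
	HTML() (string, error)
}

// Archive loads url, evaluates each non-empty user script in order
// (for example scroll or click automation), then returns the page HTML.
func Archive(p Page, url string, scripts ...string) (string, error) {
	if err := p.Navigate(url); err != nil {
		return "", fmt.Errorf("navigate: %w", err)
	}
	for _, js := range scripts {
		if js == "" {
			continue
		}
		if err := p.Eval(js); err != nil {
			return "", fmt.Errorf("eval: %w", err)
		}
	}
	return p.HTML()
}

// fakePage records calls so the flow can be exercised without a browser.
type fakePage struct{ calls []string }

func (f *fakePage) Navigate(url string) error {
	f.calls = append(f.calls, "navigate "+url)
	return nil
}

func (f *fakePage) Eval(js string) error {
	f.calls = append(f.calls, "eval "+js)
	return nil
}

func (f *fakePage) HTML() (string, error) {
	f.calls = append(f.calls, "html")
	return "<html></html>", nil
}

func main() {
	p := &fakePage{}
	scroll := "window.scrollTo(0, document.body.scrollHeight)"
	html, err := Archive(p, "https://example.com", scroll)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.calls, len(html))
}
```

Keeping the browser behind a small interface like this is one way to let the scripting layer be tested, or swapped between CDP libraries, without touching the archiving logic.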

waybackarchiver commented 2 years ago

This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk would no longer be required. If we stick to this plan, a new project might be a better alternative.

@fmartingr What do you think?

hellodword commented 2 years ago

> it is heavily dependent on SingleFile

Right, and it's buggy in this demo. 😂

But it's optional, just as in ArchiveBox: ArchiveBox has multiple saving modes, and SingleFile is only one of them.

What I want to show is the ability to run custom scripts, plus a highly recommended CDP library for Go; I think it's much better than chromedp.

waybackarchiver commented 2 years ago

Appreciate the time and effort. Personally, I prefer the option of trying to inject the script in headless over the one implemented in the screenshot project.

It appears that making it an option would be reasonable, so if SingleFile is added as a browser extension, I would prefer to put the git submodule in the .github/thridparty directory.

An example of archiving results using screenshot:

[image not preserved]

fmartingr commented 2 years ago

> This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk would no longer be required. If we stick to this plan, a new project might be a better alternative.
>
> @fmartingr What do you think?

I haven't started migrating to obelisk yet... it will be an interesting amount of work, and I don't have much time to spare these weeks (and most of it goes into replying to issues and PRs, yay FOSS! :joy:).

My comment was more about the current state of shiori and some comments by our packagers regarding external dependencies or ecosystems. For me, the ideal solution is to import obelisk without much trouble and without losing the ability to cross-compile or requiring external software for archiving to work. If you want to add that to obelisk, I'd ask that it be optional for users (you can either build it with --tags XXXX or require something else).

That said, I don't want my comments/vision to halt obelisk's progress! I'm just expressing my fears from a user's perspective, not imposing anything. I haven't used any library like this in a while (and not in the Go world, anyway), so I just wanted to make sure I don't create future problems for shiori. You folks are the experts here :)

hellodword commented 2 years ago

Right, it was a demo, so I used SingleFile directly as an embedded dependency; it could, or should, act as a plugin.

I think archiving tools nowadays do not necessarily need Chromium, but they do need to be extensible through scripting. One reason is that there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.

waybackarchiver commented 2 years ago

> there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.

I'm interested in this somehow, so let's do it.

Related to wabarc/wayback#92

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

Katarn commented 1 year ago

It seems to me that the ideal solution would be the ability to prepare the page for saving not on the server but on the client, and then send it to the server.

A lot of sites now use dynamic image loading and captcha checks. They load comments only if you scroll down to them (and comments are sometimes more interesting than the article itself), and they don't load all comments (discussion threads stay hidden until you force them open). There is a lot of dynamic behavior. Therefore, it is better to save the page after checking with your own eyes that everything needed is loaded and displayed. There is no universal solution here, so it is preferable to inspect the page yourself.

I just looked into my Pocket archive and it made me very sad: many domains are already parked and the sites are gone. And the pages themselves (on the premium plan) are far from completely saved; sometimes they don't even have text. Now I'm looking for a solution to this problem. I have started saving pages through SingleFile, and if you tie it to shiori, it will be just the perfect bookmark manager.

At the same time, I would like shiori not to save the text to its database (except perhaps for quick search), but always to extract it again from the saved page. Text content recognition algorithms will keep improving, so content stored in the database may have been recognized incorrectly and may no longer match what a newer version of the application would extract.

waybackarchiver commented 1 year ago

@Katarn Thank you for the suggestion; it's a fantastic idea. As intended, obelisk should support both headless and non-headless modes for archiving web pages.

> if you tie it to shiori, it will be just the perfect bookmark manager.

Making shiori work with obelisk is related to https://github.com/go-shiori/shiori/issues/353
