Open hellodword opened 2 years ago
That is a fantastic idea. Given the original requirement, we implemented similar features in screenshot, but it is still not what you expected.
Perhaps we can take things further and develop a piecemeal approach here.
The biggest challenge for me is developing or choosing a script and its interpreter.
I have no experience with this, but rod has a good API. :smile:
I will try to implement this. I really prefer this mode; dealing with all elements is too hard.
If this happens, I'd like the feature to be optional if it requires a lot of (or complex) external dependencies. I'm trying to keep shiori as simple as possible: we just got rid of CGO by switching the SQLite driver, and since wasm is obsolete and I have to start replacing it with Obelisk, the more seamless the experience can be, the better :)
I'm also a big fan of CGo-free and fewer dependencies, chromedp and rod are based on Chrome DevTools Protocol, without CGo or tons of dependencies. :smiley:
@fmartingr Please don't be worried about complex external dependencies. Perhaps we can look forward to the given works.
Anyway, PRs are welcome.
Hey, I created a simple demo.
https://github.com/hellodword/web-archiving-with-headless-chromium-demo
env rod=show,bin=/path/to/chrome go run .
It's very simple, but it provides a custom post.js for hooking and a pre.js for scroll/click/..., and it uses SingleFile for saving.
This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.
@fmartingr What do you think?
> it is heavily dependent on SingleFile
Right, and it's buggy in this demo. 😂
But it's optional, just like in archivebox: archivebox has multiple saving modes, and SingleFile is only one of them.
What I want to show is the ability to run custom scripts, and a highly recommended CDP library for Go, which I think is much better than chromedp.
Appreciate the time and effort. Personally, I prefer the option of trying to inject the script in headless over the one implemented in the screenshot project.
It appears that making it an option would be reasonable, so if SingleFile is added as a browser extension, I would prefer to put the gitmodule in the .github/thridparty directory.
An example of archiving results using screenshot:
> This implementation is creative and far superior to our current archiving solution. However, it is heavily dependent on SingleFile, and it appears that obelisk is no longer required. If we stick to this plan, a new project might be a better alternative.
> @fmartingr What do you think?
I still haven't started migrating to obelisk just yet... it will be an interesting amount of work to perform and I do not have much time to spare these weeks (and most of it is invested in replying to issues and PRs, yay FOSS! :joy:).
My comment was more about the current state of shiori and some comments by our packagers regarding external dependencies or ecosystems. For me the ideal solution is to import obelisk without much trouble and without losing the ability to cross compile or requiring external software for the archive to work. If you want to add that to obelisk, I'd say it should be optional for users (you can either build it with --tags XXXX or require anything else).
That said, I don't want my comments/vision to halt obelisk's progress! I'm just expressing my fears from a user's perspective, not imposing anything. I haven't used any library like this in a while (and not in the Go world, anyway) so I just wanted to make sure I don't create future problems for shiori. You folks are the experts here :)
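One way to keep a heavier saving mode optional, in the spirit of the `--tags` suggestion above, is a small registry that optional modes add themselves to. This is only a sketch: `Saver`, `Register`, `Save`, and the mode names are hypothetical, and the build tag mentioned in the comments is illustrative, not an existing obelisk or shiori tag.

```go
package main

import (
	"fmt"
	"sort"
)

// Saver is a hypothetical signature for one saving mode: it takes a URL and
// returns the archived payload.
type Saver func(url string) (string, error)

var savers = map[string]Saver{}

// Register adds a saving mode. An optional mode could call this from an
// init() in a separate file guarded by a build constraint (for example
// "//go:build singlefile"; the tag name is illustrative), so that default
// builds carry no extra dependencies.
func Register(name string, s Saver) { savers[name] = s }

// Modes lists the saving modes compiled into this binary, sorted by name.
func Modes() []string {
	names := make([]string, 0, len(savers))
	for name := range savers {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// Save dispatches to the requested mode, or reports that it is unavailable.
func Save(mode, url string) (string, error) {
	s, ok := savers[mode]
	if !ok {
		return "", fmt.Errorf("saving mode %q is not compiled in", mode)
	}
	return s(url)
}

func init() {
	// The always-available default mode; a placeholder implementation.
	Register("plain", func(url string) (string, error) {
		return "plain:" + url, nil
	})
}

func main() {
	fmt.Println(Modes())
	out, err := Save("plain", "https://example.com")
	fmt.Println(out, err)
	_, err = Save("singlefile", "https://example.com")
	fmt.Println(err)
}
```

With this shape, cross compilation of the default build is unaffected, and a user who wants the browser-based mode opts in at build time.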
Right, it was a demo, so I used SingleFile directly as an embedded dependency; it could, or should, act as a plugin.
I think archiving tools nowadays do not necessarily need Chromium, but they do need scripting extensibility; one reason is that there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.
> there is too much anti-bot stuff (captcha, WAF, and so on) on the internet.
I'm interested in this somehow, so let's do it.
Related to wabarc/wayback#92
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days
It seems to me that the ideal solution would be the ability to prepare the page for saving on the client rather than on the server, and then send it to the server.
A lot of sites now use dynamic image loading and captcha checks; they load comments only when you scroll down to them (and comments are sometimes more interesting than the article itself), and they don't load all comments (discussion threads stay hidden until you force them open). Lots of dynamics. Therefore, it is better to save the page after checking with your own eyes that everything needed is loaded and displayed. There is no universal solution here, so it is preferable to inspect the page yourself.
I just looked into my Pocket archive and it made me very sad: many domains are already parked and the sites are gone. And the pages themselves (on the premium plan) are far from completely saved; sometimes they don't even have the text. So now I'm looking for a solution to this problem. I have started saving pages through SingleFile, but if you tie it to shiori, it will be just the perfect bookmark manager.
At the same time, I would like shiori not to save the text to its database (except perhaps for quick search), but always to retrieve it again from the saved page. Text content recognition algorithms will keep improving, and content stored in the database may have been recognized incorrectly and may no longer be relevant under a new version of the application.
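The last point, treating the saved page as the source of truth and the extracted text as disposable, can be sketched as a cache keyed by extractor version. Everything here is hypothetical (`Store`, `extractorVersion`, and the toy `extract` function are not shiori code); a real extractor would be a readability algorithm, not a tag stripper.

```go
package main

import (
	"fmt"
	"strings"
)

// extractorVersion is hypothetical: bump it whenever extract changes.
const extractorVersion = 2

type cachedText struct {
	version int
	text    string
}

// Store keeps the archived HTML as the source of truth and the extracted
// text only as a disposable cache (for example, for quick search).
type Store struct {
	archive     map[string]string     // URL -> saved HTML
	cache       map[string]cachedText // URL -> extracted text
	extractions int                   // how many times extract ran
}

// extract is a toy stand-in for a real readability algorithm: it drops
// everything between '<' and '>' and collapses whitespace.
func extract(html string) string {
	var b strings.Builder
	inTag := false
	for _, r := range html {
		switch {
		case r == '<':
			inTag = true
		case r == '>':
			inTag = false
			b.WriteRune(' ')
		case !inTag:
			b.WriteRune(r)
		}
	}
	return strings.Join(strings.Fields(b.String()), " ")
}

// Text returns the extracted text for url, re-extracting from the saved
// page when the cached entry is missing or came from an older extractor.
func (s *Store) Text(url string) (string, bool) {
	if c, ok := s.cache[url]; ok && c.version == extractorVersion {
		return c.text, true
	}
	html, ok := s.archive[url]
	if !ok {
		return "", false
	}
	s.extractions++
	c := cachedText{version: extractorVersion, text: extract(html)}
	s.cache[url] = c
	return c.text, true
}

func main() {
	s := &Store{
		archive: map[string]string{"https://example.com": "<p>Hello <b>world</b></p>"},
		// A stale entry produced by an older extractor version.
		cache: map[string]cachedText{"https://example.com": {version: 1, text: "old"}},
	}
	text, ok := s.Text("https://example.com")
	fmt.Println(text, ok, s.extractions)
}
```

The database copy then never outlives the algorithm that produced it, while search can still use the cached text between upgrades.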
@Katarn Thank you for your offer, it's a fantastic idea. As intended, obelisk should support both headless and non-headless mode for archiving webpages.
> if you tie it to shiori, it will be just the perfect bookmark manager.
Making shiori work with obelisk is related to https://github.com/go-shiori/shiori/issues/353
Just like archivebox. I think archivebox is very nice, but there are two issues:
And I found a great Go library, rod. How about adding a mode that uses headless (or headful, it depends) Chromium?