shaarli / Shaarli

The personal, minimalist, super-fast, database free, bookmarking service - community repo
https://shaarli.readthedocs.io/

Bold suggestion: save full content of the page. #318

Open urza opened 9 years ago

urza commented 9 years ago

I know this is a bold one, but hear me out.

About half of my links that are older than 3 years are broken; the content has disappeared from the web. I discovered this the hard way, of course. Sometimes it doesn't matter, but sometimes the webpage in my bookmarks was really useful. The average lifetime of a URL on the internet is just a few years.

I know there is a discussion about archiving links to archive.org (#307) and I vouch for it; that enhancement would be very useful.

But I have another suggestion: how about saving the full content locally to a filesystem structure? Would it be hard? Saving the text would allow full-text search (not just of the links, but also of their content), which would bring Shaarli to a completely new level of usability.

I currently do this with Evernote: links that I consider important I save both to my Shaarli bookmarks (because I already have years of records here and want to keep them consistent) and to Evernote, which retrieves the whole content of the page, including pictures and media, lets me organize by notebooks and tags, and even lets me decide how I want to clip the page (full, article, just text). Then I can full-text search it. If Shaarli could match this somehow, it would be fantastic.

Maybe a combination of saving the full text of the link to allow full-text search, and taking a screenshot of the page with something like PhantomJS (http://phantomjs.org/screen-capture.html) to preserve its visual aspect?
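
To illustrate just the "save the text so it can be searched" half of the idea, here is a rough sketch using only the Python standard library (the directory layout is made up, and a real tool would need much better error handling):

```python
# Rough sketch: fetch a page, keep the raw HTML, and extract plain text so
# the archive can be grepped / full-text searched. Standard library only;
# the archive/<sha1>/ layout is made up and error handling is minimal.
import hashlib
import pathlib
import urllib.request
from html.parser import HTMLParser

ARCHIVE_DIR = pathlib.Path("archive")  # hypothetical location

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def archive_page(url):
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    extractor = TextExtractor()
    extractor.feed(html)
    dest = ARCHIVE_DIR / hashlib.sha1(url.encode()).hexdigest()
    dest.mkdir(parents=True, exist_ok=True)
    (dest / "page.html").write_text(html, encoding="utf-8")  # original markup
    (dest / "page.txt").write_text("\n".join(extractor.chunks),
                                   encoding="utf-8")         # searchable text
    return dest
```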

What do you guys think?

nicolasdanelon commented 9 years ago

Very interesting.. +1

ArthurHoaro commented 9 years ago

Well... this is a difficult subject, but an interesting feature.

What can be done:

  1. Save the raw page HTML: unreadable most of the time, so mostly useless.
  2. Save the page + media: we do this in Projet-Autoblog; it could be done as a per-link option, although there we rely on RSS feeds, which are easier to handle than full pages.
  3. Save extracted content: projects like wallabag try to do this. We definitely won't do it here.
  4. Screenshots: also an option, maybe the easiest, but it won't allow full-text search (a rough sketch follows below).

Any opinion would be welcome here.
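
For option 4, a rough sketch of what screenshots could look like, assuming a phantomjs binary and its bundled rasterize.js example are available (this is not an existing Shaarli feature, and the path below is hypothetical):

```python
# Rough sketch: render a page to PNG with PhantomJS' bundled rasterize.js
# example (see http://phantomjs.org/screen-capture.html). Assumes the
# `phantomjs` binary is on PATH; RASTERIZE_JS is a hypothetical path.
import subprocess

RASTERIZE_JS = "/usr/share/phantomjs/examples/rasterize.js"  # hypothetical path

def screenshot(url, out_png):
    subprocess.run(["phantomjs", RASTERIZE_JS, url, out_png], check=True)

# Example: screenshot("https://example.com", "example.png")
```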

virtualtam commented 9 years ago

Some thoughts and related tools:

urza commented 9 years ago

Shaarchiver looks nice... does it save the full HTML page content in addition to the media you mention? If so, it solves the problem for me; I'd just set it up as a cron job or something...

virtualtam commented 9 years ago

AFAIK, downloading page content is on the TODO list.

mro commented 9 years ago

-1

@urza, @virtualtam: what you describe doesn't fit "The personal, minimalist, super-fast, no-database delicious clone", so IMHO it should not go into Shaarli, because it is not Shaarli. I'd even go as far as saying it must not.

What you describe is "The personal, ... http://archive.org clone" (one that doesn't respect robots.txt), which is another beast.

BTW, have you seen the support for archive.org in shaarli?
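
For reference (and not necessarily how the existing integration works), asking the Wayback Machine to snapshot a link is essentially a single GET request to its save endpoint. A rough sketch:

```python
# Rough sketch: request a Wayback Machine snapshot of a URL through the
# public "Save Page Now" endpoint. Not Shaarli's actual integration;
# error handling is intentionally minimal.
import urllib.parse
import urllib.request

def archive_org_save(url):
    save_url = "https://web.archive.org/save/" + urllib.parse.quote(url, safe=":/?&=")
    req = urllib.request.Request(save_url,
                                 headers={"User-Agent": "shaarli-archive-sketch"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status  # a 2xx status means the snapshot request went through
```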

dimtion commented 9 years ago

Are there any downsides to using archive.org? Otherwise I agree with @mro: why bother doing such hard work for something that already exists?

nodiscc commented 9 years ago

Again, archiving features should remain in a separate tool. Shaarli should provide export formats that make it easy for such tools to parse and work with the data.

shaarchiver is easy enough to use and just relies on Shaarli's HTML export (it could be improved to parse RSS), though it doesn't support archiving webpages yet, mainly because I'd like to get it right on the first try (I haven't decided whether it will use python-scrapy or an external tool like httrack/wget, it needs better error handling, etc.). Help is welcome.
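
For what it's worth, pulling the URLs out of the HTML export is only a few lines, since the export is a Netscape-style bookmark file. A rough sketch (the file name and the downstream archive() call are placeholders):

```python
# Rough sketch: extract bookmarked URLs from a Shaarli HTML export
# (Netscape bookmark format) so an external archiver can process them.
from html.parser import HTMLParser

class BookmarkURLs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.urls.append(href)

def urls_from_export(path):
    parser = BookmarkURLs()
    with open(path, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    return parser.urls

# Hypothetical usage:
#   for url in urls_from_export("bookmarks_export.html"):
#       archive(url)  # hand each link to whatever archiver is chosen
```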

If you want a PHP-based tool, there is https://github.com/broncowdd/respawn (I haven't tried it yet).

Shaarli is not a web scraping system, and I think it's fine this way.

urza commented 9 years ago

Yes, that makes sense. Keep Shaarli simple and provide a good interface for other tools to do the downloading/archiving... It is a project in itself. I totally agree, let's keep the jobs separate.

@nodiscc .. Any time estimate for when Shaarchiver could be able to do this? Personally I would go the wget route... wget has IMO solved all the problems; it is a battle-tested solution that can download the "full page content", so it fits the philosophy... but I don't know about the other options you mention, maybe they are just as good...
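
Something like this is what I have in mind for the wget route, per link (the flags and the per-link directory naming are just one possible choice, not what Shaarchiver actually does):

```python
# Rough sketch: mirror one bookmarked page with wget, one directory per
# link. The flags fetch the page plus its requisites (images, CSS, JS),
# rewrite links for offline viewing, and allow requisites hosted on other
# domains (CDNs). The per-link directory naming is hypothetical.
import hashlib
import subprocess

def save_full_page(url, base_dir="archive"):
    dest = "%s/%s" % (base_dir, hashlib.sha1(url.encode()).hexdigest())
    subprocess.run(
        [
            "wget",
            "--page-requisites",   # images, stylesheets, scripts
            "--convert-links",     # rewrite links to the local copies
            "--adjust-extension",  # add .html/.css extensions where missing
            "--span-hosts",        # requisites often live on other hosts
            "--timeout=30",
            "--tries=2",
            "--directory-prefix", dest,
            url,
        ],
        check=False,  # a failed page should not abort the whole run
    )
    return dest

# Example: save_full_page("https://example.com/some/article")
```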

nodiscc commented 9 years ago

@urza no clear ETA for shaarchiver, but page archiving should definitely be working before 2016 :) I'm low on free time.

For the record, there was an interesting discussion regarding link rot at https://news.ycombinator.com/item?id=8217133 (link rot is a real problem; it is even more obvious for some content types such as multimedia/audio/video works: ~10% of my music/video bookmarks are now broken).

I think it will use wget. I need to figure out a sensible directory organization, and it will take some work to filter/organize the downloaded files (I don't want to download 15 GB of ads...).

mro commented 9 years ago

Maybe blacklists like https://github.com/sononum/abloprox/ or http://wpad.mro.name/ can help? The latter uses a blacklist during proxy configuration. I don't know whether wget can handle this, or how much the operating system helps here.

nodiscc commented 9 years ago

@mro the ad blocking system will likely use dnsmasq and hosts files. I already collected some relevant lists and conversion tools: https://github.com/nodiscc/shaarchiver/blob/master/ad-hosts.txt https://github.com/Andrwe/privoxy-blocklist/blob/master/privoxy-blocklist.sh. Adding abloprox to these. I will need to investigate a bit more - other suggestions are welcome.
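
As an illustration of one way such lists could be used by a wget-based archiver (not how shaarchiver currently works), hosts-format entries can be converted to a domain list and passed to wget's --exclude-domains option:

```python
# Rough sketch: turn a hosts-format blocklist (such as ad-hosts.txt) into a
# domain list that a wget-based archiver could skip via --exclude-domains.
# One possible approach only; a dnsmasq/hosts setup blocks at the resolver
# level instead.
def blocked_domains(hosts_path):
    """Parse '0.0.0.0 ads.example.com' style lines into a set of domains."""
    domains = set()
    with open(hosts_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            parts = line.split()
            if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
                domains.update(parts[1:])
    domains.discard("localhost")
    return domains

# Example: pass ",".join(sorted(blocked_domains("ad-hosts.txt")))
# to wget via --exclude-domains so known ad hosts are never fetched.
```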

github-account1111 commented 3 years ago

It might be worth using a web scraper (akin to the one Evernote uses) rather than a separate client. A faithful copy made from your own browser is always preferable to blindly passing a URL to a server, because the server fetches the page without your session, extensions, or custom styling.

So you might (I would argue you will) end up in a situation where what you find in your archive is nothing like what you saw in your browser, especially if you tend to use custom CSS for things like dark mode or tools like uBlock Origin.

virtadpt commented 3 years ago

There is already a plugin which sends the bookmarked link to an arbitrary Wallabag instance.