urza opened 9 years ago
Very interesting.. +1
Well... this is a difficult subject, but an interesting feature.
What can be done:
Any opinion would be welcome here.
Some thoughts and related tools:
Shaarchiver looks nice... does it save the full HTML page content in addition to the media you mention? If so, it solves the problem for me... I could set it up in a cron job or something...
AFAIK, downloading page content is on the TODO list
-1
@urza, @virtualtam what you describe doesn't fit with "The personal, minimalist, super-fast, no-database delicious clone" so it should IMHO not go into shaarli because it is not shaarli. I'd even go as far as saying it must not.
what you describe is "The personal, ... http://archive.org clone" (not respecting robots.txt) - which is another beast.
BTW, have you seen the support for archive.org in shaarli?
Are there any downsides to using archive.org? Otherwise I agree with @mro: why bother doing such hard work for something which already exists?
Again, archiving features should remain in a separate tool. Shaarli should provide export formats that make it easy for such tools to parse and work with the data.
shaarchiver is easy enough to use and just relies on Shaarli's HTML export (this could be improved to parse RSS). It doesn't support archiving webpages yet, mainly because I'd like to get it right on the first try (I haven't decided whether to use python-scrapy or an external tool like httrack/wget, it needs better error handling, etc...). Help is welcome.
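For anyone wanting to build on that export: Shaarli's HTML export is based on the Netscape bookmark file format, so it can be parsed with nothing but the standard library. A minimal sketch (the sample markup and field names here are illustrative, not shaarchiver's actual code):

```python
from html.parser import HTMLParser

class BookmarkExportParser(HTMLParser):
    """Minimal parser for a Netscape-style bookmark export
    (the format Shaarli's HTML export is based on)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == "a":
            attrs = dict(attrs)
            if "href" in attrs:
                self.links.append({"url": attrs["href"],
                                   "tags": attrs.get("tags", ""),
                                   "title": ""})
                self._in_link = True

    def handle_data(self, data):
        if self._in_link and self.links:
            self.links[-1]["title"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

# Illustrative sample of the export format
sample = '''<DL><p>
<DT><A HREF="https://example.com/" TAGS="demo,test">Example page</A>
</DL><p>'''
parser = BookmarkExportParser()
parser.feed(sample)
print(parser.links)
```

An external archiver could iterate over `parser.links` and fetch each URL, keeping Shaarli itself free of any scraping code.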
If you want a PHP-based tool, there is https://github.com/broncowdd/respawn (I haven't tried it yet).
Shaarli is not a web scraping system, and I think it's fine this way.
Yes, it makes sense. Keep Shaarli simple and provide a good interface for other tools to do the downloading/archiving... It is a project in itself. I totally agree, let's keep the jobs separated.
@nodiscc .. Any time estimate for when shaarchiver could do this? Personally I would go the wget route... wget has IMO solved all these problems, it is a battle-tested solution that can download full page content, so it fits the philosophy... but I don't know about the other options you mention, maybe they are just as good...
@urza no clear ETA for shaarchiver, but page archiving should definitely be working before 2016 :) I'm low on free time.
For the record there was an interesting discussion regarding link rot at https://news.ycombinator.com/item?id=8217133 (link rot is a real problem, it is even more obvious on some content types such as multimedia/audio/video works; ~10% of my music/video bookmarks are now broken)
I think it will use wget. I need to find a sensible directory organization, and it will take some work filtering/organizing the downloaded files (I don't want to download 15GB of ads..).
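On the directory organization question, one possibility is a per-date, per-host layout derived from the bookmarked URL. This is only a sketch of one such scheme (not shaarchiver's actual layout), assuming a flat `archive/` root:

```python
import re
from datetime import date
from urllib.parse import urlparse

def archive_dir(url, when=None, root="archive"):
    """Map a bookmarked URL to a per-link archive directory,
    e.g. archive/2015-09-01/example.com/some-article.html.
    This layout is only one possibility, not shaarchiver's actual scheme."""
    when = when or date.today()
    parsed = urlparse(url)
    # Keep only filesystem-safe characters from the URL path
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", parsed.path).strip("-") or "index"
    return "/".join([root, when.isoformat(), parsed.netloc, slug])

print(archive_dir("https://example.com/some/article.html", date(2015, 9, 1)))
# → archive/2015-09-01/example.com/some-article.html
```

Grouping by date first makes it easy to prune or re-run a single day's archiving job; grouping by host first would instead make per-site deduplication easier. Either way, the mapping stays deterministic, so re-archiving a link overwrites its old snapshot rather than duplicating it.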
Maybe blacklists like https://github.com/sononum/abloprox/ or http://wpad.mro.name/ can help? The latter uses a blacklist during proxy configuration. I don't know whether wget can handle such lists, or how the operating system could help here.
@mro the ad blocking system will likely use dnsmasq and hosts files. I have already collected some relevant lists and conversion tools: https://github.com/nodiscc/shaarchiver/blob/master/ad-hosts.txt https://github.com/Andrwe/privoxy-blocklist/blob/master/privoxy-blocklist.sh. I'm adding abloprox to these. I will need to investigate a bit more; other suggestions are welcome.
Might be worth using a web scraper (akin to the one Evernote uses) rather than a separate client. A faithful webpage copy is always preferable to blindly passing a URL to a server, since the server will not see the page the way your browser rendered it.
So you might (I would argue will) end up in a situation where what you find in your archive is nothing like what you saw in your browser, especially if you tend to use custom CSS for things like dark mode or tools like uBlock Origin.
There is already a plugin which sends the bookmarked link to an arbitrary Wallabag instance.
I know this is a bold one, but hear me out.
About half of my links that are older than 3 years are broken; the content has disappeared from the web. I discovered this the hard way, of course. Sometimes it doesn't matter, but sometimes the webpage in my bookmarks was really useful. The average lifetime of a URL on the internet is just a few years.
I know there is a discussion about archiving links to archive.org in #307, and I vouch for that; the enhancement would be very useful.
But I have one other possible suggestion: how about saving the full content locally, in a filesystem structure? Would it be hard? If we save the text, it would allow for fulltext search (not just of links, but also of their content), which would bring Shaarli to a completely new level of usability.

I currently do this with Evernote: links that I consider important, I save both to my Shaarli bookmarks (because I already have years of records in here, so I want to keep it consistent) and to my Evernote, which retrieves the whole content of the page, including pictures and media, allows me to organize by notebooks and tags, and even lets me decide how I want to clip the page (full, article, just text)... then I can fulltext search it... If Shaarli could match this somehow, it would be just fantastic...
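To illustrate why saving the text is enough to enable fulltext search: once pages are stored locally, even a stdlib-only tag stripper plus a substring scan works. This is only a minimal sketch with made-up sample pages; a real implementation would want a proper index (e.g. SQLite FTS) instead of scanning every page per query:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags from saved HTML so the text can be searched."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def page_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.chunks).split())  # collapse whitespace

def search(pages, query):
    """Return the URLs of saved pages whose visible text contains the query."""
    q = query.lower()
    return [url for url, html in pages.items() if q in page_text(html).lower()]

# Hypothetical archive: URL -> saved HTML content
pages = {"https://example.com/a": "<html><body><p>Self-hosted bookmark archiving</p></body></html>",
         "https://example.com/b": "<html><script>var x=1;</script><p>Unrelated</p></html>"}
print(search(pages, "bookmark"))  # → ['https://example.com/a']
```

Note that script and style contents are skipped, so searches match what a reader would actually have seen on the page, not JavaScript or CSS.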
Maybe a combination of saving the full text of the link to allow fulltext search, and taking a screenshot of the page with something like PhantomJS (http://phantomjs.org/screen-capture.html) to preserve the visual aspect of the page?
What do you guys think?