turtl / tracker

This project is for tracking issues, bug reports, and progress on the entire Turtl project.

Import from ArchiveBox #307

Open fire-pig opened 5 years ago

fire-pig commented 5 years ago

ArchiveBox is a command-line-based, open-source, self-hosted web archiver:

https://archivebox.io/ https://github.com/pirate/ArchiveBox

I would like to be able to:

"ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).

You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, authenticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more."

orthecreedence commented 5 years ago

Hi. This is a cool idea, but I think it is a bit out of scope for the Turtl project. Interfacing with other apps is difficult because Turtl uses an encrypted storage format that ArchiveBox would need to be able to manipulate directly. It's possible to export a C API from the Turtl core that would allow doing this at a low level, but it would still take coordination with the ArchiveBox devs to achieve with any success.

Import/export from/to ArchiveBox might be more feasible, so for now I'll mark this issue as an import format request and go from there.