kunpeng9 / GTD2020-05-31

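Created 2020-05-31 [Managing GitHub project links in TickTick, Evernote, and similar tools proved unworkable in practice: awkward to use and eventually abandoned entirely]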
The Unlicense

[Self-hosted Internet Archive] c9fe/22120: 22120 - Self-host the Internet with an Offline Archive. Like binaries? https://github.com/dosyago/22120/releases Similar to ArchiveBox, SingleFile and WebMemex, but gooderer. #105

Open kunpeng9 opened 3 years ago

kunpeng9 commented 3 years ago

🏛️ 22120

🏛️ - An archivist browser controller that caches everything you browse, a library server with full text search to serve your archive.

News - 22120 plus interview featured in Console - the open source newsletter

News - new binaries

License

Copyright (c) 2018, 2020, Dosyago and/or its affiliates. All rights reserved.

This is a release of 22120, a web archiver.

License information can be found in the LICENSE file.

This software is dual-licensed. For information about commercial licensing, see Dosyago Commercial License for OEMs, ISVs and VARs.

About

This project literally makes your web browsing available COMPLETELY OFFLINE. Your browser does not even know the difference. It's literally that amazing. Yes.

Save your browsing, then switch off the net, go to http://localhost:22120, switch the mode to serve, and browse what you browsed before. It all still works.

Warning: if you have Chrome open, 22120 will close it automatically when it starts, then relaunch it. You may lose any unsaved work.

Get 22120

3 ways to get it:

  1. Get a binary from the releases page, or
  2. Run with npx: npx archivist1@latest, or
    • npm i -g archivist1@latest && archivist1
  3. Clone this repo and run as a Node.js app: npm i && npm start

Also, coming soon is a Chrome Extension.

Using

Pick save mode or serve mode

Go to http://localhost:22120 in your browser, and follow the instructions.

Exploring your 22120 archive

The archive will be located in 22120-arc/public/library*

But it's not public, don't worry!

You can also check out the archive index, for a listing of every title in the archive. The index is accessible from the control page, which by default is at http://localhost:22120 (unless you changed the port).

*Note: 22120-arc is the archive root of a single archive, and by default it is placed in your home directory. But you can change the parent directory for 22120-arc to have multiple archives.
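
As a rough illustration of the layout described above, the library directory of an archive could be resolved like this in Node.js. The `libraryDir` helper and its `parentDir` parameter are hypothetical names for this sketch, not part of 22120; only the 22120-arc/public/library path and the home-directory default come from the text.

```javascript
const os = require('os');
const path = require('path');

// Hypothetical helper: resolve the library directory of one archive.
// parentDir defaults to the home directory, as the note above describes.
function libraryDir(parentDir = os.homedir()) {
  return path.join(parentDir, '22120-arc', 'public', 'library');
}

console.log(libraryDir());
```

Passing a different parent directory, for example `libraryDir('/data/work')`, would point at a second, separate archive.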

Format

The archive format is:

22120-arc/public/library/<resource-origin>/<path-hash>.json

Inside the JSON file is a JSON object with headers, response code, key and a base64-encoded response body.
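
A minimal sketch of reading one such record in Node.js follows. The text above names the kinds of fields but not their exact property names, so `headers`, `status`, `key` and `body` are assumptions made for illustration, as is the `decodeRecord` helper.

```javascript
// Hypothetical property names: headers, status, key, body (base64).
function decodeRecord(jsonText) {
  const record = JSON.parse(jsonText);
  return {
    key: record.key,
    status: record.status,
    headers: record.headers,
    // Decode the base64 response body back into raw bytes.
    body: Buffer.from(record.body, 'base64'),
  };
}

// Example record built in memory; a real one would be read from
// 22120-arc/public/library/<resource-origin>/<path-hash>.json
const sample = JSON.stringify({
  key: 'GET https://example.com/',
  status: 200,
  headers: { 'content-type': 'text/html' },
  body: Buffer.from('<h1>hello</h1>').toString('base64'),
});

const decoded = decodeRecord(sample);
console.log(decoded.status, decoded.body.toString('utf8'));
```

Keeping the body as a `Buffer` rather than a string means binary resources such as images survive the round trip unchanged.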

Why not WARC (or another format like MHTML)?

The case for the 22120 format.

Other formats (like MHTML and SingleFile) save translations of the resources you archive. They create modifications, such as altering the internal structure of the HTML, changing hyperlinks and URLs into "flat" embedded data URIs or local references, and require other "hacks" in order to save a "perceptually similar" copy of the archived resource.

22120 throws all that out, and calls rubbish on it. 22120 saves a verbatim high-fidelity copy of the resources you archive. It does not alter their internal structure in any way. Instead it records each resource in its own metadata file. In that way it is more similar to HAR and WARC, but still radically different. Compared to WARC and HAR, our format is radically simplified, throwing out most of the metadata information and unnecessary fields these formats collect.

Why?

At 22120, we believe in the resources and in verbatim copies. We don't anoint ourselves as all-knowing enough to modify the resource source of truth before we archive it, just so it can "fit the format" we choose. We don't believe we need to decorate with obtuse and superfluous metadata. We don't believe we should be modifying or altering resources we archive. We believe we should save them exactly as they were presented. We believe in simplicity. We believe the format should fit (or at least accommodate, and be suited to) the resource, not the other way around. We don't believe in conflating metadata with content; so we separate them. We believe separating metadata and content, and keeping the content pure and unaltered throughout the archiving process, is not only the right thing to do, it simplifies every part of the audit trail, because we know that the modifications between archived copies of a resource are due to changes to the resources themselves, not artefacts of the format or archiving process.

Both SingleFile and MHTML require mutilating modifications of the resources so that the resources can be "forced to fit" the format. At 22120, we believe this is not required (and in any case should never be performed). We see it as akin to lopping off the arms of a Roman statue in order to fit it into a presentation and security display box. How ridiculous! The web may be a more "pliable" medium but that does not mean we should treat it without respect for its inherent content.

Why is changing the internal structure of resources so bad?

In our view, the internal structure of the resource as presented is the canon. Internal structure is not just substitutable "presentation". In fact, it encodes vital semantic information such as hyperlink relationships, source choices, and the "strokes" of the resource author as they create their content, even if it's mediated through a web server or web framework.

Why else is 22120 the obvious and natural choice?

22120 also archives resources exactly as they are sent to the browser. It runs connected to a browser, and so is able to access the full scope of resources the browser receives (with the exception, for now, of video, audio and websockets) in their highest fidelity, without modification, and is able to archive them in the exact format presented to the user. Many resources undergo presentational and processing changes before they are presented to the user. This is the ubiquitous "web app", where client-side scripting enabled by JavaScript creates resources and resource views on the fly. These sorts of "hyper resources" or "realtime" or "client side" resources, prevalent in SPAs, cannot be archived by traditional wget-based archiving tools, at least not with the normal archive flow.

In short, the web is an online medium, and it should be archived and presented in the same fashion. 22120 archives content exactly as it is received and presented by a browser, and it also replays that content exactly as if the resource were being taken from online. Yes, it requires a browser for this exercise, but that browser need not be connected to the internet. It is only natural that viewing a web resource requires the web browser. And because of 22120 the browser doesn't know the difference! Resources presented to the browser from a remote web site, and resources given to the browser by 22120, are seen by the browser as exactly the same. This ensures that the people viewing the archive are also not let down and are given the chance to have the exact same experience as if they were viewing the resource online.

How it works

22120 uses the DevTools protocol to intercept all requests, and caches responses on disk against a key made of the request method and URL. It also maintains an in-memory set of keys so it knows what it has on disk.
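
The caching idea can be sketched as follows. This illustrates keying responses by method and URL and tracking the keys in an in-memory set; it is not 22120's actual code. The `ResponseCache` class, the key string format, and the use of a `Map` in place of disk storage are all assumptions.

```javascript
// Sketch only: a Map stands in for the on-disk store.
class ResponseCache {
  constructor() {
    this.keys = new Set();   // in-memory set of keys known to be stored
    this.store = new Map();  // stand-in for files on disk
  }

  // Assumed key format: method and URL joined by a space.
  static keyFor(method, url) {
    return `${method.toUpperCase()} ${url}`;
  }

  save(method, url, response) {
    const key = ResponseCache.keyFor(method, url);
    this.store.set(key, response);
    this.keys.add(key);
  }

  // Check the set first so a miss never touches the store.
  lookup(method, url) {
    const key = ResponseCache.keyFor(method, url);
    return this.keys.has(key) ? this.store.get(key) : undefined;
  }
}

const cache = new ResponseCache();
cache.save('get', 'https://example.com/', { status: 200, body: 'hi' });
console.log(cache.lookup('GET', 'https://example.com/'));
console.log(cache.lookup('POST', 'https://example.com/'));
```

Because the method is part of the key, a GET and a POST to the same URL are stored as separate entries.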

FAQ

Do I need to download something?

Yes. But... if you like 22120, you might love the clientless hosted version coming in future. You'll be able to build your archives online from any device, without any download, then download the archive to run on any desktop. You'll need to sign up to use it, but you can jump the queue and sign up today.

Can I use this with a browser that's not Chrome-based?

No.

But... see #57. Just to set expectations: this is only an investigation, and it might never get done. But your voices made a difference, as I wasn't even considering it before.

How does this interact with ad blockers?

Interacts just fine. The things ad blockers stop will not be archived.

How secure is running Chrome with the remote debugging port open?

Seems pretty secure. It's not exposed to the public internet, and pages you load that try to use it cannot use the protocol for anything (except to open a new tab, which they can do anyway). It seems there's a potential risk from malicious browser extensions, but we'd need to confirm that and, if that's so, work out blocks. See this useful security related post for some info.

Is this free?

Yes, this is totally free to download and use. It's also open source (under AGPL-3.0) so do what you want with it. For more information about licensing, see the license section.

What if it can't find my Chrome?

See this useful issue.

What's the roadmap?

What about streaming content?

The following are probably hard (and I haven't thought much about):

Probably some way to do this tho.

Can I blacklist domains to not archive them?

Yes! Put any domains into 22120-arc/no.json*, e.g.:

[
  "*.horribleplantations.com",
  "*.cactusfernfurniture.com",
  "*.gustymeadows.com",
  "*.nytimes.com",
  "*.cnn.co?"
]

Will not cache any resource with a host matching those. Wildcards:

*Note: the no file is per-archive. 22120-arc is the archive root of a single archive, and by default it is placed in your home directory. But you can change the parent directory for 22120-arc to have multiple archives, and each archive requires its own no file if you want a blacklist in that archive.
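
One plausible way to match hosts against such patterns is sketched below. The wildcard semantics are an assumption, since the description above does not spell them out: `*` is taken to match any run of characters and `?` exactly one character. `patternToRegExp` and `isBlocked` are hypothetical helpers, not 22120's implementation.

```javascript
// Assumed semantics: '*' = any run of characters, '?' = exactly one.
function patternToRegExp(pattern) {
  // Escape regex metacharacters except the two wildcards.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const source = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp(`^${source}$`, 'i');
}

function isBlocked(host, patterns) {
  return patterns.some((p) => patternToRegExp(p).test(host));
}

const noList = ['*.nytimes.com', '*.cnn.co?'];
console.log(isBlocked('www.nytimes.com', noList));
console.log(isBlocked('edition.cnn.com', noList));
console.log(isBlocked('example.org', noList));
```

Under these assumed semantics, a pattern such as `*.nytimes.com` matches subdomains but not the bare host `nytimes.com`, which would need its own entry.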

Is there a DEBUG mode for troubleshooting?

Yes, just set an environment variable called DEBUG_22120 to anything non-empty.

For example, on POSIX systems:

export DEBUG_22120=True
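
For readers curious how such a switch is typically read inside a Node.js program, here is a small sketch. Only the variable name DEBUG_22120 and the "anything non-empty" rule come from the text; `isDebugEnabled` and `debugLog` are hypothetical helpers.

```javascript
// Debug is on when DEBUG_22120 is set to any non-empty string.
function isDebugEnabled(env = process.env) {
  const value = env.DEBUG_22120;
  return typeof value === 'string' && value.length > 0;
}

function debugLog(...args) {
  if (isDebugEnabled()) console.error('[debug]', ...args);
}

debugLog('only printed when DEBUG_22120 is set');
```

Note that under this rule even a value like `False` or `0` turns debugging on, because any non-empty string counts.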

Can I version the archive?

Yes! But you need to use git for versioning. Just initialize a git repo in your archive directory. And when you want to save a snapshot, make a new git commit.

Can I change the archive path?

Yes, there's a control for changing the archive path in the control page: http://localhost:22120

Can I change this other thing?

There are a few command line arguments. You'll see the format printed as the first line of output when you start the program.

For other things you can examine the source code.

https://github.com/c9fe/22120