ustclug / discussions

Issue Tracker for USTC LUG

Update the Julia mirror #311

Closed Roger-luo closed 3 years ago

Roger-luo commented 4 years ago

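Starting with 1.4, Julia will adopt a new package server. Its companion Docker images are already fully packaged and include the complete server front end, etc.: https://github.com/staticfloat/PkgServerS3Mirror . It should work once it is deployed.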

cc: @staticfloat

zhsj commented 4 years ago

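I'm not familiar with Julia, but the repo you linked contains nginx reverse-proxy configuration, which would not be used on the ustclug mirrors server. For reverse proxying we have a unified nginx configuration (see the "reverse proxy list" on the mirrors home page).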

zhsj commented 4 years ago

I took a quick look at the new https://github.com/JuliaPackaging/PkgServer.jl , and I don't think this is an architecture we would use. We prefer the approach of the original https://github.com/sunoru/julia-mirror .

staticfloat commented 4 years ago

My Chinese is not good enough to reply in Chinese, but I can explain a bit more in English what's going on here:

The most important thing here is PkgServer.jl; this will become the new default way that packages and artifacts are served to Julia 1.5+. (It will be opt-in in Julia 1.4 and the default in Julia 1.5+.) It has many advantages over the old system, where packages are fetched from many different servers. At the moment, all it does is serve full package and artifact versions, similar to how you'd get them from GitHub, but we have many enhancements in the works, such as intelligent diff bundles, that will minimize the amount of downloading clients need to do.

A PkgServer.jl deployment automatically caches objects locally, so it serves as a kind of edge-cache. We are planning on deploying them in multiple places worldwide, but since cloud hosting in China is more complex than in other parts of the world, we're not quite able to deploy there as we'd like to. Ideally, we will have a PkgServer in many different parts of the world, providing high-speed, cached package versions for all users, and the main pkg.julialang.org endpoint will forward users to more localized versions based on their source IP.

The nginx configuration in PkgServer.jl is simply an HTTPS terminator. You can ignore that; it doesn't matter. The PkgServerS3Mirror allows mirroring of an s3 bucket, for downloading Julia itself (instead of packages or artifacts). Since you already have the julia-mirror, that's not so necessary, but it might be simpler than the existing mirror_julia.py script (since it's all automatic, with no need for configuration or running it ahead of time).

zhsj commented 4 years ago

A PkgServer.jl deployment automatically caches objects locally, so it serves as a kind of edge-cache.

I understand the intention behind developing PkgServer.jl. It helps you start a standalone server quickly. But that's not what we do on mirrors.ustc, which is a shared server for many OSS projects. What we prefer (though it is not a hard requirement) is a cron-job-like script that syncs from upstream frequently. We don't like running a separate daemon on our server, and we prefer not to run a cache/proxy server, since in our experience it doesn't work well (although we do have proxy services).

We are planning on deploying them in multiple places worldwide, but since cloud hosting in China is more complex than in other parts of the world, we're not quite able to deploy there as we'd like to. Ideally, we will have a PkgServer in many different parts of the world, providing high-speed, cached package versions for all users, and the main pkg.julialang.org endpoint will forward users to more localized versions based on their source IP.

If you find it hard to deploy in mainland China, I suggest deploying in Hong Kong. Many cloud providers have data centers in Hong Kong (for example Google Cloud's asia-east2 region), and the network speed to a Hong Kong PoP is sufficient for most users in China.

johnnychen94 commented 4 years ago

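If we give up PkgServer's features, a static server can also do the job. At the moment there is a fairly rough script that downloads all the required resources, and this approach can be run from cron. Once the downloaded files are stored, the layout looks roughly like this: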

.
└── julia
    ├── artifact
    ├── package
    ├── registries
    ├── registry
    └── releases

Then, assuming this is served at http://mirrors.ustc.edu.cn/julia/, users can also use the mirror on the client side (there is currently a warning, see https://github.com/JuliaLang/Pkg.jl/pull/1671 ):

JULIA_PKG_SERVER=https://mirrors.ustc.edu.cn/julia/ julia

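The catch is that this approach may balloon quickly later on, like pypi (there is already about 100G of data).

Is this approach acceptable? If so, I can spend some time polishing the script.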

zhsj commented 4 years ago

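The catch is that this approach may balloon quickly later on, like pypi (there is already about 100G of data).

I don't think 100G is large compared with the several TB that pypi takes. But if it keeps growing, we may reconsider, as we did with pypi.

Reverse proxying is not unacceptable. As I mentioned earlier, we do run reverse-proxy services, for example for ubuntu ppa, npm, and cargo. Our reverse-proxy server is in Japan, and the connection speed from within China is sometimes poor (at least my home network often cannot connect). So it would not improve the experience much (at best it goes from unreachable to reachable).

Is this approach acceptable? If so, I can spend some time polishing the script.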

zhsj commented 4 years ago

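From the Julia community's point of view, I think putting a cloudflare CDN in front is the most convenient option (cloudflare's speed within China is decent, far better than fastly and the like). If a member in China could register a domain and complete the ICP filing, putting Baidu Yunjiasu (百度云加速, i.e. cloudflare's nodes in China) in front would be even more convenient.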

johnnychen94 commented 4 years ago

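One problem with julia-mirror is that its Julia versions have not been updated for a long time, because it requires manually updating releaseinfo.json. I have a somewhat more automated tool, jill.py. It is not a full replacement for julia-mirror; it only downloads all Julia releases from 1.0 onward. If possible, could it be used to update http://mirrors.ustc.edu.cn/julia/releases/ ? Putting it in the same cron job should be enough.

The default file storage layout is the same as julia-mirror's: jill mirror <outpath>

It is currently 6.2G and should grow fairly slowly.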

Roger-luo commented 4 years ago

We now have a domain with ICP filing, juliacn.com , but what does putting Baidu Cloud in front involve?

zhsj commented 4 years ago

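We now have a domain with ICP filing, juliacn.com , but what does putting Baidu Cloud in front involve?

Try this? https://su.baidu.com/ (I had not followed it for a long time and just found that the free tier of Baidu Yunjiasu is now limited to 50G of traffic per day. We used to use it to offload traffic from mirrors.ustc.)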

zhsj commented 4 years ago

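One problem with julia-mirror is that its Julia versions have not been updated for a long time, because it requires manually updating releaseinfo.json.

You mean 1.3.1 and v1.4.0-rc1 are missing?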

johnnychen94 commented 4 years ago

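One problem with julia-mirror is that its Julia versions have not been updated for a long time, because it requires manually updating releaseinfo.json.

You mean 1.3.1 and v1.4.0-rc1 are missing?

Right. It can be done by manually updating releaseinfo.json, but that is always a bit of a hassle... Also, the early versions have been deleted here, which is not very reliable for static storage... though it is not a big problem.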

sunoru commented 4 years ago

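About the releaseinfo.json issue in julia-mirror: you can actually update it yourself with scripts/make_releaseinfo.py. I really should make it update automatically (or at least have it remind me to update it manually)...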

johnnychen94 commented 4 years ago

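I have roughly finished the tools needed for mirroring, and they work in tests on our internal network. However, the original julia-mirror needs to be retired, because it occupies the name "registries".

Julia binaries / installers

jill.py is a one-step installer for Julia. It also provides julia-mirror's functionality for downloading Julia binaries, and the key point is that it discovers new versions automatically.

The default configuration matches the existing one, so running jill mirror /path/to/mirrors/julia/releases is all that is needed.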

https://github.com/johnnychen94/jill.py#for-who-are-interested-in-setting-up-a-new-release-mirror

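Resources needed to install Julia packages

As explained above, the best approach is to deploy https://github.com/JuliaPackaging/PkgServer.jl directly. However, given that reverse proxies and similar long-running background tools are ruled out, the Julia package https://github.com/johnnychen94/StorageServer.jl#mirror can be used to pull all the required resources.

The example script involves three folders:

Configuring Julia to use the mirror: set the environment variable JULIA_PKG_SERVER="https://mirrors.ustc.edu.cn/julia". If versioninfo() lists this entry, the setting has taken effect.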

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_PKG_SERVER = https://mirrors.ustc.edu.cn/julia
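
Beyond checking versioninfo(), one way to confirm that the mirror itself answers is to request the /registries resource, which the Pkg server protocol serves at the server root. A minimal sketch, assuming curl is installed; the helper name probe_pkg_server is made up for illustration:

```shell
# Hypothetical helper: fetch the list of registries a Pkg server advertises.
# A static mirror serves this as a plain file named "registries" at its root.
probe_pkg_server() {
    local server="${1%/}"      # drop one trailing slash, if present
    curl -fsSL "$server/registries"
}

# Usage (performs a network request, so it is left commented out):
# probe_pkg_server "https://mirrors.ustc.edu.cn/julia"
```

On a populated mirror, each line of the response should have the form /registry/<uuid>/<hash>.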

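With a good enough network, the initial build takes about 8 hours; subsequent incremental updates (4 threads) take about 8 minutes. The upstream registry is updated every 12 minutes, but lowering the pull frequency to once every 1 or 2 hours should not matter much.

Disk usage: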

root@storage:/mnt/mirrors/julia# du -sh *
221G  artifact
22G   package
4.0K  registries
4.9M  registry
8.6G  releases 

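The artifact directory takes a lot of disk space, and its growth will also be significant, because it stores the binary dependencies for every platform. For example, each new version of CUDA_jll adds about 800M to artifact.

Hmmm.. When I mentioned it two months ago it was 100G; now it has ballooned to 250G...

By the design of the pkg/storage server protocol, everything should in principle be pulled in. If disk usage becomes too large later, I can add a feature such as "keep only the latest 100 versions" to free up space... but that would certainly affect the user experience to some degree...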
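
The selection step of such a "keep only the latest N versions" policy could be sketched as below. This is not an existing StorageServer.jl feature; the function name keep_latest is hypothetical, it relies on GNU sort's -V option, and mapping the surviving versions back to stored files is left out.

```shell
# Hypothetical helper: read version strings from stdin (one per line)
# and print only the newest N, oldest first.
# Uses GNU sort -V (version sort); pre-release tags get no special handling.
keep_latest() {
    local n="$1"
    sort -V | tail -n "$n"
}

# Example: keep the two newest of four versions.
printf '%s\n' 1.0.5 1.10.0 1.2.0 1.9.1 | keep_latest 2
```

Version sort matters here: a plain lexical sort would rank 1.10.0 below 1.2.0.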


Both tools can be configured in cron.d, which you should be quite familiar with...
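
A cron.d setup for the release mirror might look like the sketch below. The schedule, the user name mirror, the script path, and the lock file location are all placeholders rather than values from this thread; flock (from util-linux) keeps a slow run from overlapping with the next one.

```shell
# Hypothetical /etc/cron.d/julia-mirror entry, shown as a comment:
#   0 */2 * * *  mirror  /usr/local/bin/sync-julia.sh

# Run a command under an exclusive, non-blocking lock.
# If a previous sync still holds the lock, flock exits non-zero at once.
run_locked() {
    local lock="$1"
    shift
    flock -n "$lock" "$@"
}

# Usage inside sync-julia.sh (left commented out; needs jill installed):
# run_locked /tmp/julia-releases.lock jill mirror /path/to/mirrors/julia/releases
```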

If anything in the description is unclear, or the tools fall short somewhere, please tell me and I will improve them.

johnnychen94 commented 4 years ago

@staticfloat Out of curiosity, is it possible to expose pkg.julialang.org to the public over rsync? That would significantly simplify the setup of mirror sites. An s3 bucket, if accessible, would be helpful too.

staticfloat commented 4 years ago

I'm a little wary of allowing non-HTTP methods, as we have synchronization locks and whatnot within the HTTP server to ensure that, even while we're updating files, you never get a half-baked file. If we provided alternative methods (such as rsync) it's possible the rsync process can get a half-written file. That's solvable, but why do you want to use rsync? You may end up pulling files that are no longer reachable from the registry and whatnot that we want to keep around for paranoia's sake, but which are most likely not needed by your local cache. I would think it would be better for you to just walk the registry and download everything that is reachable (similar to how the gen_static.jl script works).
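
As an illustration of walking the registry, the sketch below turns one package's Versions.toml from a registry checkout into the resource paths a Pkg server exposes for it (/package/<uuid>/<tree-hash>). It is not the gen_static.jl script; the function name and the example UUID are invented, and it assumes the General-registry layout in which each version entry carries a git-tree-sha1 line.

```shell
# Hypothetical helper: print /package/<uuid>/<tree-hash> for every version
# listed in a package's Versions.toml.
#   $1: package UUID (taken from the package's Package.toml)
#   $2: path to the package's Versions.toml
package_paths() {
    local uuid="$1" versions_toml="$2"
    # Keep only lines of the form: git-tree-sha1 = "<40 hex digits>"
    sed -n "s|^git-tree-sha1 = \"\([0-9a-f]\{40\}\)\"\$|/package/$uuid/\1|p" "$versions_toml"
}

# Usage (placeholder UUID and path):
# package_paths 00000000-0000-0000-0000-000000000000 General/E/Example/Versions.toml
```

Prefixing each printed path with the server URL gives a download URL. Artifacts are addressed separately as /artifact/<hash> and are not covered by this sketch.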

johnnychen94 commented 4 years ago

bump @zhsj

johnnychen94 commented 4 years ago

Update:

The BFSU mirror based on the StorageServer.jl mentioned above is now up: https://mirrors.bfsu.edu.cn/help/julia/

johnnychen94 commented 4 years ago

@zhsj Any plans to update this mirror?

With https://github.com/johnnychen94/StorageMirrorServer.jl this should be pretty easy to set up. The only issue is that network connection to upstream storage server might not be that stable and fast from mainland China.

Currently, BFSU, TUNA, and SJTUG mirrors are built with this tool.

FWIW, StorageMirrorServer does not provide Julia binary releases http://mirrors.ustc.edu.cn/julia/releases/, which could be easily set up with aws s3 sync.

taoky commented 4 years ago

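jill.py is a one-step installer for Julia. It also provides julia-mirror's functionality for downloading Julia binaries, and the key point is that it discovers new versions automatically.

The default configuration matches the existing one, so running jill mirror /path/to/mirrors/julia/releases is all that is needed.

I just tested syncing Julia's releases locally with jill mirror <path> and found that the resulting directory structure differs quite a lot from the one at https://mirrors.bfsu.edu.cn/julia-releases/ .

bash-4.4# tree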
.
└── releases
    ├── v0.6
    │   ├── julia-0.6.3-freebsd-x86_64.tar.gz
    │   └── julia-0.6.3-freebsd-x86_64.tar.gz.asc
    ├── v0.7
    │   ├── julia-0.7.0-freebsd-x86_64.tar.gz
    │   └── julia-0.7.0-freebsd-x86_64.tar.gz.asc
    ├── v1.0
    │   ├── julia-1.0.0-freebsd-x86_64.tar.gz
    │   ├── julia-1.0.0-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.0.1-freebsd-x86_64.tar.gz
    │   ├── julia-1.0.1-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.0.2-freebsd-x86_64.tar.gz
    │   ├── julia-1.0.2-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.0.3-freebsd-x86_64.tar.gz
    │   ├── julia-1.0.3-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.0.4-freebsd-x86_64.tar.gz
    │   ├── julia-1.0.4-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.0.5-freebsd-x86_64.tar.gz
    │   └── julia-1.0.5-freebsd-x86_64.tar.gz.asc
    ├── v1.1
    │   ├── julia-1.1.0-freebsd-x86_64.tar.gz
    │   ├── julia-1.1.0-freebsd-x86_64.tar.gz.asc
    │   ├── julia-1.1.1-freebsd-x86_64.tar.gz
    │   └── julia-1.1.1-freebsd-x86_64.tar.gz.asc
(remainder omitted)

Is this expected?

johnnychen94 commented 4 years ago

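The layout synced by jill mirror matches the structure currently provided by PkgMirror: http://mirrors.ustc.edu.cn/julia/releases/

To match BFSU's julia-releases, you need to use aws s3 sync. I'm not quite sure how this should be done; it is probably something like this: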

aws s3 sync s3://julialang2 /mnt/mirrors/julia/julialang2

This s3 bucket is in the us-east-1 region.

Please use aws s3 sync if possible. I may remove the jill mirror feature later (it is rather redundant... I did not know about aws s3 sync when I wrote it...)
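
A slightly fuller form of the sync command could look like the sketch below. The bucket and region come from the comment above; --no-sign-request assumes the bucket allows anonymous reads, which this thread does not confirm, and the wrapper name sync_julia_releases is made up.

```shell
# Hypothetical wrapper around the AWS CLI: mirror the julialang2 bucket
# into a local directory.
#   --region us-east-1   region the bucket lives in
#   --no-sign-request    skip credentials (assumes anonymous read access)
sync_julia_releases() {
    local dest="$1"
    aws s3 sync s3://julialang2 "$dest" --region us-east-1 --no-sign-request
}

# Usage (left commented out; needs the AWS CLI and network access):
# sync_julia_releases /mnt/mirrors/julia/julialang2
```

Adding --delete would also remove local files that no longer exist in the bucket; whether that is wanted depends on the mirror's retention policy.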

johnnychen94 commented 4 years ago

TUNA and SJTUG currently sync like this:

taoky commented 4 years ago

https://mirrors.ustc.edu.cn/julia/ The mirror built with StorageMirrorServer.jl (the initial sync is in progress and may take some more time before it is ready for use)

https://mirrors.ustc.edu.cn/julia-legacy/ The original, old Julia mirror

https://mirrors.ustc.edu.cn/julia-releases/ Releases directory (synced from s3://julialang2)

johnnychen94 commented 4 years ago

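The Chinese community currently has several pkgservers (cache servers) in China that are handed over to the Julia team for unified maintenance, i.e. https://pkg.julialang.org . I would like to ask your opinion here: could USTC also be added as an upstream?

Roughly, the situation is:

  • All Julia users in China use this set of cache servers by default when no mirror is configured.
  • No additional maintenance work is required from USTC, and no reliability guarantee is needed. The only purpose is to speed up access and downloads for ordinary users in China.
  • Pkgserver effectively acts as a proxy, so the amount of user data collected by the mirror site (if that is something you need) may decrease.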

CRef: https://github.com/tuna/issues/issues/878

johnnychen94 commented 4 years ago

https://mirrors.ustc.edu.cn/julia-legacy/ The original, old Julia mirror

PkgMirrors hard-codes the mirror URL, and PkgMirrors is probably no longer maintained, so this can most likely just be deleted.

cc: @sunoru

johnnychen94 commented 4 years ago

It looks like it syncs once a day. Could the sync frequency for julia be raised a bit, say to every 2-4 hours?

taoky commented 4 years ago

It looks like it syncs once a day. Could the sync frequency for julia be raised a bit, say to every 2-4 hours?

It has been adjusted to sync every 4 hours.

taoky commented 4 years ago

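The Chinese community currently has several pkgservers (cache servers) in China that are handed over to the Julia team for unified maintenance, i.e. https://pkg.julialang.org . I would like to ask your opinion here: could USTC also be added as an upstream?

Roughly, the situation is:

  • All Julia users in China use this set of cache servers by default when no mirror is configured.
  • No additional maintenance work is required from USTC, and no reliability guarantee is needed. The only purpose is to speed up access and downloads for ordinary users in China.
  • Pkgserver effectively acts as a proxy, so the amount of user data collected by the mirror site (if that is something you need) may decrease.

CRef: tuna/issues#878

OK, no problem.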

johnnychen94 commented 4 years ago

Apart from removing julia-legacy when the time comes, there should be nothing else left to do for this issue.

sunoru commented 4 years ago

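Thanks for all the hard work.

(Sorry for the late reply.)

Now that there is an official pkgserver and a new package management/storage protocol, StorageMirrorServer.jl looks great, and PkgMirrors.jl can indeed stop being maintained.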

taoky commented 3 years ago

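Apart from removing julia-legacy when the time comes, there should be nothing else left to do for this issue.

Julia 1.6 LTS has been officially released, and the julia-legacy mirror has been deleted.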