Closed: @Kapeli closed this issue 1 year ago.
How about just sharing the database dump, like a backup.torrent or whatever. Crates.io surely has a backup; that way some users would co-host it for free.
That's not an acceptable solution - note that @Kapeli isn't just asking for tossing the database over the wall, they specifically want to integrate docs.rs into an offline documentation viewer so that users can download the documentation for specific crates as needed. The docset format isn't exclusive to Dash, either - other projects such as Zeal also make use of it.
One note: I'm not asking for docs.rs to generate Dash docsets. I'm asking for it to provide downloadable docs.
I'd really love to see this done. I use Dash for everything else and it's really annoying dealing with third-party Rust crate docs since I can't view them anywhere outside a web browser.
I am really looking forward to this integration.
Can't wait for both integration and Rust syntax coloring support for Dash snippets :)
Is there any progress, or any new plan?
It'd also be nice to get the nightly rustc docs without having to build the rustc. Recursive wget is a(n inefficient) workaround, I suppose.
https://github.com/Robzz/cargo-docset was recently released; maybe that's enough for some people here, or maybe some of the code can be reused and integrated into docs.rs somehow.
Any update on this?
It's not too practical to generate a downloadable archive of a crate's documentation, as each file is stored individually on S3.
We'd need to fetch all the files individually and generate an archive of that on the fly, which is not practical for large crates. Preparing an archive at build time and storing it separately would increase our storage costs, and due to the unbounded nature of docs.rs we should try avoiding that.
If y'all have better implementation ideas I'd love to read them.
Could we prepare an archive at build time but only for crates that are opted in to this using some notion of "important to the community"? For example I'd love to see docs.rs provide a docset for tokei that Dash can keep automatically up-to-date. I don't know who'd provide that curation though. There could be some way to nominate crates and leave it up to the docs.rs maintainers to approve it, or maybe it could be based on traffic to a particular crate's documentation.
Could the archive be generated only when requested and have a fixed-size cache where older archives get removed?
Would the archives really be that big though? Docs are generally just text, which compresses very well. You could have a separate archive for the common resources (CSS, images, fonts and so on) and then the docs archives would just be compressed HTML files.
Could the archive be generated only when requested and have a fixed-size cache where older archives get removed?
That's not really feasible, as some crates (like `stm32f0`) have ~200k HTML files in them. They're all stored on S3, and just listing them took awscli 2 minutes and 43 seconds from the docs.rs server.
Would the archives really be that big though? Docs are generally just text, which compresses very well. You could have a separate archive for the common resources (CSS, images, fonts and so on) and then the docs archives would just be compressed HTML files.
Resources are already deduplicated on S3, and all files will be compressed soon™. Once we do that, storing the prebuilt archives will double our storage requirements. Today we can afford that, but thinking long term we'll want to avoid using too much storage.
just listing them took awscli 2 minutes and 43 seconds from the docs.rs server
Taking a long time is fine. For API access you can return a message saying the docs archive isn't ready yet and to try again later, for users trying to download the docs from their browser, show a page saying the same thing, maybe a bit nicer with automatic refresh and so on. With a big enough cache size, you could optimise both CPU and disk space needs.
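To make the idea concrete, a fixed-size cache with LRU eviction could be sketched like this (a toy in-memory version; the `ArchiveCache` type and the keys are made up for illustration, and a real docs.rs cache would hold archives on disk or S3 rather than in memory):

```rust
use std::collections::{HashMap, VecDeque};

/// Toy fixed-capacity cache for generated archives, evicting the least
/// recently used entry when full. (Illustrative only; a real cache would
/// store file paths or S3 keys, not in-memory bytes.)
struct ArchiveCache {
    capacity: usize,
    entries: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // front = least recently used
}

impl ArchiveCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, key: &str) -> Option<&Vec<u8>> {
        if self.entries.contains_key(key) {
            // Move the key to the back (most recently used).
            self.order.retain(|k| k != key);
            self.order.push_back(key.to_string());
            self.entries.get(key)
        } else {
            None
        }
    }

    fn insert(&mut self, key: String, archive: Vec<u8>) {
        // Evict the least recently used entry when inserting a new key
        // would exceed capacity.
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&key) {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.order.retain(|k| k != &key);
        self.order.push_back(key.clone());
        self.entries.insert(key, archive);
    }
}

fn main() {
    let mut cache = ArchiveCache::new(2);
    cache.insert("serde/1.0.0".to_string(), vec![1]);
    cache.insert("tokio/1.0.0".to_string(), vec![2]);
    cache.get("serde/1.0.0"); // touch serde so tokio becomes the LRU entry
    cache.insert("rand/0.8.0".to_string(), vec![3]); // evicts tokio
    assert!(cache.get("tokio/1.0.0").is_none());
    assert!(cache.get("serde/1.0.0").is_some());
    println!("eviction works");
}
```

Evicting the least recently used entry bounds disk usage while keeping popular crates' archives warm, so only rarely requested archives would need regenerating.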
It'd also be nice to get the nightly rustc docs without having to build the rustc
@nhynes this is out of scope for docs.rs, we only build user documentation. I'm not sure the right place to open a new issue, maybe https://github.com/rust-lang/www.rust-lang.org/issues ?
We discussed this internally and this probably won't see action at least until Rust All Hands in March.
Personally, I would like to see #379 implemented and #532 merged before we make any decisions, which would let us see how much storage we'll be using in the future.
Not directly related, but it'd be great to turn docs.rs into an offline-first PWA (Progressive Web App). So the user would still be able to browse the docs they have already visited before even when offline, without having to use a separate website or app.
The same could be done for doc.rust-lang.org
That can be achieved with `cargo doc` to build local crates and `rustup doc` for the book, std, and everything else on doc.rust-lang.org.
Unfortunately, that's not the best UX. As a user, I want to be able to keep using the familiar links I've already visited before from my browser history, and have them still work while offline. Since, let's be honest, we usually discover things using a search engine, rather than starting by browsing through or searching inside the docs.
Another alternative is a browser extension to redirect online version -> offline version, similar to what the IPFS Companion extension does.
For example: https://doc.rust-lang.org/std/sync/struct.RwLock.html -> file:///home/teohhanhui/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/rust/html/std/sync/struct.RwLock.html
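The core of such an extension's rewrite is just a prefix swap. A minimal sketch (the local root path is an assumption, modeled on a typical rustup layout):

```rust
/// Map a doc.rust-lang.org URL to a local file:// URL, as the suggested
/// browser extension might. The local root directory is an assumption
/// for illustration (a typical rustup toolchain doc path).
fn to_local_url(online: &str, local_root: &str) -> Option<String> {
    // Only rewrite URLs under doc.rust-lang.org; leave everything else alone.
    let rest = online.strip_prefix("https://doc.rust-lang.org/")?;
    Some(format!("file://{}/{}", local_root, rest))
}

fn main() {
    let root = "/home/user/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/rust/html";
    let local =
        to_local_url("https://doc.rust-lang.org/std/sync/struct.RwLock.html", root).unwrap();
    assert!(local.ends_with("/std/sync/struct.RwLock.html"));
    println!("{local}");
}
```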
I opened https://github.com/rust-lang/docs.rs/issues/845 since this is not related to downloadable docs.
As a long-time Dash user, I would like to reawaken this thread. I've recently come to working in Rust after three years of working in Elixir. The Elixir community managed to solve exactly this problem. The integration with Dash and third-party libraries is IMHO part of the delight of working in that language environment.
I recently completed the Rust 2021 survey and this was my #1 complaint about transitioning into Rust work. I hope the Rust community will take this more seriously.
Don't worry, we haven't forgotten about it. This is currently blocked on https://github.com/rust-lang/docs.rs/issues/1004; if you're interested in helping out, that would be a good place to start.
Random thought: I know the compression stuff is in the works... What about generating a single-file-per-crate compressed archive anyway/separately and storing that for everything going forward? That would be a relatively simple change to the doc build process, would it not? (e.g. last build step: tar+zstd all of the built files; save archive somewhere, s3?)
Even if the needs of the compression project change and/or evolve, the data would still be there and could be easily converted to whatever format is required - most probably much more easily converted than retrieving millions of individual files from s3.
Doing this would:
Side note:
From experience, even using random access capable compressed tar archives, the size is usually much larger (i.e. worse compression ratios). https://github.com/martinellimarco/t2sz does indeed work well but since it has to make each file a ZSTD block in order to make the archive random seekable, the compression is (essentially) the same as using zstd on each individual file and then making a tar of the individual compressed files.
(One anecdotal example: 48 JSON files, each ~4-6MB, total uncompressed size ~260MB. Compressed with `zstd -19 --long=31` the tar.zst is ~1.9MB. Compressed with t2sz (indexed tar) it's ~16MB. That's a pretty big difference.)
Given that https://github.com/rust-lang/docs.rs/pull/1342 is almost done and would help with things besides downloadable docs, I don't think it makes sense to store both archived and unarchived files for the same crates. Note that we'd need to rebuild old crates (https://github.com/rust-lang/docs.rs/issues/464) no matter what for them to have downloadable docs, and that isn't feasible if we have to reupload individual files.
From experience, even using random access capable compressed tar archives, the size is usually much larger (i.e. worse compression ratios).
See https://github.com/rust-lang/docs.rs/pull/1342#issuecomment-861520392, having the smallest possible size is not a goal at the moment.
Since #1342 is progressing slowly, but continuously, I started thinking about downloadable docs again.
From what I see and understand from the code (please correct me if I'm wrong):
- since we build with `--static-root-path=/`, the references to CSS/JS files would be absolute paths starting with `/`, not relative paths, the default.
- since we serve `toolchain-shared` or `unversioned-shared` files separately, we exclude them from the doc-output, so they will be missing in the current archive.

So, the current archives probably won't be usable for downloadable/offline docs, without rewriting all the CSS/JS `href` in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously). So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that does seem like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?
So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that does seem like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?
The problem is that it uses twice the CPU and storage, which doesn't seem like a great use of resources when all we want to do is change some of the links to point elsewhere.
Are changes needed related to intra-doc-links?
Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.
I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.
since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.
It doesn't seem too hard to add them into the archive - there are many fewer shared files than per-crate files, it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.
rewriting all the CSS/JS href in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously).
What do you mean by synchronously? I don't think it would be unreasonable to do this at the time we download the archive.
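The rewrite at download time could be as simple as counting the page's depth in the archive and prepending `../` segments. A naive sketch (the HTML snippet and rewrite rule are illustrative; real docs.rs pages and their static prefixes may differ, and a production version would use an HTML parser rather than string replacement):

```rust
/// Rewrite absolute `href="/..."` and `src="/..."` references in a page
/// to relative paths, based on how deep the page sits in the extracted
/// archive. (A naive string-replacement sketch for illustration only.)
fn rewrite_absolute_refs(html: &str, page_path: &str) -> String {
    // Number of directories between the page and the archive root.
    let depth = page_path.matches('/').count();
    let prefix = "../".repeat(depth);
    html.replace("href=\"/", &format!("href=\"{prefix}"))
        .replace("src=\"/", &format!("src=\"{prefix}"))
}

fn main() {
    let html = r#"<link rel="stylesheet" href="/rustdoc.css">"#;
    // A page one directory deep gets a single `../` prefix.
    let out = rewrite_absolute_refs(html, "sync/struct.RwLock.html");
    assert_eq!(out, r#"<link rel="stylesheet" href="../rustdoc.css">"#);
    println!("{out}");
}
```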
So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that does seem like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?
The problem is that it uses twice the CPU and storage, which doesn't seem like a great use of resources when all we want to do is change some of the links to point elsewhere.
If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)
Are changes needed related to intra-doc-links?
Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.
I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.
If the only use-case is offline docs, perhaps that could work; of course, when we make more changes in how docs are generated for docs.rs, this additional revert-logic has to be adapted too, right?
If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.
since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.
It doesn't seem too hard to add them into the archive - there are many fewer shared files than per-crate files, it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.
👍
rewriting all the CSS/JS href in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously).
What do you mean by synchronously? I don't think it would be unreasonable to do this at the time we download the archive.
Exactly :) Thinking about merging the archive I was initially thinking about a synchronous approach.
If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.
Hmm, I don't understand the distinction you're drawing between "offline docs" and "downloadable docs for everyone". Are you expecting people to run `curl docs.rs/some/api` and use the archive without further changes? I don't really know why you would do that - for me personally, running `cargo doc --no-deps` would be easier than messing with curl.
If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)
Another alternative I suppose is to still only upload one archive when the crate is first built, but when rewriting/adding shared files the first time it's downloaded, also reupload that cached archive so it's faster next time. That saves space for crates whose docs never get downloaded, without making it too slow to rewrite things, and also lets us delete the cache after a while if we don't need it. On the other hand, it seems complicated (since we'd have to deal with multiple requests at a time trying to download the same archive).
If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)
Another alternative I suppose is to still only upload one archive when the crate is first built, but when rewriting/adding shared files the first time it's downloaded, also reupload that cached archive so it's faster next time. That saves space for crates whose docs never get downloaded, without making it too slow to rewrite things, and also lets us delete the cache after a while if we don't need it. On the other hand, it seems complicated (since we'd have to deal with multiple requests at a time trying to download the same archive).
I'll probably give the general rewrite-approach a try and then we can decide on when we do it. Right now rewriting and re-compressing feels like too much load to do in a request, especially for bigger crates, on top of handling parallel requests to the same release. On the other hand, we don't know how many of the crates are actually requested, so how much CPU would be wasted if we rewrite every release.
If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.
Hmm, I don't understand the distinction you're drawing between "offline docs" and "downloadable docs for everyone". Are you expecting people to run `curl docs.rs/some/api` and use the archive without further changes? I don't really know why you would do that - for me personally, running `cargo doc --no-deps` would be easier than messing with curl.
I see your point, ok
Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs. I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.
Coming back to this, I'm not sure what the other official integrations look like with @Kapeli.
Since there is a custom search-index and table of contents in the docsets, there probably will be an additional processing step in any case. If that includes HTML rewrites, the JS/CSS paths could be fixed there, though that doesn't feel like a thing that I would officially serve from the pages as "downloadable docs" since it's broken on its own :)
for me personally, running `cargo doc --no-deps` would be easier than messing with curl.
And that works great if you're on a computer with a GUI (as opposed to SSHed into a terminal without X-forwarding) and have a reasonably powerful computer (and the space to build the docs).
For me, I often use `emacs` or `neovim` across SSH from my phone (or even locally on my phone... thank you Termux)... And I recognize that's somewhat atypical (but it sure is convenient!)
One issue I have personally with the entire Rust language documentation ecosystem is that it's entirely based on the premise that you're doing development on a system with a GUI and a web browser. The output is HTML, graphical, and there isn't (AFAIK) even an option to output to something that is text-mode friendly. (Using a text-mode web browser is a hack-y solution.) This is in stark contrast to documentation systems for other languages, like Sphinx for Python, which will happily generate (using built-in builders) HTML (both single page and individual files), plain text, LaTeX, EPUB, PDF (through LaTeX), Texinfo (for GNU `info`), etc... This is also very different in the Python world because you often don't even need the generated documentation (which is usually made from the docstrings in the source)... If you're in a Python REPL (regular `python`, `ipython`, `jupyter`, etc.), you can access the documentation quite easily with a simple call to `help(numpy.array)`, or from the command line: `pydoc numpy.array`
I recognize that Rust's overall documentation system design is well beyond the scope here... But I do raise it to highlight what appears to be an assumption made in the design of the Rust ecosystem: that developers using Rust are (1) working on powerful systems, (2) with GUIs, and (3) [without downloadable documentation] always connected to the internet.
On a related note, as I'm sure you're aware, there are many developers in other-than-the-richest-countries that might not have constant, continuous, and reliable internet. Beyond the (comparatively) luxurious issue of not being able to access the documentation when working on a laptop in an airplane without wifi, there are those who need to download things because they might have to go to a Cafe or other central point for internet access. Assuming someone has internet all the time as a design decision would tend to make it very difficult for people in that situation.
I don't want to come off as overly critical -- not my intention at all. (Maybe more of a "Hey, have you thought about this other aspect....") And I might be totally off here - I don't know... What are your thoughts?
@danieldjewell IMHO this is outside of the context of this issue, since it's about downloadable HTML docs.
I believe what you want can be covered when the JSON output for rustdoc is stable, at which point much tooling (including docs.rs) can just use the JSON output to generate any format.
Are changes needed related to intra-doc-links?
Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.
I'd prefer them to point to https://docs.rs and I can make Dash check if the other crate is installed locally before actually going to https://docs.rs.
I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.
I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.
since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.
It doesn't seem too hard to add them into the archive - there are many fewer shared files than per-crate files, it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.
@jyn514 so adding these into the archive would be the first step? Then the only thing needed is rewriting the HTML.
I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.
I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.
That's good to hear @Kapeli !
@jyn514 what's your take on having the archive download in a (probably private) endpoint before having downloadable docs for everyone?
what's your take on having the archive download in a (probably private) endpoint before having downloadable docs for everyone?
What would be the purpose? If you just want a couple example archives to test and make sure they can be converted to docsets, I can grab them off S3 manually.
so adding these into the archive would be the first step?
I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.
I think the archives should work fine as is - they still link to the shared files on docs.rs, the shared files just aren't in the archives themselves. But there are few enough shared files that it's possible to download them one by one (around 10 for each nightly toolchain).
Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.
I think it boils down to a use-case question:
- the references to CSS/JS files are absolute paths starting with `/`, so rewriting is needed even when someone would want to use our hosted CSS/JS files.
- someone could just `curl` the archive, extract it and open it locally.

For the case of Dash/Kapeli I assume that rewriting could be done by them, while fetching the static files is additional logic while already rewriting HTML.
Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.
As I understand the use-case, the archives are downloaded and converted one-by-one, not all at once, so putting all the shared static files into one archive doesn't make much sense.
@jyn514 how did you see the feature / the use-case?
@syphar I was imagining this is only for Dash (or other tools that want to preprocess it). Recall we discussed this above (https://github.com/rust-lang/docs.rs/issues/174#issuecomment-917683422).
Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.
As I understand the use-case, the archives are downloaded and converted one-by-one, not all at once, so putting all the shared static files into one archive doesn't make much sense.
So there are three options here:
I'm ok with either 2 or 3. I don't think 1. makes sense given that Dash is preprocessing the files anyway.
After the discussion in discord we'll start with (2):
(3) having the single archive for static files would mean we have to update the archive every day, and re-upload it with the whole history. Also the performance-improvement would be very small since these static files are cached in the CDN anyways.
or 410
not 410, we plan to backfill the archives at some point
Mentoring instructions: add a new `rustdoc_page` route to https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/routes.rs#L10 which matches `/:crate/:version.zip` and fetches the relevant file from S3, using `storage.get_from_archive` (see the example in https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/rustdoc.rs#L350).
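For illustration, the matching logic of such a route boils down to pulling the crate name and version out of the path. A minimal sketch (the function name is made up; the real router would handle this pattern declaratively):

```rust
/// Parse a download path of the form `/:crate/:version.zip` into
/// (crate, version), roughly what the suggested route would match.
/// (Hypothetical helper for illustration only.)
fn parse_download_path(path: &str) -> Option<(&str, &str)> {
    let rest = path.strip_prefix('/')?;
    let (krate, version_zip) = rest.split_once('/')?;
    let version = version_zip.strip_suffix(".zip")?;
    // Reject empty segments and extra path components.
    if krate.is_empty() || version.is_empty() || version.contains('/') {
        return None;
    }
    Some((krate, version))
}

fn main() {
    assert_eq!(parse_download_path("/serde/1.0.136.zip"), Some(("serde", "1.0.136")));
    assert_eq!(parse_download_path("/serde/1.0.136"), None);
    println!("{:?}", parse_download_path("/serde/1.0.136.zip"));
}
```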
Mentoring instructions: add a new `rustdoc_page` route to https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/routes.rs#L10 which matches `/:crate/:version.zip` and fetches the relevant file from S3, using `storage.get_from_archive` (see the example in https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/rustdoc.rs#L350).
Small correction here: it's actually `storage.get`, since we don't want to get something from inside the archive, but the archive itself.
Plus, since archives might be gigabytes in size, it would be good to stream them to the client instead of loading the whole archive into memory on the server.
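Streaming here just means copying in bounded chunks instead of buffering the whole body. A minimal sketch (std's `io::copy` does effectively this internally; it's written out explicitly to show the bounded buffer):

```rust
use std::io::{self, Read, Write};

/// Copy from a reader to a writer in fixed-size chunks, so a
/// gigabyte-scale archive is never held in memory all at once.
fn stream<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024]; // 64 KiB is the only memory ever held
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() -> io::Result<()> {
    // Simulate a large body with an in-memory reader/writer.
    let data = vec![42u8; 200_000];
    let mut out = Vec::new();
    let copied = stream(&data[..], &mut out)?;
    assert_eq!(copied, 200_000);
    assert_eq!(out, data);
    println!("streamed {copied} bytes");
    Ok(())
}
```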
In this part we will see which archive to fetch (the rustdoc one).
Small update here after a chat with @pietroalbini and @jyn514:
- serve the downloads from a separate domain (`static.docs.rs` or `downloads.docs.rs`)
- for `latest` or `*` in the version we could always redirect to the latest version? (TBD)

any updates? this is a trivial issue ongoing for 4 years now...
(3) having the single archive for static files would mean we have to update the archive every day, and re-upload it with the whole history. Also the performance-improvement would be very small since these static files are cached in the CDN anyways.
Is the goal of downloadable docs to improve performance, or to ensure docs are available when offline? I'm assuming the latter.
In that case, it's important for tools that want to download docs to be able to enumerate all the static files that might be needed by a bundle of docs. It's not trivial to enumerate these just by processing HTML, because some are loaded by JS (e.g. search-index).
I think we probably need to start recording a mapping of rustdoc release -> list of static files, and provide that listing as part of the bundle for crate docs built with that release.
Is the goal of downloadable docs to improve performance, or to ensure docs are available when offline? I'm assuming the latter.
yes, the latter. More specifically this issue here is about offline doc readers that have to process the docs anyways to make them usable in their docsets.
In that case, it's important for tools that want to download docs to be able to enumerate all the static files that might be needed by a bundle of docs. It's not trivial to enumerate these just by processing HTML, because some are loaded by JS (e.g. search-index).
Since processing and HTML rewriting is needed anyway right now, the idea was to download the missing assets when needed, where needed. The search-index is invocation-specific and will be in the archive, while I think the offline doc readers wouldn't use our internal search. But that's up to them.
Ah, I misspoke about search-index. Good catch. But the problem exists for settings.js, settings.css, and search.js. They are loaded at runtime by other JS that uses rustdoc-vars to figure out their paths. Perhaps it's true that these pieces of functionality aren't needed by offline doc readers, but it seems like a potential source of fragility / a worrying future bug.
For other, more typical `<script>` and `<link>` tags: do we know it's definitely the case that Dash and other offline doc readers will process all downloaded files to find such files and predownload them?
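For illustration, such a scan over `src`/`href` attributes might look like this (a naive sketch; a real tool should use an HTML parser, and, as noted above, assets loaded dynamically from JS won't be found this way):

```rust
/// Collect the `src`/`href` targets of tags like <script> and <link>
/// with a naive substring scan, roughly what an offline doc reader
/// would do to predownload referenced assets. (Illustrative only.)
fn static_refs(html: &str) -> Vec<String> {
    let mut refs = Vec::new();
    for attr in ["src=\"", "href=\""] {
        let mut rest = html;
        while let Some(start) = rest.find(attr) {
            let after = &rest[start + attr.len()..];
            if let Some(end) = after.find('"') {
                refs.push(after[..end].to_string());
                rest = &after[end..];
            } else {
                break; // unterminated attribute; stop scanning
            }
        }
    }
    refs
}

fn main() {
    let html = r#"<link href="rustdoc.css"><script src="main.js"></script>"#;
    let refs = static_refs(html);
    assert!(refs.contains(&"rustdoc.css".to_string()));
    assert!(refs.contains(&"main.js".to_string()));
    println!("{refs:?}");
}
```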
I'd like to integrate docs.rs inside Dash.
To achieve this, I need a way to download the docs for a package as HTML files. Please consider supporting this.
edit(@jyn514): see https://github.com/rust-lang/docs.rs/issues/174#issuecomment-926885213 for mentoring instructions.