rust-lang / docs.rs

crates.io documentation generator
https://docs.rs
MIT License

Downloadable docs #174

Closed Kapeli closed 1 year ago

Kapeli commented 6 years ago

I'd like to integrate docs.rs inside Dash.

To achieve this, I need a way to download the docs for a package as HTML files. Please consider supporting this.

edit(@jyn514): see https://github.com/rust-lang/docs.rs/issues/174#issuecomment-926885213 for mentoring instructions.

njskalski commented 6 years ago

How about just sharing a database dump, like a backup torrent or whatever? Crates.io surely has a backup, and this way some users would co-host it for free.

ketsuban commented 6 years ago

That's not an acceptable solution - note that @Kapeli isn't just asking for the database to be tossed over the wall; they specifically want to integrate docs.rs into an offline documentation viewer so that users can download the documentation for specific crates as needed. The docset format isn't exclusive to Dash, either - other projects such as Zeal also make use of it.

Kapeli commented 6 years ago

One note: I'm not asking for docs.rs to generate Dash docsets. I'm asking for it to provide downloadable docs.

lilyball commented 5 years ago

I'd really love to see this done. I use Dash for everything else and it's really annoying dealing with third-party Rust crate docs since I can't view them anywhere outside a web browser.

reeze commented 5 years ago

I am really looking forward to this integration.

dmilith commented 5 years ago

Can't wait for both integration and Rust syntax coloring support for Dash snippets :)

libratiger commented 5 years ago

Is there any progress or a new plan?

nhynes commented 5 years ago

It'd also be nice to get the nightly rustc docs without having to build rustc itself. Recursive wget is an (inefficient) workaround, I suppose.

brinsche commented 5 years ago

https://github.com/Robzz/cargo-docset was recently released; maybe that's enough for some people here, or maybe some of the code can be reused and integrated into docs.rs somehow.

nanne007 commented 5 years ago

Any update on this?

pietroalbini commented 4 years ago

It's not too practical to generate a downloadable archive of a crate's documentation, as each file is stored individually on S3.

We'd need to fetch all the files individually and generate an archive of that on the fly, which is not practical for large crates. Preparing an archive at build time and storing it separately would increase our storage costs, and due to the unbounded nature of docs.rs we should try avoiding that.

If y'all have better implementation ideas I'd love to read them.

lilyball commented 4 years ago

Could we prepare an archive at build time but only for crates that are opted in to this using some notion of "important to the community"? For example I'd love to see docs.rs provide a docset for tokei that Dash can keep automatically up-to-date. I don't know who'd provide that curation though. There could be some way to nominate crates and leave it up to the docs.rs maintainers to approve it, or maybe it could be based on traffic to a particular crate's documentation.

Kapeli commented 4 years ago

Could the archive be generated only when requested and have a fixed-size cache where older archives get removed?

Would the archives really be that big though? Docs are generally just text, which compresses very well. You could have a separate archive for the common resources (CSS, images, fonts and so on) and then the docs archives would just be compressed HTML files.

pietroalbini commented 4 years ago

Could the archive be generated only when requested and have a fixed-size cache where older archives get removed?

That's not really feasible, as some crates (like stm32f0) have ~200k HTML files in them. They're all stored on S3, and just listing them took awscli 2 minutes and 43 seconds from the docs.rs server.

Would the archives really be that big though? Docs are generally just text, which compresses very well. You could have a separate archive for the common resources (CSS, images, fonts and so on) and then the docs archives would just be compressed HTML files.

Resources are already deduplicated on S3, and all files will be compressed soon™. Once we do that, storing prebuilt archives will double our storage requirements. Today we can afford that, but thinking long term we'll want to avoid using too much storage.

Kapeli commented 4 years ago

just listing them took awscli 2 minutes and 43 seconds from the docs.rs server

Taking a long time is fine. For API access you can return a message saying the docs archive isn't ready yet and to try again later; for users trying to download the docs from their browser, show a page saying the same thing, maybe a bit nicer with automatic refresh and so on. With a big enough cache size, you could optimise both CPU and disk space needs.
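
To illustrate the idea, here is a minimal, purely hypothetical sketch of such a fixed-size cache: nothing here is docs.rs code, it's just the evict-oldest bookkeeping plus the "ready or still building" answer described above.

```rust
use std::collections::VecDeque;

/// What to tell a client that asks for a crate's docs archive.
enum ArchiveStatus {
    /// The archive is cached and can be served immediately.
    Ready,
    /// The archive is being (re)built: the API answers "try again later"
    /// and the browser gets a page that refreshes automatically.
    Building,
}

/// A byte-capped cache of generated archives; the oldest entries are
/// evicted once the total size exceeds the capacity.
struct ArchiveCache {
    capacity_bytes: u64,
    total_bytes: u64,
    /// (crate name, version, archive size in bytes), oldest first.
    entries: VecDeque<(String, String, u64)>,
}

impl ArchiveCache {
    fn new(capacity_bytes: u64) -> Self {
        Self { capacity_bytes, total_bytes: 0, entries: VecDeque::new() }
    }

    fn lookup(&self, name: &str, version: &str) -> ArchiveStatus {
        let cached = self
            .entries
            .iter()
            .any(|(n, v, _)| n == name && v == version);
        if cached {
            ArchiveStatus::Ready
        } else {
            // A real service would kick off the archive build here.
            ArchiveStatus::Building
        }
    }

    /// Register a freshly built archive, evicting the oldest entries
    /// until the cache fits its budget again.
    fn insert(&mut self, name: String, version: String, size_bytes: u64) {
        self.total_bytes += size_bytes;
        self.entries.push_back((name, version, size_bytes));
        while self.total_bytes > self.capacity_bytes {
            match self.entries.pop_front() {
                Some((_, _, freed)) => self.total_bytes -= freed,
                None => break,
            }
        }
    }
}
```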

jyn514 commented 4 years ago

It'd also be nice to get the nightly rustc docs without having to build rustc itself

@nhynes this is out of scope for docs.rs; we only build user documentation. I'm not sure of the right place to open a new issue, maybe https://github.com/rust-lang/www.rust-lang.org/issues ?

jyn514 commented 4 years ago

We discussed this internally and this probably won't see action at least until Rust All Hands in March.

Personally, I would like to see #379 implemented and #532 merged before we make any decisions, which would let us see how much storage we'll be using in the future.

teohhanhui commented 4 years ago

Not directly related, but it'd be great to turn docs.rs into an offline-first PWA (Progressive Web App). So the user would still be able to browse the docs they have already visited before even when offline, without having to use a separate website or app.

The same could be done for doc.rust-lang.org

Kixiron commented 4 years ago

That can be achieved with cargo doc to build local crates, and rustup doc for the book, std, and everything else on doc.rust-lang.org.

teohhanhui commented 4 years ago

Unfortunately, that's not the best UX. As a user, I want to be able to keep using the familiar links I've already visited from my browser history, and have them still work while offline. Let's be honest: we usually discover things through a search engine, not by browsing through or searching inside the docs.

Another alternative is a browser extension to redirect online version -> offline version, similar to what the IPFS Companion extension does.

For example: https://doc.rust-lang.org/std/sync/struct.RwLock.html -> file:///home/teohhanhui/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/rust/html/std/sync/struct.RwLock.html

jyn514 commented 4 years ago

I opened https://github.com/rust-lang/docs.rs/issues/845 since this is not related to downloadable docs.

scouten commented 3 years ago

As a long-time Dash user, I would like to reawaken this thread. I've recently come to working in Rust after three years of working in Elixir. The Elixir community managed to solve exactly this problem. The integration with Dash and third-party libraries is IMHO part of the delight of working in that language environment.

I recently completed the Rust 2021 survey and this was my #1 complaint about transitioning into Rust work. I hope the Rust community will take this more seriously.

jyn514 commented 3 years ago

Don't worry, we haven't forgotten about it. This is currently blocked on https://github.com/rust-lang/docs.rs/issues/1004; if you're interested in helping out, that would be a good place to start.

danieldjewell commented 3 years ago

Random thought: I know the compression stuff is in the works... What about generating a single-file-per-crate compressed archive anyway/separately and storing that for everything going forward? That would be a relatively simple change to the doc build process, would it not? (e.g. last build step: tar+zstd all of the built files; save archive somewhere, s3?)

Even if the needs of the compression project change and/or evolve, the data would still be there and could be easily converted to whatever format is required - most probably much more easily converted than retrieving millions of individual files from s3.

Doing this would:

  1. Provide an option right now to allow easy access to documentation (e.g. offline/airplane use)
  2. Make access/conversion much easier in the future by having several thousand files to download (and extract locally) instead of millions.
  3. Be easy to delete - if there is an alternate process developed, the compressed archives could easily be deleted.
  4. Possibly make diagnosing problems w/build output easier (like how CI tools archive build artifacts/logs)

Side note:

From experience, even with random-access-capable compressed tar archives, the size is usually much larger (i.e. worse compression ratios). https://github.com/martinellimarco/t2sz does work well, but since it has to make each file its own zstd block to keep the archive randomly seekable, the compression is (essentially) the same as running zstd on each individual file and then making a tar of the compressed files.

(One anecdotal example: 48 JSON files each ~4-6MB - total uncompressed size ~260MB. Compressed with zstd -19 --long=31 the tar.zst is ~1.9MB. Compressed with t2sz (indexed tar) it's ~16MB. That's a pretty big difference.)
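
For concreteness, a rough sketch of what that last build step could look like, using the community tar and zstd crates (this is not docs.rs code; the compression level just mirrors the zstd -19 example above):

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Pack a finished rustdoc output directory into a single `.tar.zst` file.
fn archive_docs(doc_dir: &Path, out_path: &Path) -> io::Result<()> {
    let out = File::create(out_path)?;
    // zstd level 19, matching the `zstd -19` invocation mentioned above.
    let zstd = zstd::Encoder::new(out, 19)?;
    let mut tar = tar::Builder::new(zstd);
    // Store everything under a top-level "doc/" directory in the archive.
    tar.append_dir_all("doc", doc_dir)?;
    // Flush the tar trailer, then finish the zstd frame.
    tar.into_inner()?.finish()?;
    Ok(())
}
```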

jyn514 commented 3 years ago

Given that https://github.com/rust-lang/docs.rs/pull/1342 is almost done and would help with things besides downloadable docs, I don't think it makes sense to store both archived and unarchived files for the same crates. Note that we'd need to rebuild old crates (https://github.com/rust-lang/docs.rs/issues/464) no matter what for them to have downloadable docs, and that isn't feasible if we have to reupload individual files.

From experience, even with random-access-capable compressed tar archives, the size is usually much larger (i.e. worse compression ratios).

See https://github.com/rust-lang/docs.rs/pull/1342#issuecomment-861520392, having the smallest possible size is not a goal at the moment.

syphar commented 3 years ago

Since #1342 is progressing slowly but continuously, I started thinking about downloadable docs again.

From what I see and understand from the code (please correct me if I'm wrong):

  * since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.
  * the CSS/JS hrefs in the HTML files are absolute URL paths, not relative local paths.
  * Are changes needed related to intra-doc-links?

So the current archives probably won't be usable for downloadable/offline docs without

  1. merging the nightly-specific CSS/JS files into them (which probably wouldn't be too hard), or still including the toolchain files in our archives even though we're not using them.
  2. rewriting all the CSS/JS hrefs in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously).
  3. fixing more things?

So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that seems like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?

jyn514 commented 3 years ago

So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that seems like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?

The problem is that it uses twice the CPU and storage, which doesn't seem like a great use of resources when all we want to do is change some of the links to point elsewhere.

Are changes needed related to intra-doc-links?

Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.

I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.

since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.

It doesn't seem too hard to add them into the archive - there are far fewer shared files than per-crate files, so it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.

rewriting all the CSS/JS hrefs in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously).

What do you mean by synchronously? I don't think it would be unreasonable to do this at the time we download the archive.
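
As a toy illustration of such a download-time rewrite (the "/-/rustdoc.static/" prefix here is an assumption for the example, not necessarily the exact layout docs.rs emits):

```rust
/// Toy version of the rewrite discussed above: replace absolute asset URLs
/// with paths relative to a local `static/` directory. The URL prefix is
/// assumed for illustration and may not match what docs.rs actually emits.
fn rewrite_asset_urls(html: &str) -> String {
    html.replace("href=\"/-/rustdoc.static/", "href=\"static/")
        .replace("src=\"/-/rustdoc.static/", "src=\"static/")
}
```

A real implementation would more likely parse the HTML rather than do string replacement, but the shape of the work is the same.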

syphar commented 3 years ago

So to have actually usable downloadable docs, wouldn't we have to run a second, clean doc-build that we would put into another archive for download? To me that seems like less effort and cleaner than rewriting HTML and re-compressing/merging archives, doesn't it?

The problem is that it uses twice the CPU and storage, which doesn't seem like a great use of resources when all we want to do is change some of the links to point elsewhere.

If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)

Are changes needed related to intra-doc-links?

Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.

I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.

If the only use-case is offline docs, perhaps that could work. Of course, whenever we make more changes to how docs are generated for docs.rs, this additional revert logic has to be adapted too, right?

If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.

since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.

It doesn't seem too hard to add them into the archive - there are far fewer shared files than per-crate files, so it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.

👍

rewriting all the CSS/JS hrefs in the HTML files from absolute URL paths to relative local paths (which is not possible synchronously).

What do you mean by synchronously? I don't think it would be unreasonable to do this at the time we download the archive.

Exactly :) When thinking about merging the archives, I was initially considering a synchronous approach.

jyn514 commented 3 years ago

If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.

Hmm, I don't understand the distinction you're drawing between "offline docs" and "downloadable docs for everyone". Are you expecting people to run curl docs.rs/some/api and use the archive without further changes? I don't really know why you would do that - for me personally, running cargo doc --no-deps would be easier than messing with curl.

jyn514 commented 3 years ago

If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)

Another alternative I suppose is to still only upload one archive when the crate is first built, but when rewriting/adding shared files the first time it's downloaded, also reupload that cached archive so it's faster next time. That saves space for crates whose docs never get downloaded, without making it too slow to rewrite things, and also lets us delete the cache after a while if we don't need it. On the other hand, it seems complicated (since we'd have to deal with multiple requests at a time trying to download the same archive).

syphar commented 3 years ago

If that's the only thing, probably true. But for all the logic specific to docs.rs, someone would have to maintain the reverse logic, right? (for example, changing the links to be relative, re-adding the shared files)

Another alternative I suppose is to still only upload one archive when the crate is first built, but when rewriting/adding shared files the first time it's downloaded, also reupload that cached archive so it's faster next time. That saves space for crates whose docs never get downloaded, without making it too slow to rewrite things, and also lets us delete the cache after a while if we don't need it. On the other hand, it seems complicated (since we'd have to deal with multiple requests at a time trying to download the same archive).

I'll probably give the general rewrite approach a try, and then we can decide on when we do it. Right now, rewriting and re-compressing feels like too much load to do within a request, especially for bigger crates, on top of handling parallel requests for the same release. On the other hand, we don't know how many of the crates are actually requested, and so how much CPU would be wasted if we rewrote every release.

syphar commented 3 years ago

If the use-case is also downloadable docs for everyone, that would be a high burden for everyone.

Hmm, I don't understand the distinction you're drawing between "offline docs" and "downloadable docs for everyone". Are you expecting people to run curl docs.rs/some/api and use the archive without further changes? I don't really know why you would do that - for me personally, running cargo doc --no-deps would be easier than messing with curl.

I see your point, ok

syphar commented 3 years ago

Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs. I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.

Coming back to this, I'm not sure what the other official integrations with @Kapeli look like.

Since docsets have a custom search index and table of contents, there will probably be an additional processing step in any case. If that includes HTML rewrites, the JS/CSS paths could be fixed there, though that doesn't feel like something I would officially serve from our pages as "downloadable docs", since it's broken on its own :)

danieldjewell commented 3 years ago

for me personally, running cargo doc --no-deps would be easier than messing with curl.

And that works great if you're on a computer with a GUI (as opposed to being SSHed into a terminal without X-forwarding) and have a reasonably powerful machine (and the space to build the docs).

For me, I often use emacs or neovim over SSH from my phone (or even locally on my phone... thank you Termux). I recognize that's somewhat atypical (but it sure is convenient!)

One issue I have personally with the entire Rust documentation ecosystem is that it's built on the premise that you're doing development on a system with a GUI and a web browser. The output is HTML and graphical, and there isn't (AFAIK) even an option to output something text-mode friendly. (Using a text-mode web browser is a hacky solution.) This is in stark contrast to documentation systems for other languages, like Sphinx for Python, which will happily generate (using built-in builders) HTML (both single-page and individual files), plain text, LaTeX, EPUB, PDF (through LaTeX), Texinfo (for GNU info), and so on. The Python world is also different because you often don't even need the generated documentation (which is usually made from the docstrings in the source): if you're in a Python REPL (regular python, ipython, jupyter, etc.), you can access the documentation with a simple call to help(numpy.array), or from the command line with pydoc numpy.array.

I recognize that Rust's overall documentation system design is well beyond the scope here, but I raise it to highlight what appears to be an assumption made in the design of the Rust ecosystem: that developers using Rust are (1) working on powerful systems, (2) using GUIs, and (3) [without downloadable documentation] always connected to the internet.

On a related note, as I'm sure you're aware, there are many developers outside the richest countries who might not have constant, continuous, and reliable internet. Beyond the (comparatively) luxurious problem of not being able to access the documentation on a laptop in an airplane without wifi, there are people who need to download things because they have to go to a cafe or other central point for internet access. Designing on the assumption that everyone has internet all the time makes things very difficult for people in that situation.

I don't want to come off as overly critical -- not my intention at all. (Maybe more of a "Hey, have you thought about this other aspect....") And I might be totally off here - I don't know... What are your thoughts?

syphar commented 3 years ago

@danieldjewell IMHO this is outside the scope of this issue, since this one is about downloadable HTML docs.

I believe what you want will be covered once rustdoc's JSON output is stable; at that point, tooling (including docs.rs) can use the JSON output to generate any format.

Kapeli commented 2 years ago

Are changes needed related to intra-doc-links?

Hmm, most likely yes. @Kapeli how would you expect relative links to other crates to work? Currently they'd point to https://docs.rs, which seems to defeat the point of downloadable docs.

I'd prefer them to point to https://docs.rs and I can make Dash check if the other crate is installed locally before actually going to https://docs.rs.

I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.

I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.

syphar commented 2 years ago

since we serve toolchain-shared or unversioned-shared files separately, we exclude them from the doc-output, so they will be missing in the current archive.

It doesn't seem too hard to add them into the archive - there are far fewer shared files than per-crate files, so it seems ok to do a bit of preprocessing for them on the docs.rs server. Alternatively, we could expose an API for "all shared files" and make it the responsibility of the client to combine them properly.

@jyn514 so adding these into the archive would be the first step? Then the only thing needed is rewriting the HTML.

I think it might also be fine to make this the responsibility of whatever tool is packaging the docs into docsets; it's not that hard to parse HTML, there are lots of libraries for it.

I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.

That's good to hear @Kapeli !

@jyn514 what's your take on having the archive download on a (probably private) endpoint before having downloadable docs for everyone?

jyn514 commented 2 years ago

what's your take on having the archive download on a (probably private) endpoint before having downloadable docs for everyone?

What would be the purpose? If you just want a couple of example archives to test and make sure they can be converted to docsets, I can grab them off S3 manually.

so adding these into the archive would be the first step?

I prefer it if you just archive what you have now and I'll fix/rewrite/clean any issues I encounter.

I think the archives should work fine as is - they still link to the shared files on docs.rs; the shared files just aren't in the archives themselves. But there are few enough shared files that it's possible to download them one by one (around 10 for each nightly toolchain).

Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.

syphar commented 2 years ago

I think it boils down to a use-case question:

  1. should this be more or less access to the internal build result (the archive)? Any user of this archive would have to rewrite all references to shared CSS or JS files and download those too. The links currently point to /, so rewriting is needed even if someone wanted to use our hosted CSS/JS files.
  2. Or should the archive be more self-contained and directly usable? Then anyone could curl the archive, extract it, and open it locally.

For the Dash/Kapeli case, I assume the rewriting could be done on their side, and fetching the static files is only a bit of additional logic once the HTML is being rewritten anyway.

Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.

As I understand the use-case, the archives are downloaded and converted one by one, not all at once, so putting all the shared static files into one archive doesn't make much sense.

@jyn514 how did you see the feature / the use-case?

jyn514 commented 2 years ago

@syphar I was imagining this is only for Dash (or other tools that want to preprocess it). Recall we discussed this above (https://github.com/rust-lang/docs.rs/issues/174#issuecomment-917683422).

Hmm, I wonder if it makes sense to have a separate archive that has all the shared files at once, so you only have to make one request instead of ~15000.

As I understand the use-case, the archives are downloaded and converted one by one, not all at once, so putting all the shared static files into one archive doesn't make much sense.

So there are three options here:

  1. Put the shared files into each archive. This duplicates the files for every release, specifically so that downloadable docs are more convenient; docs.rs doesn't need them in the archives.
  2. Don't archive the shared files at all; require them to be downloaded one at a time. This is the current situation. It works well for viewing a single crate, but if you're downloading the shared files in bulk you need to make quite a lot of requests.
  3. Have a single archive for all the shared files; this allows downloading multiple shared files at once. This helps docs.rs too, because it can cache the whole archive locally, which means far fewer requests to S3 (recall that shared files are needed on every single rustdoc page).

I'm ok with either 2 or 3. I don't think 1 makes sense given that Dash is preprocessing the files anyway.

syphar commented 2 years ago

After the discussion on Discord, we'll start with (2).

syphar commented 2 years ago

(3), having a single archive for the static files, would mean we'd have to update the archive every day and re-upload it with the whole history. Also, the performance improvement would be very small, since these static files are cached in the CDN anyway.

jyn514 commented 2 years ago

or 410

not 410, we plan to backfill the archives at some point

jyn514 commented 2 years ago

Mentoring instructions: add a new rustdoc_page route to https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/routes.rs#L10 which matches /:crate/:version.zip and fetches the relevant file from S3, using storage.get_from_archive (see the example in https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/rustdoc.rs#L350).

syphar commented 2 years ago

Mentoring instructions: add a new rustdoc_page route to

https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/routes.rs#L10 which matches /:crate/:version.zip and fetches the relevant file from S3, using storage.get_from_archive (see the example in

Small correction here: it's actually storage.get, since we don't want to get something from inside the archive, but the archive itself.

Plus, since archives might be gigabytes in size, it would be good to stream them to the client rather than loading the whole archive into memory on the server.

https://github.com/rust-lang/docs.rs/blob/8e35ec47e4c46bea13fb33ada4d25887de93bdc3/src/web/rustdoc.rs#L350).

In this part we will see which archive to fetch (the rustdoc one).
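
To make that concrete, here is a framework-agnostic sketch of the corrected approach; the Storage trait, the rustdoc_archive_path helper, and the path layout below are stand-ins for illustration, not the actual docs.rs internals or signatures:

```rust
use std::io::{self, Read, Write};

/// Stand-in for the docs.rs storage backend (S3 or local filesystem);
/// the real type and method names differ.
trait Storage {
    /// Returns a streaming reader for the object stored at `path`.
    fn get_stream(&self, path: &str) -> io::Result<Box<dyn Read>>;
}

/// Assumed object layout for this sketch: "rustdoc/{name}/{version}.zip".
fn rustdoc_archive_path(name: &str, version: &str) -> String {
    format!("rustdoc/{}/{}.zip", name, version)
}

/// Handle `GET /:crate/:version.zip`: stream the whole rustdoc archive to
/// the client instead of buffering it in memory on the server.
fn serve_crate_archive(
    storage: &dyn Storage,
    name: &str,
    version: &str,
    response: &mut dyn Write,
) -> io::Result<u64> {
    let mut reader = storage.get_stream(&rustdoc_archive_path(name, version))?;
    // io::copy moves data through a fixed-size buffer, so even archives
    // that are gigabytes in size never have to fit in memory.
    io::copy(&mut reader, response)
}
```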

syphar commented 2 years ago

Small update here after a chat with @pietroalbini and @jyn514

mabbamOG commented 2 years ago

Any updates? This is a trivial issue that has been ongoing for 4 years now...

jsha commented 1 year ago

(3), having a single archive for the static files, would mean we'd have to update the archive every day and re-upload it with the whole history. Also, the performance improvement would be very small, since these static files are cached in the CDN anyway.

Is the goal of downloadable docs to improve performance, or to ensure docs are available when offline? I'm assuming the latter.

In that case, it's important for tools that want to download docs to be able to enumerate all the static files that might be needed by a bundle of docs. It's not trivial to enumerate these just by processing HTML, because some are loaded by JS (e.g. search-index).

I think we probably need to start recording a mapping of rustdoc release -> list of static files, and provide that listing as part of the bundle for crate docs built with that release.
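
As a purely illustrative sketch of what such a mapping could look like (using serde; none of these names exist in docs.rs today):

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical manifest shipped next to a crate's docs archive, listing
/// every shared static file the bundled pages may load, including the ones
/// pulled in at runtime by JS (settings.js, search.js, ...).
#[derive(Serialize, Deserialize)]
struct StaticFileManifest {
    /// The rustdoc release (toolchain) that built these docs.
    rustdoc_release: String,
    /// Shared static files, as paths relative to the docs.rs root.
    static_files: Vec<String>,
}
```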

syphar commented 1 year ago

Is the goal of downloadable docs to improve performance, or to ensure docs are available when offline? I'm assuming the latter.

Yes, the latter. More specifically, this issue is about offline doc readers that have to process the docs anyway to make them usable in their docsets.

In that case, it's important for tools that want to download docs to be able to enumerate all the static files that might be needed by a bundle of docs. It's not trivial to enumerate these just by processing HTML, because some are loaded by JS (e.g. search-index).

Since processing and HTML rewriting are needed anyway right now, the idea was to download the missing assets when and where needed. The search index is invocation-specific and will be in the archive, though I think the offline doc readers wouldn't use our internal search anyway. But that's up to them.

jsha commented 1 year ago

Ah, I misspoke about search-index; good catch. But the problem exists for settings.js, settings.css, and search.js: they are loaded at runtime by other JS that uses rustdoc-vars to figure out their paths. Perhaps these pieces of functionality aren't needed by offline doc readers, but it seems like a potential source of fragility and a worrying future bug.

For other, more typical <script> and <link> tags: do we know for certain that Dash and other offline doc readers will process all downloaded files to find such references and predownload them?