carols10cents closed this issue 5 years ago
Ok, so in thinking about the implementation of this a bit more I'm starting to wonder if this is worth implementing right now. It might be, but I have concerns.
So implementing this as proposed requires looking at all the files in a .crate archive. We currently have a script, render-readmes, that we can run to rerender readmes for all versions of all crates. The script currently downloads the .crate files from s3, untars them, and extracts the README file for rendering. We run this script whenever we change the implementation of the readme rendering, but not on a scheduled basis or anything.
Theoretically, once we implement however we're going to count documentation, that computation shouldn't change (at least I don't expect it to change as often as README rendering does). So I don't think we'll need a "re-assess documentation" script, or at least it won't be as essential as the render-readmes script is.
Ideally, processing a version to count its documentation should mean getting its .crate archive, untarring it, counting the lines of docs and code, and storing those numbers associated with the version, never having to do that again for that version of that crate. I'm not concerned with the resources needed to do that once, in the background.
I don't necessarily want this to happen on the publish thread, though, and if this calculation fails, it shouldn't reject the publishing of a crate.
So perhaps what we need is a script, scheduled to run every so often (every few hours? once a day?), that finds all the versions that don't have this data, computes it, and stores it.
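To make that a bit more concrete, here's a rough sketch of what such a job could look like; every type and function below (Version, DocCounts, versions_without_doc_counts, and so on) is a hypothetical stand-in, not existing crates.io code:

```rust
use std::error::Error;

/// Hypothetical stand-in for a published version row.
struct Version {
    crate_name: String,
    num: String,
}

/// Hypothetical record of the line counts we'd store per version.
struct DocCounts {
    doc_comment_lines: u64,
    readme_lines: u64,
    rust_loc: u64,
}

// Stubs standing in for "query the DB for versions missing counts",
// "download + untar the .crate archive and count its lines", and
// "write the counts back to the database".
fn versions_without_doc_counts() -> Result<Vec<Version>, Box<dyn Error>> {
    Ok(vec![])
}
fn count_lines(_version: &Version) -> Result<DocCounts, Box<dyn Error>> {
    unimplemented!()
}
fn store_doc_counts(_version: &Version, _counts: &DocCounts) -> Result<(), Box<dyn Error>> {
    unimplemented!()
}

/// Sketch of the periodic job: find versions without documentation counts,
/// compute them once, and persist them so they never need recomputing.
fn assess_documentation() -> Result<(), Box<dyn Error>> {
    for version in versions_without_doc_counts()? {
        let counts = count_lines(&version)?;
        store_doc_counts(&version, &counts)?;
        println!("stored doc counts for {} {}", version.crate_name, version.num);
    }
    Ok(())
}

fn main() {
    // With the empty stub above, this just runs to completion.
    assess_documentation().expect("assessment job failed");
}
```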
Thinking longer term, though, there are potentially other analyses we'd like to do on the files in a crate, and it'd be nice if we only had to have/untar the .crate files once, then feed that to all the different analyses when a new crate is published. This could even be an entirely separate service! I think npm has this idea of "followers"/microservices that get a stream of data from the registry as packages are published to be able to do whatever with that data, without affecting the main registry. I don't know if we're at the scale to need that yet, but I hope we will be someday-- so does that mean we implement something like that now, so that we don't have to extract this later? I don't know! Tradeoffs!
TL;DR: I'm struggling with whether adding the complexity of this feature, or of the infrastructure to support this feature and others like it, to crates.io's codebase is a good idea at this point in crates.io's scale/maturity.
So I've taken a look at this issue in RFC 1824. The degree to which Rust code is well documented is expressed as a score out of 1, with scores closer to 1 being better documented. The calculation in the RFC was:

(number of doc comment lines + number of README lines) / (LOC in Rust - number of doc comment lines)
@carols10cents, I assume the LOC in Rust includes blank lines? I'm a bit worried that the RFC uses a precision of 2; I feel it should be higher. Maybe we could even use f32 or f64, since both are supported in Postgres.
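To make the calculation concrete, here is a minimal sketch in Rust of how the score could be computed from the raw counts, keeping an f64 result; documentation_score is a hypothetical helper, not existing crates.io code:

```rust
/// Hypothetical helper: computes the RFC 1824 documentation score from raw
/// line counts, keeping full f64 precision rather than rounding to 2 places.
fn documentation_score(doc_comment_lines: u64, readme_lines: u64, rust_loc: u64) -> Option<f64> {
    // Lines of Rust that are not doc comments.
    let non_doc_lines = rust_loc.checked_sub(doc_comment_lines)?;
    if non_doc_lines == 0 {
        // Avoid dividing by zero for a crate that is (almost) nothing but docs.
        return None;
    }
    let documented = (doc_comment_lines + readme_lines) as f64;
    Some(documented / non_doc_lines as f64)
}

fn main() {
    // 200 doc comment lines, 50 README lines, 1000 total lines of Rust:
    // (200 + 50) / (1000 - 200) = 0.3125
    println!("{:?}", documentation_score(200, 50, 1000));
}
```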
Regarding your points @carols10cents:
So implementing this as proposed requires looking at all the files in a .crate archive. We currently have a script, render-readmes, that we can run to rerender readmes for all versions of all crates. The script currently downloads the .crate files from s3, untars them, and extracts the README file for rendering. We run this script whenever we change the implementation of the readme rendering, but not on a scheduled basis or anything.
Think we could leave the recalculation script out of scope? Or did we want that as a task in this issue?
So perhaps what we need is a script, scheduled to run every so often (every few hours? once a day?), that finds all the versions that don't have this data, computes it, and stores it.
Me being me, I was thinking we could run this script or service every time we published a crate to keep the metrics up to date ASAP :). Would that be a tad ambitious?
I can already see one issue: if we continuously update the DB with scores while new versions and crates are being published past a certain rate, it becomes difficult or even impossible to keep real-time track of which crates are in the top 20%, since new data keeps shifting the scores just above the cutoff point.
Thinking longer term, though, there are potentially other analyses we'd like to do on the files in a crate, and it'd be nice if we only had to have/untar the .crate files once, then feed that to all the different analyses when a new crate is published. This could even be an entirely separate service! I think npm has this idea of "followers"/microservices that get a stream of data from the registry as packages are published to be able to do whatever with that data, without affecting the main registry. I don't know if we're at the scale to need that yet, but I hope we will be someday-- so does that mean we implement something like that now, so that we don't have to extract this later? I don't know! Tradeoffs!
I would like to learn how to set up a microservice :). That and perhaps it'd make the crates.io backend less monolithic? However, would this also mean that it becomes harder to set this up on private crates.io instances and local development environments? If this is a problem, maybe we could use Docker or another containerisation technology to manage this for us.
Then again, a senior architect once told me of the YAGNI philosophy (you ain't gonna need it), which could be applied to this microservice idea. What other analyses were you thinking of running on extracted codebases?
npm's followers idea sounds a bit like this issue regarding push infrastructure for crates.io :). Would you say there's a link?
Hmm this task is starting to look bigger than I thought it'd be :). And yes, I'd be happy to take this task up once we've decided what needs to be done :).
I just finished my last exam :). What were your thoughts on this so far, @carols10cents?
So it's been pointed out to me on IRC by insaneinside that the core logic is prone to being gamed by uploading a project with many empty doc comments, and that a potential way around that is to strip the whitespace from each doc comment line and only count it if anything is left. However, even that can be circumvented by someone uploading files with gibberish doc comments, e.g. Lorem Ipsum.
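For what it's worth, a tiny sketch of that whitespace-stripping check, assuming we counted doc comment lines ourselves in Rust instead of with grep (is_nonempty_doc_comment is purely illustrative):

```rust
/// Illustrative check: count a doc comment line as documentation only if
/// something is left after stripping the marker and surrounding whitespace.
fn is_nonempty_doc_comment(line: &str) -> bool {
    let trimmed = line.trim_start();
    let rest = trimmed
        .strip_prefix("///")
        .or_else(|| trimmed.strip_prefix("//!"));
    match rest {
        Some(text) => !text.trim().is_empty(),
        None => false, // not a line doc comment at all
    }
}

fn main() {
    assert!(is_nonempty_doc_comment("/// Adds two numbers."));
    assert!(!is_nonempty_doc_comment("///   "));
    assert!(!is_nonempty_doc_comment("let x = 1;"));
}
```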
Until we see evidence of crates gaming this, do you think it's worth worrying about at all? Any crate that has a bunch of empty lines or lorem ipsum as documentation is going to end up avoided either way.
I don't think we're going to end up doing this.
Part of RFC 1824, rust-lang/rust#41616.
"Well Documented" badge implementation
For each crate published, in a background job, unpack the crate files and calculate the ratio of lines of documentation to lines of code as follows:
# Count doc comment lines (/// and //!) across all .rs files:
grep -r "//[!/]" --binary-files=without-match --include=*.rs . | wc -l
# Count total lines of Rust (wc prints a per-file breakdown plus a total):
find . -name '*.rs' | xargs wc -l
We would then add the lines in the README to the lines of documentation, subtract the lines of documentation from the total lines of code, and divide the lines of documentation by the lines of non-documentation in order to get the ratio of documentation to code. Test code (and any documentation within test code) is part of this calculation.
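For example (with made-up numbers), a crate with 300 lines of doc comments, a 100-line README, and 2,000 total lines of Rust would score (300 + 100) / (2000 - 300) ≈ 0.24.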
Any crate getting in the top 20% of all crates would get a badge saying "well documented".
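As a rough illustration of how the top-20% cutoff could be derived once every crate has a score (badge_cutoff is hypothetical; in practice this would presumably be a database query):

```rust
/// Illustrative: given every crate's documentation score, return the minimum
/// score a crate needs in order to be in the top 20% ("well documented").
fn badge_cutoff(mut scores: Vec<f64>) -> Option<f64> {
    if scores.is_empty() {
        return None;
    }
    // Sort descending; f64 isn't Ord, so fall back to partial_cmp.
    scores.sort_by(|a, b| b.partial_cmp(a).unwrap_or(std::cmp::Ordering::Equal));
    // Number of crates that fit inside the top 20%, rounding up.
    let top = ((scores.len() as f64) * 0.2).ceil() as usize;
    scores.get(top.saturating_sub(1)).copied()
}

fn main() {
    let scores = vec![0.9, 0.1, 0.5, 0.7, 0.3];
    // Top 20% of 5 crates is 1 crate, so the cutoff is the highest score.
    assert_eq!(badge_cutoff(scores), Some(0.9));
}
```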
Please let me know if you have any questions, potential implementers!