readthedocs / readthedocs.org

The source code that powers readthedocs.org
https://readthedocs.org/
MIT License

File tree diff API #11319

Open stsewd opened 3 months ago

stsewd commented 3 months ago

What's the problem this feature will solve?

Currently we can't link users to a specific page that has changed in a PR preview, nor can we suggest redirects for files that were renamed or deleted.

Describe the solution you'd like

File tree diff (FTD) is a feature that allows users to see the differences between the file trees of the generated documentation from two versions. This lets users see which files were deleted or added; it can also list the files that were changed and sort them by the number of lines changed in each file.

I haven't done much research about this, but diffing two file trees should be a solved problem. Worst case, we can do a manual diff using a set for the file tree, and use the unix diff command for the files. We could expose this as an API.
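The "manual diff using a set" fallback could look something like this minimal sketch in Python (function and parameter names are illustrative, not an actual implementation):

```python
from pathlib import Path


def file_tree_diff(old_root, new_root, pattern="*.html"):
    """Compare two file trees by relative path using set operations.

    Returns which files were added, deleted, or exist in both trees.
    This only compares paths, not contents.
    """
    old_files = {p.relative_to(old_root) for p in Path(old_root).rglob(pattern)}
    new_files = {p.relative_to(new_root) for p in Path(new_root).rglob(pattern)}
    return {
        "added": sorted(str(p) for p in new_files - old_files),
        "deleted": sorted(str(p) for p in old_files - new_files),
        "common": sorted(str(p) for p in old_files & new_files),
    }
```

Files present in both trees would then still need a content comparison (hashes or a line diff) to detect modifications.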

Note that this is a different product from the diff feature for pull requests, which is focused on diffing the HTML content itself; here we just care about which files changed and how many lines were changed.

A basic example would be:

Current content:

New content:

Our API will list the files that were removed and the ones that were added, we can scope it to just track HTML files to start, and maybe limit the number of files returned.
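Hypothetically, the response could be shaped something like the dict below. The endpoint path and every field name here are assumptions for illustration, not a settled design:

```python
# Hypothetical response shape for a file tree diff API call such as
# GET /api/v3/projects/<slug>/filetreediff/?base=latest&head=<pr-version>
# All names are illustrative, not the actual API design.
example_response = {
    "added": ["new-page.html"],
    "deleted": ["old-page.html"],
    "modified": [
        # Sorted by lines_changed descending, so tools that touch every
        # page on each build (e.g. updating a commit hash in the footer)
        # sink to the bottom of the list.
        {"path": "guide/install.html", "lines_changed": 42},
        {"path": "index.html", "lines_changed": 1},
    ],
}
```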

There may be some tools that change all the pages on each build (like updating the commit on each file); that's where the sorting by number of lines changed comes into play.

Some features ideas that we can build on top:

Alternative solutions

We have discussed other solutions for this in the past, but they rely on the source files, not the generated files. That's a problem, since our serving and redirects work over the generated HTML files.

Alternative names

Additional context

humitos commented 3 months ago

All of this sounds like a really good idea to me. I see how we can implement some nice features on top of this data 👍🏼

I haven't done much research about this, but diffing two file trees should be a solved problem. Worst case, we can do a manual diff using a set for the file tree, and use the unix diff command for the files.

We may need to do this on S3 if that's possible. Otherwise, we will need to rclone the two versions first (which may take a lot of time -- however, we may be able to filter by *.html). Something along these lines would be a good first step to research, in my mind.

humitos commented 2 months ago

We talked about using the file hash that S3 can return via its API: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObjectAttributes.html and comparing it with the hash we can calculate locally.
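Computing the local side of that comparison is straightforward with hashlib. One caveat worth noting: S3's ETag equals the plain MD5 of the object only for single-part, unencrypted uploads; multipart uploads use a different scheme, so an explicit checksum from GetObjectAttributes is more reliable. A sketch:

```python
import hashlib


def local_md5(path, chunk_size=8 * 1024 * 1024):
    """MD5 of a local file, streamed in chunks to bound memory use.

    Caveat: S3's ETag matches this value only for single-part,
    unencrypted uploads; for multipart uploads, compare an explicit
    checksum returned by GetObjectAttributes instead.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```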

ericholscher commented 2 months ago

Next step here is to do some research and figure out what the best approach will be. If it looks pretty straightforward, we can move forward with implementation after that.

stsewd commented 1 month ago

rclone provides a way to get the diff between two sources, remote or local.

https://rclone.org/commands/rclone_check/

$ ls a
changed.txt  new.txt  unchanged.txt
$ ls b
changed.txt  deleted.txt  unchanged.txt
$ rclone check --combined=out.txt /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
$ cat out.txt
+ new.txt
- deleted.txt
= unchanged.txt
* changed.txt

This allows us to compare two versions that are already in S3, or to compare files that are about to be uploaded with a version that's already in S3. The latter can be useful for PR builds, where we know a useful diff is between the PR preview and the latest version. That should save a round trip to S3.

There is also the --include option, so we can filter to HTML files only: --include=*.html
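The --combined output shown above is easy to post-process into the categories our API would expose. Per rclone's check docs, `+` means only in the source, `-` only in the destination, `*` present in both but different, and `=` identical. A parsing sketch (the function name and result keys are illustrative):

```python
def parse_rclone_combined(text):
    """Parse `rclone check --combined` output into diff categories.

    Marker meanings (from rclone's docs): `+` only in source,
    `-` only in destination, `*` differs, `=` identical.
    """
    markers = {"+": "added", "-": "deleted", "*": "modified", "=": "unchanged"}
    result = {name: [] for name in markers.values()}
    for line in text.splitlines():
        if not line.strip():
            continue
        marker, _, path = line.partition(" ")
        if marker in markers:
            result[markers[marker]].append(path)
    return result
```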

ericholscher commented 1 month ago

Nice, that seems like a great approach to start with. I assume it's still hitting a bunch of S3 calls, but that's probably fine: builds happen far less frequently than doc serving, so it'll be a small number of total queries in comparison :+1:

stsewd commented 4 weeks ago

I assume that it's still hitting a bunch of S3 calls

Yeah, there is no way around that. Results will be cached, of course (and the diff will be generated on demand).

If we are good with that, I can start writing a small design doc with the interface of the API and how it could integrate with PR builds.

ericholscher commented 4 weeks ago

I think it's a great place to start. We can figure out the architecture and workflow, and if we hit scaling issues we can always reevaluate that part, but the work to integrate it is definitely worth pushing forward on.

humitos commented 4 weeks ago

We talked about using the file hash that S3 can return via its API: docs.aws.amazon.com/AmazonS3/latest/API/API_GetObjectAttributes.html and comparing it with the hash we can calculate locally.

@stsewd have you checked this idea? Any thoughts, pros/cons?

rclone.org/commands/rclone_check

Would this give us the number of lines changed on each file or only added/modified/deleted?

stsewd commented 4 weeks ago

@stsewd have you checked this idea? Any thoughts, pros/cons?

rclone doesn't download the files to do the checks.

Checks the files in the source and destination match. It compares sizes and hashes (MD5 or SHA1) and logs a report of files that don't match. It doesn't alter the source or destination. If you supply the --download flag, it will download the data from both remotes and check them against each other on the fly. This can be useful for remotes that don't support hashes or if you really want to check all the data.

Would this give us the number of lines changed on each file or only added/modified/deleted?

No; to get the number of lines that changed, we need to download the files.
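Once both versions of a file are downloaded, counting changed lines is easy with Python's difflib. This is a sketch of what that step could look like, not something rclone provides:

```python
import difflib


def count_changed_lines(old_text, new_text):
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        lineterm="",
    )
    changed = 0
    for line in diff:
        # Skip file headers (---/+++) and hunk markers (@@) first,
        # since the headers also start with "+" or "-".
        if line.startswith(("+++", "---", "@@")):
            continue
        if line.startswith(("+", "-")):
            changed += 1
    return changed
```

This count is what the API could use to sort modified files, pushing tool-generated whole-site churn to the bottom of the list.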