I followed Manim's own graphical unit tests, since they have a really good underlying testing framework for comparing both the final result and the video frame by frame.
@abhijithnraj I started looking at Manim's testing approach, but it was looking more complex than I could make use of in a reasonable timeframe. There may well be a better approach to verifying the output of test runs, but that should be somewhat straightforward to implement when the time is right. I'm a big fan of "good enough" when getting an initial test suite off the ground. I also haven't touched testing for animation output at this point.
For the record, pytest makes it really straightforward to choose which tests you want to run. I'm assuming the video tests would be a lot slower than the image tests. You can run the animation tests less often than the image tests. As the test suite grows, you can choose which tests to run at various stages of the development cycle.
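For example (just a sketch, using a hypothetical "animation" marker), the slower video tests could be marked and then skipped by default with pytest -m "not animation", or run on their own with pytest -m animation:

import pytest


@pytest.mark.animation
def test_log_video(tmp_repo):
    # Hypothetical slower test that renders and compares video output.
    ...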
No problem! "92 files changed" was a nice motivation to land that flag, right? :)
LOL 😜 and thank u!
I'm a big fan of "good enough" when getting an initial test suite off the ground. I also haven't touched testing for animation output at this point.
Agree. Not sure we even need animation testing at any point, since the image file output should be identical to the final video frame. It's not 100% foolproof, since theoretically you can get to the same final state through multiple paths, but it's likely to be good enough, especially considering how much longer it would take to generate and compare the animated video files.
Okay, I'm really skeptical of trying to generate the sample repo on every test run. I just made the changes to use git-dummy during each run, on the use_gitdummy branch in my fork. (This is a branch off of the start_e2e_testing branch.)
I ran the test suite once, and all three tests failed as expected. All of the images looked correct, so I copied those over as reference files. Then I ran the test suite three more times, and each time one test would fail. Here's the command that's using git-dummy to generate the sample repo:
cmd = f"{git_dummy_path} --commits=10 --branches=4 --merge=1 --constant-sha --name=sample_repo"
This creates an interesting repo, which is good for testing. But I think there are two non-deterministic aspects. One, I believe the merge that happens is random each time. Two, and maybe related to that, it is generating different hashes at times.
I'm skeptical because as you start to model more complex interactions I think you're going to see more issues like this pop up. git-dummy is an interesting and useful idea, but I'm not sure it's worth trying to keep that project as robust as it needs to be to support this project. Then again, I'm not in the weeds in git-dummy, so maybe there's a straightforward way to address this. Maybe a no-random flag or something?
I'm not in any rush, and this only affects the tmp_repo() fixture in conftest.py. It's not difficult to trade out approaches for a while longer.
I meant to add that you should be able to try this out by running:
git pull origin use_gitdummy
and then running pytest three times, without the -x flag; you want it to generate all three image files on each test run.
I tried removing --merge= from the git-dummy call, but that still didn't fix things. Maybe --constant-sha is not being used when creating branches?
I have to step away from this for most of the day, but if I'm quiet I'll be back tonight or tomorrow.
Ah shoot, I think there is a random element to where the branches diverge. BUT, this can be set using the existing git-dummy flag --diverge-at=X, where X is the index of the commit on the base branch to base the new branches off of.
Adding it to the end of your command:
cmd = f"{git_dummy_path} --commits=10 --branches=4 --merge=1 --constant-sha --name=sample_repo --diverge-at=2"
This should make all the branches diverge from the main branch at the 2nd commit. But you can choose whichever X works best, i.e. gives the repo the characteristics you want.
Merges (when using the --merge=Y,Z option) should always happen at the final commit of the branch(es), so I don't think that should affect any of the SHAs.
Hopefully that should do it.
That does seem to fix it on my end. I'll try this on Windows later, and do a little more cleanup, and then update the PR.
Are there any changes you would make to that git-dummy command, to facilitate any further testing you anticipate?
Let's see... I think what you have now (the one above) should be good enough to start. With that we should be able to simulate the following commands:
Commands that will likely need to wait / might need special accommodations:
add (with no args for now, but later we can add a feature in git-dummy to create untracked/staged files)
But this is part of the problem with the current approach. When you modify the command, can you reliably do that without affecting any of the previous output? I'm anticipating someone having to regenerate all of the reference files every time there's a change made to the git-dummy call.
That is why my initial thought was to have a git-dummy call specific to each test case, that generates a minimum viable repo needed for that case without impacting any others, and have the repo cleaned up after each test case. Since generating these test repos is a lightweight operation it should perform ok. But I think that's a point we can get to incrementally.
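As a rough sketch of that idea (the fixture, default flags, and working-directory behavior here are placeholders, not actual project code), each test could request exactly the repo it needs:

import os
import subprocess

import pytest


@pytest.fixture
def dummy_repo(tmp_path):
    """Build a minimal git-dummy repo for a single test inside pytest's tmp_path."""

    def _make(extra_args=""):
        cmd = f"git-dummy --commits=5 --constant-sha --name=test_repo {extra_args}"
        subprocess.run(cmd.split(), cwd=tmp_path, check=True)
        return tmp_path / "test_repo"

    return _make


def test_merge(dummy_repo):
    repo = dummy_repo("--branches=2 --merge=1 --diverge-at=2")
    os.chdir(repo)
    # ... run git-sim merge here and compare against this test's own reference image.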
To kick things off I think we can stick to what we have and use a single repo like we've been discussing. This will enable us all to play with the workflow and see how it feels. Then as we create more specific test cases, especially ones that cover sub-scenarios within each command (for example regular merge vs fast-forward merge, checkouts/switches to diverged vs same-line branches, etc) we can break out those test cases with their own customized git-dummy repos as needed.
If need be we can also come up with a programmatic way to have subsets of test cases all use a single repo, to avoid regenerating the same repo over and over, but again this is a problem for a bit down the line.
But this is part of the problem with the current approach.
Last thing - wouldn't a static sample repo have the same problem? If it needed to be changed in some way to accommodate a new scenario/test case, we'd have to regenerate all the assets as well.
That is why my initial thought was to have a git-dummy call specific to each test case, that generates a minimum viable repo needed for that case without impacting any others, and have the repo cleaned up after each test case. Since generating these test repos is a lightweight operation it should perform ok. But I think that's a point we can get to incrementally.
That's pretty interesting. There's a very clear benefit of not having an extra 100+ file sample repo in the test resources.
I'm still a little hesitant about it though, because I could see this splintering into groups of tests that are more driven by issues with git-dummy than how the tests should be grouped for the purposes of testing git-sim.
None of these are dealbreakers though, and I think you'll be quite fine going in either direction. You could also write a script that runs the test suite, expects a whole bunch of failures, and then rewrites the output files as the new reference files. But as I write that out, that sounds like a terrible idea! Tests really need to be stable to be reliable. :/
Last thing - wouldn't a static sample repo have the same problem? If it needed to be changed in some way to accommodate a new scenario/test case, we'd have to regenerate all the assets as well.
Oh, I don't know! My first thought was you just execute new git commands in that repo, thus not changing any of the existing hashes. But maybe you do a git reset --hard old_hash_num before some test calls? Copying the static repo is quick, and modifying state is not hard as the test run progresses.
I think this all points to the fact that testing a tool that acts on a git repository is going to get complex at some point.
So, at this point I'll plan to clean up the git-dummy based approach and then update the PR. Does that sound good?
Took a quick read through but am on the go and need to look closer later and think about it more. But yes, for now let's go for the git-dummy single repo approach and take it from there.
@ehmatthes Hey, sorry about earlier. I tried responding via email while in an Uber and the result was kinda... ugly.
I could see this splintering into groups of tests that are more driven by issues with git-dummy than how the tests should be grouped for the purposes of testing git-sim.
I'm less worried about this because the whole point of git-dummy is to be tailored to creating exactly the types of repos we want for git-sim. git-dummy can be molded into exactly what we need for git-sim, not vice-versa.
You could also write a script that runs the test suite, expects a whole bunch of failures, and then rewrites the output files as the new reference files. But as I write that out, that sounds like a terrible idea! Tests really need to be stable to be reliable.
After chewing on that for the day I feel that if we were talking about unit tests (or even functional tests for more specific cases), then the stability really matters a lot more than for broad end-to-end tests that we're implementing. A huge blind spot in my thought process so far is that if we make a fairly common type of code change to git-sim that impacts all image outputs - say adding a logo into the corner of the images, or simply changing a font, or changing the spacing between commit circles and arrows - all the reference images will need to be recreated for all test cases. Therefore it's sort of implied that we're not doing this end-to-end testing for stability-of-test-outputs over time as git-sim changes. We're doing it so that in the course of everyday development of specific features/bugfixes, we can easily verify that something else didn't get broken. But our end-to-end tests won't help at all if a global change is made since that will likely just break everything with the only viable option being a total overhaul of all reference images. So circling back to your quoted statement above, I think it could be useful to have a script like you suggested that intentionally rewrites all outputs as new reference files, since this would be a fairly common operation when git-sim is altered in a global way.
But maybe you do a git reset --hard old_hash_num before some test calls? Copying the static repo is quick, and modifying state is not hard as the test run progresses.
At that point you're not only adding hardcoded git commands external to git-dummy when they may as well be captured in git-dummy itself, but you're also hardcoding references to specific git states (like resetting to specific commit shas) which isn't as flexible as being able to deal with arbitrarily generated repos. If the sample repo changes down the line, then all those git reset commands need to be updated too.
I think this all points to the fact that testing a tool that acts on a git repository is going to get complex at some point.
Hahaha yes this.
So, at this point I'll plan to clean up the git-dummy based approach and then update the PR. Does that sound good?
That would be wonderful.
No problem on the earlier reply. I've been responding quickly because I've had time this week, but I have no expectation that people reply immediately in open source work.
These are all good points. I jumped into this because Git is often seen as more difficult and confusing than it needs to be. This is a great tool for helping people see how Git works, and the fact that it acts on their own repository is fantastic. That's way more impactful than diagrams about a generic sample project. I'm also looking for real-world ways to get better at testing. I really find it satisfying to write the first tests in a project.
One of my testing principles is to not blindly pursue coverage. There are many projects where covering the most critical pathways and outputs is more beneficial to the overall project than trying to build out a test suite that covers all edge cases, but then becomes difficult to use and maintain. All the issues we've been discussing can be addressed quite reasonably with a test suite that has just enough coverage.
That's a really good point about image output changing over time. It should change! I'm also doing this work to get better at using pytest. You can add a CLI argument to pytest that rewrites the reference files. So that work wouldn't have to be separate from the test suite; you could run pytest and make sure everything fails as expected, and then run pytest --rewrite-ref-images or something like that, and it would rewrite all the reference image files for the current OS.
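Something along these lines could live in conftest.py (just a sketch; the flag and fixture names are whatever we'd settle on):

# conftest.py
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--rewrite-ref-images",
        action="store_true",
        default=False,
        help="Overwrite reference images with freshly generated output.",
    )


@pytest.fixture
def rewrite_ref_images(request):
    """True when this run should refresh the reference files."""
    return request.config.getoption("--rewrite-ref-images")

A test could then copy fp_generated over the reference file when that fixture is True, instead of running the comparison.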
I updated the PR. Thank you so much for the discussion, I've learned a lot from this work!
I jumped into this because Git is often seen as more difficult and confusing than it needs to be. This is a great tool for helping people see how Git works, and the fact that it acts on their own repository is fantastic. That's way more impactful than diagrams about a generic sample project. I'm also looking for real-world ways to get better at testing. I really find it satisfying to write the first tests in a project.
Thank you! I'm honestly so happy that git-sim is (hopefully) helping folks. It took me years to really get comfortable with git, and there are still plenty of commands/workflows I haven't used much, if at all. And testing is an area I've barely scratched the surface on, so your input and work are a huge help. Not sure when I would have gotten this going otherwise :D. Please let me know if you include git-sim in that article/newsletter; I'd like to check that out and pass it along through my "initial commit" channels.
All the issues we've been discussing can be addressed quite reasonably with a test suite that has just enough coverage.
Great point. I'm excited to start finding out what that balance is for git-sim. I anticipate it will be something like one test case per "type of git operation that is worth discussing", like what I mentioned above regarding various types of merges, resets, checkouts, etc. These tend to correspond close to 1-1 with logic paths in the git-sim codebase.
You can add a CLI argument to pytest that rewrites the reference files. So that work wouldn't have to be separate from the test suite; you could run pytest and make sure everything fails as expected, and then run pytest --rewrite-ref-images or something like that, and it would rewrite all the reference image files for the current OS.
Yes, this sounds perfect. I may ping you at some point if I have questions or run into issues with it :D
I updated the PR. Thank you so much for the discussion, I've learned a lot from this work!
Seriously, thanks again! I learned a lot too and will try to go through the PR soon and play around with it before (hopefully) merging. Will reach back if any notable changes need to be made.
I've been using Git for 15+ years, but I've gotten by that whole time almost entirely using a small subset of commands: init, add, commit, branch, merge, pull. Sometimes I fetch! But I've never rebased. I'm comfortable sorting out whatever I need to do though, because I have a fairly good understanding of how Git works internally. One of the reasons I really like this project is because it helps build an accurate understanding for people of how Git actually works, while using a project they're already familiar with (their own).
The article I mentioned will go out next Thursday, so I'll share a link here when it's out.
I anticipate it will be something like one test case per "type of git operation that is worth discussing", like what I mentioned above regarding various types of merges, resets, checkouts, etc. These tend to correspond close to 1-1 with logic paths in the git-sim codebase.
This makes perfect sense, and it's what I was expecting you to aim for. I will be curious to hear later on if the test suite has caught any issues before making a release. It certainly should at some point!
There's one other set of tests I think you should consider sooner rather than later. I think it would be really worthwhile to write one test for each of the global options. You might pick a single git operation, and run that one operation with each global setting once.
If you want some ongoing help, I'm happy to keep at this for a bit, at a slower pace than I jumped in this week. pytest has some really nice ways to do this efficiently. Parametrization of tests is amazing. Here's how you could write one test that would actually be run once for every global option:
@pytest.mark.parametrize(
    "global_option",
    ["--all", "--invert-branches", "--light-mode", "--reverse"],
)
def test_log(global_option):
    # Code that tests log output with the given global option.
    assert compare_images(fp_generated, fp_reference, global_option)
The compare_images() function would be updated to compare the generated image against the correct reference file.
If any of those fail, the output tells you exactly which global option/s resulted in the error.
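I don't know exactly how compare_images() is structured internally, so treat this as a rough sketch of the idea (the reference-file naming scheme and threshold are placeholders):

from PIL import Image, ImageChops

RATIO_THRESHOLD = 0.0075  # whatever acceptable pixel-diff ratio the suite uses


def compare_images(path_gen, path_ref, global_option=""):
    """Compare the generated image against the reference for this option."""
    # Assumes path_gen and path_ref are pathlib.Path objects.
    if global_option:
        # e.g. git-sim-log.png -> git-sim-log--light-mode.png
        path_ref = path_ref.with_name(f"{path_ref.stem}{global_option}{path_ref.suffix}")
    img_gen = Image.open(path_gen).convert("RGB")
    img_ref = Image.open(path_ref).convert("RGB")
    if img_gen.size != img_ref.size:
        return False
    diff = ImageChops.difference(img_gen, img_ref)
    # Fraction of pixels that differ at all between the two images.
    ratio_diff = sum(px != (0, 0, 0) for px in diff.getdata()) / (diff.width * diff.height)
    return ratio_diff < RATIO_THRESHOLD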
Great work on this project, and let me know if you'd like more help with the test suite.
Hi again @initialcommit-io. The article that led me here is posted now, if you're curious to skim it.
@ehmatthes That's awesome!! Thanks for including git-sim!!!!! 😸
Hey @ehmatthes,
So I just tested this out in Windows and am getting permissions errors in the compare_images() method in utils.py:
PermissionError: [Errno 13] Permission denied: 'C:\Users\Jack\AppData\Local\Temp\pytest-of-Jack\pytest-7\sample_repo0\sample_repo'
C:\Users\Jack\Desktop\git-sim.venv\Lib\site-packages\PIL\Image.py:3236: PermissionError
I tried printing out the paths to the reference files and generated files in the test_log() method, and the reference file path seems to be correct, but the generated file path looks like it's just grabbing the current directory, ".".
Did you run into any permissions issues in your Windows testing?
Meant to mention I tested running Git Bash as administrator but still got the permissions errors.
I did not run into any file permission issues during development, and I don't see any now.
I just modified test_log() to show the path to the generated file:
def test_log(tmp_repo):
    """Test a simple `git-sim log` command."""
    ...
    print("fp_generated:", fp_generated)
    assert compare_images(fp_generated, fp_reference)
Here's the output of running just this test on macOS:
(.venv)$ pytest -s -k "test_log"
...
fp_generated: /private/var/folders/gk/y2n2jsfj23g864pdr38rv4ch0000gn/T/pytest-of-eric/pytest-414/sample_repo0/sample_repo/git-sim_media/sample_repo/images/git-sim-log_07-04-23_15-33-49.png
And here's the same output on my Windows VM:
(.venv)> pytest -s -k "test_log"
...
fp_generated: C:\Users\eric\AppData\Local\Temp\pytest-of-eric\pytest-71\sample_repo0\git-sim_media\sample_repo0\images\git-sim-log_07-04-23_16-07-22.png
I get similar output in a Git Bash shell on my Windows VM. What output do you get for the path to the generated image file?
The path to the generated file is being read from the output of the git-sim log command. You can cd into the sample repo generated by the test suite and run git-sim log there manually. (Make sure you are in an active venv, where git-sim is available):
> cd C:\Users\eric\AppData\Local\Temp\pytest-of-eric\pytest-71\sample_repo0
> git-sim log
Output image location: C:\Users\eric\AppData\Local\Temp\pytest-of-eric\pytest-71\sample_repo0\git-sim_media\sample_repo0\images\git-sim-log_07-04-23_16-11-38.jpg
> git-sim -d --output-only-path --img-format=png log
C:\Users\eric\AppData\Local\Temp\pytest-of-eric\pytest-71\sample_repo0\git-sim_media\sample_repo0\images\git-sim-log_07-04-23_16-16-55.png
The first command here shows that git-sim log runs in the repo generated by the test. The second command shows that the full command run by the test suite, git-sim -d --output-only-path --img-format=png log, runs as well.
I would be curious to know if git-sim is generating different output in your environment, or if the test code is not parsing the output properly in your environment. If fp_generated is not a file path, I'd add one more line to test_log():
def test_log(tmp_repo):
    """Test a simple `git-sim log` command."""
    ...
    print("output:", output.stdout.decode().strip())
    print("fp_generated:", fp_generated)
    assert compare_images(fp_generated, fp_reference)
Sorry about that. It looks like the changes I was making to git-sim and trying to test caused an error that led to the output not being generated at all, hence the confusion.
After tweaking my change, the tests now run, but still fail due to bad ratio diff. I did confirm it is using the Windows reference files to compare, but looks like the fonts used differ between my output and the Windows reference files. I think this happened to you at one point as well, but we thought it was across OS's. This font difference makes the ratio_diff up to 5%.
Attaching images of the git-sim log Windows reference file and my output showing the differing fonts (see below). On my system it uses Monospace Bold which is the font that git-sim is configured to use for all output. I assume the reference files were generated on a system where Monospace was not available as a system font (or maybe just not in the correct bold weight), so it used the default system font as a fallback.
This could get hairy as devs try to test across different systems... I wonder if we could get around this by adding a global font argument to git-sim, which we can specifically set in tests to a default font that we know exists in each respective environment.
Sorry about that. It looks like the changes I was making to git-sim and trying to test caused an error that led to the output not being generated at all, hence the confusion.
Okay, that's helpful. I made a note to add a line to compare_images() to check that the image file exists, before actually doing the comparison. We can make a similar assertion about the output of each call, so in the future you'll know more quickly whether there was an issue with the output message or the actual image file.
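Roughly this, at the top of compare_images() (sketch):

def compare_images(path_gen, path_ref):
    # Fail fast with a clear message if either image is missing.
    assert path_gen.exists(), f"Generated image not found: {path_gen}"
    assert path_ref.exists(), f"Reference image not found: {path_ref}"
    ...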
After tweaking my change, the tests now run, but still fail due to bad ratio diff. I did confirm it is using the Windows reference files to compare, but looks like the fonts used differ between my output and the Windows reference files. I think this happened to you at one point as well, but we thought it was across OS's. This font difference makes the ratio_diff up to 5%.
Attaching images of the git-sim log Windows reference file and my output showing the differing fonts (see below). On my system it uses Monospace Bold which is the font that git-sim is configured to use for all output. I assume the reference files were generated on a system where Monospace was not available as a system font (or maybe just not in the correct bold weight), so it used the default system font as a fallback.
This was really helpful to know as well. I think there's a straightforward solution that makes testing much easier across systems. I changed all the instances of "Monospace" in git_sim_base_command.py to "Courier New". Here's the output on macOS:
Here's the output on Windows:
I believe this works better because "Monospace" just causes Manim to use the system's default monospace font. "Courier New" is a more specific font, which happens to be available on most macOS and Windows systems. I would be curious to see what the git-sim log output looks like on your system if you change Monospace to Courier New.
I think the most reliable fix for this would be to bundle a specific font with the test suite. I added a .ttf file to the test suite, and modified construct() in log.py to use the bundled font. This is a more complex change than just making the font a setting, because from my brief reading Manim expects bundled fonts to be used in a context manager.
To experiment with different fonts in the git-sim log output, I changed all references to "Monospace" to self.font, and modified __init__() in GitSimBaseCommand:
class GitSimBaseCommand(m.MovingCameraScene):
    def __init__(self):
        super().__init__()
        self.init_repo()
        self.fontColor = m.BLACK if settings.light_mode else m.WHITE
        self.font = "Monospace"
        self.drawnCommits = {}
        ...
With this change, all tests still pass. If I set self.font = "Courier New", tests fail but I get the output shown above.
To use a bundled font, I had to make the following change to construct() in Log:
def construct(self):
    with m.register_font("/Users/eric/projects/git-sim/tests/e2e_tests/ProggyClean.ttf"):
        self.font = "ProggyCleanTT"
        if not settings.stdout and not settings.output_only_path and not settings.quiet:
            print(
                f"{settings.INFO_STRING} {type(self).__name__.lower()}{' --all' if self.all_subcommand else ''}{' -n ' + str(self.n) if self.n_subcommand else ''}"
            )
        self.show_intro()
        ...
This made fairly ugly output, but I believe the output would be identical on all systems:
I don't think I'd bother using a tiny font like this. I was experimenting with a tiny font, and that's one that showed up in a quick search. Also, I'm guessing you wouldn't have to add a context manager to every command's construct() method. I imagine with a better understanding of the codebase there's one place to put that context manager that all files can use.
Thanks for all the deets! Yes, using Courier New in git_sim_base_command.py makes the tests work.
The quickest and least invasive solution for now seems to be adding a global flag for the user to specify the font, and then hardcoding that as Courier New in the tests' raw command.
This will have the added benefit of letting the user customize the font. I will try this out and update here.
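I'm picturing the test's raw command ending up roughly like this (hypothetical; the exact flag name and how get_cmd_parts handles the quoted value are TBD):

raw_cmd = 'git-sim --font="Courier New" log'
cmd_parts = get_cmd_parts(raw_cmd)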
I made a note to add a line to compare_images() to check that the image file exists, before actually doing the comparison. We can make a similar assertion about the output of each call, so in the future you'll know more quickly whether there was an issue with the output message or the actual image file.
And oh yes - this would be great!
Ok - this is implemented on the dev branch. I added a --font global option which is now set by the test suite. I also updated the reference files for macOS, since the test suite on Mac and Windows now uses the Courier New font.
Wanna try it out and lemme know if it works for you?
Okay, the dev branch works for me. I did have to rebuild my virtual environment. That wasn't obvious until I tried to run a git-sim command manually.
I removed the Windows-specific files, and the code that modified the reference file path to look for a Windows-specific file.
I also tried using a bundled font at the test-only level:
def test_log(tmp_repo):
    """Test a simple `git-sim log` command."""
    raw_cmd = "git-sim log"
    cmd_parts = get_cmd_parts(raw_cmd)
    os.chdir(tmp_repo)
    with m.register_font("/Users/eric/projects/git-sim/tests/e2e_tests/ProggyClean.ttf"):
        output = subprocess.run(cmd_parts, capture_output=True)
    fp_generated = Path(output.stdout.decode().strip())
    ...
This didn't work, and I wasn't too surprised by that. I believe subprocess.run() starts a new process, which doesn't see the changes that register_font() made to the environment. That said, I think there's a way to make this work if we end up running into the same environment-specific testing issues. I think you could make the test font .ttf file available in a shell, and then make the test's git-sim calls run in that same shell. If you run into an issue with different people generating slightly different but correct images, please ping me and I'll be happy to try building this out.
I made a PR so you can pull these small improvements if you want. Feel free to reject the PR and incorporate those changes whenever appropriate.
Sure I'll merge the PR, but can you make it to the dev branch instead of main? Looks like it's including 2 of the commits I made since those don't exist on main yet... But those should go away if you change it to the dev branch.
I'm newer to committing to others' projects, so some of my Git habits are unclear. I forgot to run Black before submitting the PR, so now the PR would include three commits:
9e0b88 (HEAD -> dev, update_tests) Use Black formatting conventions.
6d0582 Remove Windows-specific test files.
65050d Additional assertions about filepaths that are generated during testing.
7c9c32 (upstream/dev) Update e2e test reference files for macOS
Should I submit a PR with these 3 commits? Do you want me to squash them? Do you squash them when you merge?
Thanks for running Black - this is my first time using it, as suggested by other contributors. On my Mac I was able to get Black to run automatically when exiting Vim (via a Vim plugin), but I had issues setting that up on my current Windows machine, so now it's manual again and I often forget to run it myself too 😸.
You can just submit the PR to the dev branch with the 3 commits, I'll squash and merge them in the GitHub interface.
Sure thing, I'm trying to use Black more consistently in my own work as well. I resubmitted the PR.
Merged it and added tests for the remaining non-networked subcommands. I regenerated all the reference images on my Windows machine and then tested on my Mac, where 12/18 tests passed and the other 6 failed due to bad ratio diff. Most of these were barely above the threshold, but the biggest difference was about 0.0126.
I'm thinking this is just due to how the image files are generated on different platforms (regardless of font). Anyway, I'm thinking we can just boost the acceptable ratio diff from 0.0075 to 0.015 to take into account these subtle differences.
(the updated reference files are pushed up in the dev branch if you want to test on both your systems to compare)
I got the same results on my Mac. The highest ratio came from the log command, but I opened the generated image and the reference image and couldn't see any difference at all. I think a ratio of 0.015 sounds perfectly reasonable. I'm going to be a little embarrassed, but mostly curious if someone who's more familiar with working with images comes along and has a much simpler way to test image output.
The test suite is slower now, as expected when running 6 times as many tests. It takes 80s to run the test suite on my system. I was tempted to focus on the compare_images() function, but then I remembered to profile. I added profiling code to the compare_images() function, and around each git-sim call. It takes less than 0.1s to run compare_images(), but about 3-6s to run each git-sim command:
Spent 3.334 seconds running test_add.
Spent 0.085 seconds in compare_images().
.Spent 4.103 seconds running test_branch.
Spent 0.087 seconds in compare_images().
.Spent 5.484 seconds running test_checkout.
Spent 0.088 seconds in compare_images().
.Spent 5.732 seconds running test_cherrypick.
Spent 0.088 seconds in compare_images().
.Spent 3.235 seconds running test_clean.
Spent 0.085 seconds in compare_images().
.Spent 3.205 seconds running test_commit.
Spent 0.086 seconds in compare_images().
.Spent 3.837 seconds running test_log.
Spent 0.091 seconds in compare_images().
.Spent 5.776 seconds running test_merge.
Spent 0.088 seconds in compare_images().
.Spent 3.617 seconds running test_mv.
Spent 0.086 seconds in compare_images().
.Spent 6.985 seconds running test_rebase.
Spent 0.090 seconds in compare_images().
.Spent 4.160 seconds running test_reset.
Spent 0.086 seconds in compare_images().
.Spent 3.301 seconds running test_restore.
Spent 0.086 seconds in compare_images().
.Spent 3.348 seconds running test_revert.
Spent 0.086 seconds in compare_images().
.Spent 3.518 seconds running test_rm.
Spent 0.086 seconds in compare_images().
.Spent 3.328 seconds running test_stash.
Spent 0.086 seconds in compare_images().
.Spent 3.205 seconds running test_status.
Spent 0.085 seconds in compare_images().
.Spent 5.336 seconds running test_switch.
Spent 0.089 seconds in compare_images().
.Spent 4.065 seconds running test_tag.
Spent 0.091 seconds in compare_images().
If you want to profile the current suite on your system, it's on my profile_tests branch.
There are a couple ways to speed up tests...
The main way to speed up tests is to run them in parallel. pytest has a really good plugin called pytest-xdist which handles all of the overhead of distributing tests for many test suites. It works for our suite. Install pytest-xdist from PyPI, and then run pytest -n auto. It will figure out the appropriate amount of distribution to do on your system. On my system, it reduces the total test suite time from 80s to 20s.
Non-parallel output:
$ pytest
============== test session starts ===========================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/eric/projects/git-sim
plugins: xdist-3.3.1
collected 18 items
tests/e2e_tests/test_core_commands.py ..................[100%]
============== 18 passed in 79.52s (0:01:19) =================
Parallel output:
$ pytest -n auto
============== test session starts ===========================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/eric/projects/git-sim
plugins: xdist-3.3.1
10 workers [18 items]
.................. [100%]
============== 18 passed in 22.69s ===========================
The -s flag doesn't work when running tests in parallel, so if you're examining output it's better not to run the tests in parallel. Also, the benefits of running parallel tests are highly dependent on the system architecture. Some people will see more benefit than others from running parallel tests; some people may see worse performance.
You can organize the test suite into multiple files. For example you could pick a set of core commands you always want to test, and put those in test_core_commands.py. Then other tests go in test_noncore_commands.py. You'd run pytest tests/e2e_tests/test_core_commands.py most of the time, and then pytest or pytest -n auto when you want to run all the tests. Of course it doesn't need to be core and noncore, it could be simple and complex, or however you think about the overall set of commands.
pytest has a way of marking tests, so you could just mark some tests as core tests. Add the decorator @pytest.mark.core, and then run pytest -m core. This would only run the tests you've marked as core tests.
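If you go that route, you'd also want to register the marker so pytest doesn't warn about an unknown mark. A minimal sketch, assuming a pyproject.toml-based pytest config:

# pyproject.toml
# [tool.pytest.ini_options]
# markers = ["core: essential commands to test on every run"]

import pytest


@pytest.mark.core
def test_log(tmp_repo):
    ...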
I think a ratio of 0.015 sounds perfectly reasonable.
Sweet! I'll bump that up.
The main way to speed up tests is to run them in parallel. pytest has a really good plugin called pytest-xdist which handles all of the overhead of distributing tests for many test suites. It works for our suite. Install pytest-xdist from PyPI, and then run pytest -n auto. It will figure out the appropriate amount of distribution to do on your system. On my system, it reduces the total test suite time from 80s to 20s.
Uhhh..... this is awesome!!! On my PC this speeds things up from 45s to 14s. I think it's a good compromise to gain that kind of speedup by running in parallel, and if something fails to rerun in series to get the output.
Random note on parallelism - I spent a lot of time using the multiprocessing module to try to get git-sim to render individual animations in parallel (by rendering independent segments of each video in parallel) to speed up performance, as I feel that Manim's performance is a major limitation of git-sim.
Unfortunately I couldn't get it working in the end as I had issues passing data (specifically file handles/references) in between processes. It was pretty weird behavior and I was kinda surprised I couldn't get it working - seemed so close. Maybe one day as Python's multiprocessing and pickling tools get more robust...
pytest has a way of marking tests, so you could just mark some tests as core tests. Add the decorator @pytest.mark.core, and then run pytest -m core. This would only run the tests you've marked as core tests.
Good to know!
I have one more thought at the moment, not about performance. You can set custom messages to display when a test fails. For example, where you're seeing "bad pixel ratio", you could use:
msg = f"{path_gen} image comparison failed: {ratio_diff}"
assert ratio_diff < 0.015, msg
Then you'd see that information for failing tests even without using the -s flag.
A lot of the decisions like this just come down to how you want to use your test suite. What kind of information do you want to see in all test runs? What kind of information do you want to see only when looking at results in detail?
Funny you mention that - as a part of the threshold update I just committed I updated that very same printed message to specify the subcommand that failed:
print(f"bad pixel ratio ({path_ref.stem[8:]}): {ratio_diff}")
😸
On my PC this speeds things up from 45s to 14s.
Do you mind if I ask about the specs on your system? I think I have a reasonably well-specced system, but your tests are twice as fast without parallel, and still significantly faster when run in parallel. I wonder if there's anything about my setup that's slowing things down for me.
Random note on parallelism - I spent a lot of time using the multiprocessing module to try to get git-sim to render individual animations in parallel (by rendering independent segments of each video in parallel) to speed up performance, as I feel that Manim's performance is a major limitation of git-sim.
I've got other things going on for a while so I'm going to stick to a focus on testing for the time being, but I would love to look into this at some point.
Funny you mention that - as a part of the threshold update I just committed I updated that very same printed message to specify the subcommand that failed:
print(f"bad pixel ratio ({path_ref.stem[8:]}): {ratio_diff}")
You might give the assert message a try, because I think this is the only remaining reason to use the -s flag.
Do you mind if I ask about the specs on your system? I think I have a reasonably well-specced system, but your tests are twice as fast without parallel, and still significantly faster when run in parallel. I wonder if there's anything about my setup that's slowing things down for me.
I upgraded my PC's CPU to the 13600k late last year, runs a lot faster on my PC now than on my Intel mac.
I've got other things going on for a while so I'm going to stick to a focus on testing for the time being, but I would love to look into this at some point.
Oh definitely - I'm grateful for just the testing help; this has been incredibly good for the project. Happy to share what I've done on the performance front at some point if you get some extra time!
You might give the assert message a try, because I think this is the only remaining reason to use the -s flag.
Oh gotcha I'll take a look - I missed that assert statement in your note.
Oh gotcha I'll take a look - I missed that assert statement in your note.
To be clear, that would move the assertion from the individual test functions to the compare_images() function. That seems like a slightly better structure anyway.
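In code, that move might look roughly like this, reusing the message you already print (sketch):

def compare_images(path_gen, path_ref):
    ...
    msg = f"bad pixel ratio ({path_ref.stem[8:]}): {ratio_diff}"
    assert ratio_diff < 0.015, msg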
I'll go quiet for a while, but if you ever reply somewhere here and don't hear from me, feel free to reach out directly over email. I think I'm going to write a newsletter series on Git at some point, and I'll definitely use git-sim for the visualizations for that series.
One more quick thing, you might want to remove the stub file test.py from the top level. It's not causing any problems, but with pytest in use it's just taking up space and it might confuse someone looking at the project.
I think I'm going to write a newsletter series on Git at some point, and I'll definitely use git-sim for the visualizations for that series.
That's awesome. Please drop me a note when it's ready; I'd love to take a look.
One more quick thing, you might want to remove the stub file test.py from the top level.
Oh yeah - I thought about that but decided to leave it as a stub for unit tests. But you're right, now that we have the tests/ folder it probably at least makes sense to move it in there...
Closing this since related work to refine the test suite will be handled through #99.
Hi, I have been following this project for some time since its release. Great work.
But with support for several git features coming in, are there any plans or ideas for testing the features? I would like to work on this.
For Manim-based tests, we can have a look at how ManimCE tests its own code: https://docs.manim.community/en/stable/contributing/testing.html and https://github.com/ManimCommunity/manim/tree/main/tests
Hi @initialcommit-io, is there any test design you have in mind?
Some of my ideas:
Once done we can add an actions pipeline to ensure continuous integration.
I am working on this; let me know if there are any existing plans for it.
Once again, thank you for this wonderful project.