Binary snapshots - Githubissues

lasernoises commented 4 months ago

This is a relatively basic implementation of binary snapshots. I decided to adapt my implementation mentioned in https://github.com/mitsuhiko/insta/issues/196#issuecomment-2059256019 to handle binary files in general and remove the whole thing that prints images to the terminal.

The basic approach is store a file next to the snapshot file with the provided extension appended. The snapshot macro gets passed a closure with an &mut File parameter. Just using a Vec<u8> would be somewhat simpler in implementation, but I would be worried about memory usage if there's a bunch of tests running with large files.

I decided not to return a Result from the closure, because if there's an I/O error inside it probably makes sense just to panic there because we're inside a test anyway. But maybe it would still make sense to return a Result there just so there is no need for .unwrap() inside the closure?

Currently it directly writes the file at the location where it will end up at the beginning of the assert_snapshot function. This means the binary file is created as snap.new.$extension much sooner than the metadata is saved because we need the file to be able to compare it. What is a bit weird is if there is a crash somewhere between overwriting the binary file and storing the .snap file it will leave the two files in an inconsistent state. Creating it as .snap.new.$extension.tmp and then moving it when saving the metadata might be a good alternative approach.

There is also a problem if the extension changes between two runs where the old pending binary file will stay around. It doesn't seem to a problem in practice though because when running with cargo insta test because cargo-insta removes the old pending snapshots first. Maybe it would make sense to also do that generally at the start of assert_snapshot or alternatively cleanup all .snap.new.* somewhere.

The review TUI looks like this:

2024-05-02_15:36:43

2024-05-02_15:58:36

The open action opens both files (or just one if there's no existing snapshot, or the existing one is text) in an external application using the open crate.

The file paths are using the OSC-8 escape sequence (described here) to link to the files. Here it says that file:/// without a hostname should be avoided, but at least my terminal (WezTerm) seems to support it without a hostname and so I left that out for now since it would mean an extra dependency on a crate to get the hostname. As an alternative we could also just print the path including file:// to the terminal since many terminals seem to support clicking on those as well.

mitsuhiko commented 4 months ago

Creating it as .snap.new.$extension.tmp and then moving it when saving the metadata might be a good alternative approach.

I like this in theory. What might make this whole thing more robust would be to say that the following has to be true:

foo.snap.ANY_EXT has to be unique. On review/write if more than one extension is found, other files are deleted. A not yet fully persisted file is temporarily stored as foo.snap._ to mark it as temporary. If it's left over after a run, it would be cleaned up like any other duplicate extension.

The open action opens both files (or just one if there's no existing snapshot, or the existing one is text) in an external application using the open crate

It would be interesting to also add a p for preview on macos which would open quick-look on supported files. Not sure if an equivalent exists on windows. But this could be done in a followup.

lasernoises commented 3 months ago

I've made some adjustments:

As you suggested I've changed it so the contents are first written to a temporary file with the .snap._ extension.

At the start of the assert_snapshot function and after accepting and rejecting in review any extra files are now cleaned up.

I had to change the Snapshot::save* methods to change &mut self so they can change the path to the newly saved path. Which makes it a little bit weird because it's not necessarily clear from the name that those functions will also move a file now.

This problem could also be solved in different ways: Instead of moving the file we could copy it and then always regenerate the binary file path when needed. But that would probably mean an extra parameter so it knows where to copy the contents file from. Or maybe save_new could make the assumption that there's a temporary file in .snap._.

Also the .new and ._ extensions and extensions starting with .new.* are now forbidden so there can't be conflicts.

Maybe I can add the quick-look thing for macos soon, but I'll have to borrow a mac from a colleague to test it. Also I'm not sure if opening both files in quick-look makes sense. Maybe two different shortcuts makes more sense.

max-sixty commented 3 months ago

While we wait for a more informed review, one thought (not confident): I notice we add new enums — would there be any case for collapsing "Inline | Named" enums and the new "String | Binary" enums into a single enum? Every Binary snapshot must be Named. But I'm not sure if that makes the structure simpler...

Edit: one low-hanging fruit as an example — the proposed code has SnapshotType with String | Binary. We currently don't encode whether it's an Inline snapshot in that struct, we just check whether it has a name — we could instead add Inline as an option there.

lasernoises commented 3 months ago

While we wait for a more informed review, one thought (not confident): I notice we add new enums — would there be any case for collapsing "Inline | Named" enums and the new "String | Binary" enums into a single enum? Every Binary snapshot must be Named. But I'm not sure if that makes the structure simpler...

Edit: one low-hanging fruit as an example — the proposed code has SnapshotType with String | Binary. We currently don't encode whether it's an Inline snapshot in that struct, we just check whether it has a name — we could instead add Inline as an option there.

I suppose the refval thing is one of those cases. I could change the SnapshotContent enum to something like this:

type SnapshotName<'a> = Option<Cow<'a, str>>;

pub enum SnapshotValue<'a> {
    String {
        name: SnapshotName<'a>,
        content: &'a str,
    },

    InlineString {
        reference_content: &'a str,
        content: &'a str,
    },

    Binary {
        name: SnapshotName<'a>,
        write: &'a mut dyn FnMut(&mut File),
        extension: &'a str,
    },
}

or maybe

type SnapshotName<'a> = Option<Cow<'a, str>>;

pub enum ReferenceValue<'a> {
    Named(SnapshotName<'a>),
    Inline(&'a str),
}

pub enum SnapshotValue<'a> {
    String {
        refval: ReferenceValue<'a>,
        content: &'a str,
    },

    Binary {
        name: SnapshotName<'a>,
        write: &'a mut dyn FnMut(&mut File),
        extension: &'a str,
    },
}

I'm a bit less sure about SnapshotType. I don't think we need to represent the inline case there because the metadata isn't being stored in that case if I understand correctly.

max-sixty commented 3 months ago

Yes those both seem reasonable, though on reflection not obvious changes either.

(I note we're handling the contents differently between named string and named binary snapshots — string snapshots we hold the value in memory while binary snapshots we hold only the path, which leads to some of the split. Maybe that's because we expect the binary snapshots to be larger?)

I'm a bit less sure about SnapshotType. I don't think we need to represent the inline case there because the metadata isn't being stored in that case if I understand correctly.

(edit) — I tried creating a SnapshotType on the existing master; it doesn't pull its weight since the Snapshot is largely used similarly across types.

lasernoises commented 3 months ago

(I note we're handling the contents differently between named string and named binary snapshots — string snapshots we hold the value in memory while binary snapshots we hold only the path, which leads to some of the split. Maybe that's because we expect the binary snapshots to be larger?)

Yes, the reason for storing them directly in files is the size. I was concerned about large files, especially when multiple tests are running concurrently. But this is definitely a tradeoff since passing around a Vec<u8> would make the code simpler and more uniform. I also think and it's unclear whether creating huge binary snapshots is a realistic usecase especially if you want to store them in git.

max-sixty commented 3 months ago

But this is definitely a tradeoff since passing around a Vec<u8> would make the code simpler and more uniform. I also think and it's unclear whether creating huge binary snapshots is a realistic usecase especially if you want to store them in git.

Yeah. And IIUC we are storing them in memory when writing them at the moment. (+ agree given they're in git, the likelihood of multi-GB values seems very remote...)

max-sixty commented 2 months ago

@lasernoises I guess we're still waiting on a review.

Not a verdict, but on further reflection, I do think simplifying the code to load snapshots into memory would make sense. The prospect of multi-GB sized snapshots seems quite unlikely — they're stored in git, the computational cost of diffing snapshots of that size will likely bind before the memory size becomes an issue. Even before your changes, the code handling different types of snapshots is a bit complicated.

(But ofc no need to make changes based on my opinion)

lasernoises commented 1 month ago

Sorry for replying so late.

Yeah. And IIUC we are storing them in memory when writing them at the moment. (+ agree given they're in git, the likelihood of multi-GB values seems very remote...)

Because it's a callback that get's an &mut File it doesn't have to ever be in memory as a whole thing. For example in my usecase, which is for PDFs the PDF as an array of bytes is never fully in memory, only an intermediate datastructure that can then be written out as bytes.

@lasernoises I guess we're still waiting on a review.

Not a verdict, but on further reflection, I do think simplifying the code to load snapshots into memory would make sense. The prospect of multi-GB sized snapshots seems quite unlikely — they're stored in git, the computational cost of diffing snapshots of that size will likely bind before the memory size becomes an issue. Even before your changes, the code handling different types of snapshots is a bit complicated.

(But ofc no need to make changes based on my opinion)

I think you're right and that for real usecases you'll want to minimize size anyways. In my PDF case I avoid embedding fonts for example and use the builtin fonts instead.

I think I'll try to take some time next week to try it with the binaries in memory.

max-sixty commented 1 week ago

Thanks for resolving the merge conflicts! That must have been a decent amount of work...

One note — I've recently spent lots of time simplifying the internals, and I'd be very hesitant to merge the approach that uses files rather than in-memory values — I think that brings another layer of complication without much benefit. (at least — I want to understand whether extremely large snapshots are really being used, such that it's worth maintaining the code for two different approaches)...

lasernoises commented 1 week ago

Thanks for resolving the merge conflicts! That must have been a decent amount of work...

One note — I've recently spent lots of time simplifying the internals, and I'd be very hesitant to merge the approach that uses files rather than in-memory values — I think that brings another layer of complication without much benefit. (at least — I want to understand whether extremely large snapshots are really being used, such that it's worth maintaining the code for two different approaches)...

Yes, my plan is still to try it with the binary in memory (it just took me way longer to get to it than I thought..). I just decided to still try basing it on the previous work on this branch because after looking over the changes again it seems that a bunch of them will still be needed.

Although, maybe for a clean history and preventing unnecessary changes I now think I'll try to start clean. (Also because there's already more conflicts...)

max-sixty commented 1 week ago

OK great! Hopefully it could use at least some of the existing code...

One note is that a couple of outstanding PRs change the SnapshotContents struct slightly, adding a kind field: https://github.com/mitsuhiko/insta/pull/581/files#diff-2666420e8389f1d842ce108018b05f3f68fa4b920f8c1793a3d4d01f21443544R549-R553.

We could adjust that to make SnapshotContents an Enum itself if the behavior becomes more different per kind, like you've done here, or delegate more behavior to that kind.

I'd like to merge that ASAP, though waiting on feedback from @mitsuhiko

lasernoises commented 1 week ago

I ended up getting relatively far today. I currently have the new version at: https://github.com/lasernoises/insta/tree/in-memory-binary-snapshots. There's still some smaller stuff I need to take care of and then I'll probably have some questions about some of the details. (I'll also need to take a more detailed look at #581 and see how compatible that one is with my changes.)

But first: should I create a new PR and close this one or reset this branch to the other one?

max-sixty commented 1 week ago

But first: should I create a new PR and close this one or reset this branch to the other one?

Totally as you wish!

lasernoises commented 1 week ago

I decided to open a new PR because the implementation is sufficiently different and it might be nice to be able to easily compare them.

mitsuhiko / insta

Binary snapshots #489