Add Ability to Provide Custom Logic for Asserting a Match

patbakdev commented 3 years ago

I have a scenario where I need to take and compare a snapshot of string and non string data (specifically EPUB and PDF). I use pytest-snapshot for the string data and have written additional functions to do the non-string comparisons, but I prefer not to have two snapshot solutions and would also like to be able to take advantage of the --snapshot-update ability.

I took a look at what @otosystems was doing with a custom-comparator in https://github.com/otosystems/pytest-snapshot/tree/custom-comparator, but it didn't fit my needs.

So I took a stab at creating an additional approach. You can find an initial proposal at https://github.com/patbakdev/pytest-snapshot/tree/custom-matcher. Basically, my plan is to allow for an optional func to be base to the assert_match methods. By default it would simply pass on to the existing string comparison function so its backwards compatible. I also wrote a simple filecmp function as a proof of concept/example and for use in tests. Presumably more matching functions could be provided in the future.

I don't think this is the best approach, but it was the least impactful. There is an unfortunate coupling between the plugin and the function needed for the update logic. A better approach might be to separate out the logic into a plug-able architecture with methods for snapshot update, comparison, etc., but that would require substantial changes to existing code and more time. Let me know if you are open to making some kind of change here and if so any thoughts you have on the approach/implementation.

Thanks

joseph-roitman commented 3 years ago

Hello, thanks for the suggestion. I guess that the library could also be useful for non-string objects.

We could safely add support for bytes in addition to str without changing the API or much downsides. pytest gives helpful diff messages for bytes, though they wont be as useful as for string data.

We could also add custom comparator logic, something like

@dataclass
class Person:
    name: str
    age: int

def compare(value: Person, expected_value: Person) -> None:
    # This function should raise an exception if the values are different
    assert value.name == expected_value.name and value.age == expected_value.age

def encode(obj: Person) -> bytes:
    # This function should encode the object into bytes, preferably in a human-readable format for good readability.
    return f'Person\n name: {obj.name}\n age: {obj.age}'.encode('utf-8')

def decode(data: bytes) -> Person:
    # This function should reverse the ``encode`` function's operation.
    return ...

snapshot.assert_match(obj, 'my_snapshot.txt', compare=compare, encode=encode, decode=decode)

This would give the users the ability to use pytest-snapshot for almost any type of object. There is one big downside to this solution. The user might encode data into the snapshot that he doesn't compare. For example, if the above compare function only compared the name and not the age. Anyone that looked at the snapshot file would probably assume that the age is also being tested, when in fact, it is not. This is the reason that I would rather avoid supporting custom comparison functions.

So what do you think? Would supporting bytes be good enough? Or do you need custom comparator logic? I could add it, and put it at the bottom of the documentation with a warning of its downsides. Do you have any improvements to the API I specified above?

patbakdev commented 3 years ago

Having a helpful diff is not a big issue in my book. I usually always end up using a proper diff tool. For PDFs I convert and compare each page, but I suppose a byte comparison would work in this case. The most important part is knowing that its different and then I can use DiffPDF or similar tool. EPUB is more difficult since its an archive of text and images. Currently my compare function extracts the files to a temporary directory then compares each one after normalizing one part of the xml that is date based. I like the idea of having the compare function for this, but I also think I could move this logic into the test so that I use pytest-snapshot against each text file (str) and image (byte) individually.

So, if we could get a byte comparison that would be great, a custom comparative would be nice and might be more shareable, but might not be worth changing the API for, has the downside you mentioned, and could be accomplished with more testcase code/logic.

I'm basically just using filecmp for the file compare. The API you have looks good to me (better than what I had). If the only change made is to support bytes, then it would be nice not to have to provide any additional information (auto detect?) or a flag.

Thanks.

joseph-roitman commented 3 years ago

I created branch bytes-snapshots which adds support for bytes and lays the groundwork to easily add custom comparators/encoders/decoders.

As you asked, it auto-detects whether the object is str or bytes but this leads to a troubling edge case. pytest-snapshot will not raise an error if the tested value changes from a string into bytes with the "same" value or vice versa: If a snapshot was created from 'hello' and then the value changed to b'hello', the test would not raise an error.

I am not sure what the ideal solution to this is:

Ignore the edge case.
Save the type of the object that created the snapshot file somewhere. I don't want to save it at the start of the snapshot file since the file is probably a valid json file/pdf/etc.
Create extra functions assert_match_bytes and assert_match_dir_bytes.
Some other solution?

joseph-roitman commented 3 years ago

I am leaning towards the first solution since:

It keeps the API clean and simple.
The library's goal isn't to be used as a single tool to unit test an object. Users can test the object's type themselves.
Trying to save type information can get problematic if we start supporting snapshots for all objects. (builtins, renamed classes, dynamically created types, ...)

What do you think?

patbakdev commented 3 years ago

Agreed. First solution seems acceptable.

I switched over my code to use this branch and was able to reduce my comparison code considerably. I still needed to tweak the built outputs to remove EPUB/PDF metadata before doing the comparison, which meant that I had to extract and compare the individual contents (extract EPUB archive files, convert PDF to individual pages, etc.) and also check for new/missing files, instead of a direct comparison, but I think its reasonable to expect this to be something the user should do.

I think this moves things far enough to be more useful to more people and open up additional scenarios to explore in the future, but allows me to solve for my use case now.

-Thanks.

joseph-roitman commented 3 years ago

In regards to epub metadata that changes every build due to the date, that isn't something that this library can solve. (Perhaps freezegun could help you?

In regards to PDF, why do you still need to convert to individual pages? For easier diffing? Instead of checking for new/missing pages yourself, this is just the case to use assert_match_dir.

I'll merge the branch into the library soon.

patbakdev commented 3 years ago

Thanks for tip. I hadn't heard of freezegun. It won't work for me I don't think because I am shelling out to external tools that are doing the work, but definitely something I'll put in the toolbox for later.

I also was surprised about the PDF failing to compare consistantly, but I think there is some metadata (again, build date) that gets rebuilt into the PDF each time. I'm using pandoc and I don't think I can control that. It might be possible to strip that information out, but anytime I made too many changes to a binary file, the more likely it wont be the same. And I already had the code to convert the PDF to page jpgs.

As far as assert_match_dir, I think I misunderstood the purpose of that method at first, but looking at the code now, I see how it could work, but I think it needs to be updated from values_by_filename: Dict[str, str] to values_by_filename: Dict[str, bytes] unless I am missing something. (I still haven't had my coffee :smile:). If so, this should probably be updated as part of the byte-snapshot change, yes?

patbakdev commented 3 years ago

Actually, I just changed my code to use assert_match_dir and it worked fine. I guess those type hints are really just hints.

joseph-roitman commented 3 years ago

I updated the type hint in the branch. Perhaps you were looking at the master branch type hints?

patbakdev commented 3 years ago

Yes, duh.

Thanks for making these changes btw.

On Mon, Apr 19, 2021 at 1:14 PM, Joseph Roitman @.***> wrote:

I updated the type hint in the branch. Perhaps you were looking at the master branch type hints?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

joseph-roitman commented 3 years ago

I added bytes support in v0.6.0.

Do you still think that there is a real benefit for supporting custom comparisons? I think it is good that you are forced to normalize the snapshot by removing dates. If I saw your snapshot file that contained some hard coded dates, I would lose trust in the snapshot test, thinking that if it ignored the date, then what else does it ignore?

patbakdev commented 3 years ago

I don't worry about the normalization too much. They are limited in number and I try to drop in placeholders that make it very clear what is going on. That in addition to inspecting the diffs closely. But if I were new to a large project, I could see the potential for some confusion. This is just a problem that is hard to avoid with dates being so prevalent in logs, stdout, etc. (But the tool/fixture you mentioned earlier sounds great for reducing that problem).

As for custom comparitors, I could see the benefit for the cases that I have been working with, in as much as having specific PDF and EPUB comparitors that could be shared with others and between project could make things easier. But as you said many people may want to handle that logic themselves to ensure its not doing the wrong thing. I'm trying to think of other scenarios, but between bytes and string, I think a user is pretty much unblocked.

The dataclass scenario is also interesting I think, but that might be beyond the scope of this project. I always looked at snapshots as a quick and dirty way to compare output (files or stdout). Anything more than that and I would likely start doing the comparison myself inside the test code. That's not to say having a library available to do that would not be helpful, but there may already be existing solutions for that and/or that may be a level of scope expansion that doesn't fit with simple snapshots.

So, I guess at this point I'd classify the custom comparitor as a nice to have.

joseph-roitman / pytest-snapshot

Add Ability to Provide Custom Logic for Asserting a Match #29