tophat / syrupy

🥞 The sweeter pytest snapshot plugin
https://tophat.github.io/syrupy
Apache License 2.0

Excessive reparsing of `ambr` files in tests with large snapshots #837

Open · StephanMeijer opened 7 months ago

StephanMeijer commented 7 months ago

Describe the bug

We are encountering a problem in our testing process where we handle very large DOCX/XML files. Our test setup involves using the syrupy library to create snapshots. Here's a snippet of our test code:

import lxml.etree
import pytest

from nldoc_p4_docx.docx_processing.docx.document import Document
from tests.conftest import all_fixture_paths, matcher

# The `snapshot` fixture is injected by the syrupy pytest plugin;
# no import is needed for it.
@pytest.mark.parametrize('xml', all_fixture_paths('docx/documents'), indirect=True)
def test_loading_documents(snapshot, xml: lxml.etree._Element):
    assert Document(xml) == snapshot(matcher=matcher)

While generating large snapshot files is not a problem in itself, the `.ambr` file is reparsed in full for every test, even though each test only needs the snapshot for its own fixture.

To reproduce

Run a test that uses large fixtures, which results in the creation of large snapshot files.
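
As a minimal sketch of the setup (hypothetical test; the inline payload stands in for our large DOCX fixtures), many parametrized cases end up sharing one `.ambr` file:

import pytest

# All fifty cases serialize into the same __snapshots__/test_repro.ambr,
# so each case pays the cost of re-reading that entire (large) file.
@pytest.mark.parametrize('case', range(50))
def test_large_shared_snapshot(snapshot, case):
    assert {'case': case, 'payload': 'x' * 100_000} == snapshot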

Expected behavior

We expect a large fixture to slow down only its own test, not every test that shares the same snapshot file. A potential solution might be a setting that splits snapshots with different names into separate files.

Additional context

Our repository is available at: https://gitlab.com/toegang-voor-iedereen/conversion/preprocessing/docx-preprocessor

StephanMeijer commented 7 months ago

I saw a huge increase in performance after splitting my tests into separate files, but syrupy still seems to take a long time to parse the snapshots.

StephanMeijer commented 7 months ago

As also mentioned in #119, with this extension I reduced the time my tests took from around 200 seconds to around 5:

from syrupy.extensions.single_file import SingleFileSnapshotExtension
from syrupy.extensions.amber.serializer import AmberDataSerializer

class SingleFileAmberSnapshotExtension(SingleFileSnapshotExtension):
    def serialize(self, data, **kwargs):
        # SingleFileSnapshotExtension writes binary by default, hence encode().
        return AmberDataSerializer.serialize(data, **kwargs).encode()

StephanMeijer commented 7 months ago

Got things working perfectly with the following snippet:

from syrupy.extensions.single_file import SingleFileSnapshotExtension, WriteMode
from syrupy.extensions.amber.serializer import AmberDataSerializer

class SingleFileAmberSnapshotExtension(SingleFileSnapshotExtension):
    # One .ambr snapshot file per test, written as text rather than binary.
    _file_extension = 'ambr'
    _write_mode = WriteMode.TEXT

    def serialize(self, data, **kwargs):
        return AmberDataSerializer.serialize(data, **kwargs)
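
For reference, a minimal sketch of wiring the extension into a test via syrupy's use_extension API (the snapshot_single fixture name here is hypothetical):

import pytest

@pytest.fixture
def snapshot_single(snapshot):
    return snapshot.use_extension(SingleFileAmberSnapshotExtension)

def test_document(snapshot_single):
    assert {'key': 'value'} == snapshot_single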

With this workaround in place, I am not going to invest any more time in this ticket. If there are no objections, it can be closed.

noahnu commented 7 months ago

Let's keep the issue open. We shouldn't be parsing the snapshot file more than once: there's a cache, so it must be getting invalidated. Not sure when I'll have time to dig into this, but I'll loop back when I have a chance.
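
For illustration only, the kind of caching being described (hypothetical names, not syrupy's actual internals): if the cache key changes between tests, e.g. because writing snapshots bumps the file's mtime mid-run, every lookup misses and the file is reparsed.

from functools import lru_cache
from pathlib import Path

# Hypothetical sketch: a parse cache keyed by (path, mtime). Any mid-run
# write to the snapshot file changes the mtime, so subsequent tests miss
# the cache and reparse the whole file.
@lru_cache(maxsize=None)
def parse_snapshot_file(path: str, mtime: float) -> list[str]:
    # Stand-in for real .ambr parsing: split the file into entries.
    return Path(path).read_text().split('# name: ')

def load_snapshots(path: str) -> list[str]:
    return parse_snapshot_file(path, Path(path).stat().st_mtime)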

StephanMeijer commented 7 months ago

> We shouldn't be parsing the snapshot file more than once: there's a cache, so it must be getting invalidated.

That is interesting, as the performance difference is about a factor of forty (around 200 seconds down to around 5).