python-jsonschema / check-jsonschema

A CLI and set of pre-commit hooks for jsonschema validation with built-in support for GitHub Workflows, Renovate, Azure Pipelines, and more!
https://check-jsonschema.readthedocs.io/en/stable

Fix performance issues when having remote schema with multiple refs #451


alex1701c commented 4 days ago

Currently, we request the refs without any on-disk caching, and we repeat the network request for each file that is loaded. Instead, this change also caches the refs of the given schema.

To avoid additional HTTP overhead, cached schemas are not revalidated if they were already loaded or revalidated in the same process.
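
A minimal sketch of that second point, with hypothetical names (this is not the PR's actual code): keyed on the full URL, each ref triggers at most one HTTP request per process.

    # Illustrative sketch only; names are hypothetical, not from the PR.
    import requests

    _schema_cache: dict[str, bytes] = {}  # per-process cache keyed by full URL

    def get_schema_bytes(url: str) -> bytes:
        # Within one process, each ref URL is downloaded at most once.
        if url not in _schema_cache:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            _schema_cache[url] = response.content
        return _schema_cache[url]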

Before:

time check-jsonschema --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json
ok -- validation done
The following files were checked:
  ....
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in   31.14 secs    fish           external
   usr time   19.73 secs    5.08 millis   19.72 secs
   sys time    0.28 secs    0.01 millis    0.28 secs

After:

time python3 ~/projects/check-jsonschema/src/check_jsonschema/__main__.py --verbose --schemafile https://autoconfig.kde.org/jsonschemas/_combined.schema.json plugins/**.json
ok -- validation done
The following files were checked:
  ...
  plugins/virtualmonitor/kdeconnect_virtualmonitor.json

________________________________________________________
Executed in    2.23 secs    fish           external
   usr time    1.43 secs    4.72 millis    1.43 secs
   sys time    0.08 secs    0.09 millis    0.08 secs
sirosen commented 3 days ago

Thanks for raising this issue and coming to the door with working code in a PR!

In this particular case, I don't think I can accept this as-is, and I wanted to step back and think about what my requirements are. I've written this feature up in an issue (#452). There are three mechanical issues I see with this as-is:

1. --no-cache would not be respected for refs
2. multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)
3. we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

If you're interested in continuing to work on this, I'd really appreciate it if you took a look at that issue. Alternatively, I'm happy to pick this up and run with it, making changes to meet the requirements and adding tests. Cutting 30s runtimes to 2s is a laudable goal; we just need to make sure we don't create problematic scenarios as we do it.

alex1701c commented 3 days ago

> --no-cache would not be respected for refs

    def open(self) -> t.Iterator[t.IO[bytes]]:
        if (not self._cache_dir) or self._disable_cache:
            yield io.BytesIO(self._get_request().content)

We have a separate codepath for that.

> multiple refs with the same filename would conflict (--cache-filename works around this issue for the existing caching)

Hmm, I see. But that does not quite work when one wants to cache multiple files; one might need to somehow use the complete URL to avoid those conflicts.

I also wondered whether one might just want to use some in-memory cache; there should not be a need to re-read the file from disk.

> we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

Yep

As a side note: the current caching does not prevent requests, it only controls writing the file to disk? Some webservers set cache headers that might be leveraged.

sirosen commented 3 days ago

The CacheDownloader used inside of the ref resolver does not have disable_cache=... set on init. As a result, it uses the class default, i.e. self._disable_cache is always False at the new usage site, even if --no-cache was passed.
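
A simplified illustration of that failure mode (the class and argument names follow the discussion, but this is not the project's actual source):

    # Sketch: a keyword argument with a default silently wins when the
    # caller forgets to pass the CLI flag through.
    class CacheDownloader:
        def __init__(self, disable_cache: bool = False):
            self._disable_cache = disable_cache

    # Buggy usage site: --no-cache is never forwarded, so the class
    # default (False) always applies to ref downloads.
    ref_downloader = CacheDownloader()

    # Fixed usage site: thread the parsed CLI flag through explicitly.
    def make_ref_downloader(no_cache: bool) -> CacheDownloader:
        return CacheDownloader(disable_cache=no_cache)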

> But that does not quite work when one wants to cache multiple files; one might need to somehow use the complete URL to avoid those conflicts.

The issue in which I outlined a solution discusses this. By taking the full URI and using it to build a filename, I think we can solve this problem.
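
For instance (a sketch of one possible approach, not necessarily what #452 specifies), hashing the complete URI keeps refs with equal basenames but different hosts or paths in distinct cache entries:

    import hashlib
    from urllib.parse import urlsplit

    def cache_filename_for(uri: str) -> str:
        # Hash the full URI so equal basenames cannot collide on disk;
        # keep the basename as a human-readable suffix.
        digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
        basename = urlsplit(uri).path.rsplit("/", 1)[-1] or "schema"
        return f"{digest}-{basename}"

    # Same filename, different URIs -> distinct cache entries:
    print(cache_filename_for("https://a.example/x/schema.json"))
    print(cache_filename_for("https://b.example/y/schema.json"))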

> I also wondered whether one might just want to use some in-memory cache; there should not be a need to re-read the file from disk.

I'm slightly confused by this comment, given that this PR is about adding on-disk caching. The ref resolution mechanism provided by referencing is already holding results in memory.
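
For example, the retrieve callback handed to referencing can be memoized so that each remote ref is fetched at most once per process. A sketch assuming referencing's documented Registry(retrieve=...) API:

    import functools
    import requests
    from referencing import Registry, Resource
    from referencing.jsonschema import DRAFT202012

    @functools.lru_cache(maxsize=None)
    def retrieve(uri: str) -> Resource:
        # Memoized: repeated lookups of the same URI make one request.
        response = requests.get(uri, timeout=30)
        response.raise_for_status()
        return Resource.from_contents(
            response.json(), default_specification=DRAFT202012
        )

    registry = Registry(retrieve=retrieve)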

> As a side note: the current caching does not prevent requests, it only controls writing the file to disk? Some webservers set cache headers that might be leveraged.

This seems inaccurate to me? Or, if we are doing downloads of the schema even on a cache hit, it would mean there is a bug. The caching usage today reads the Last-Modified header and compares it against the mtime on disk. There are lots of other headers (e.g. ETag) which could be used, but this was the simple, 95%-effective solution I started with.
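
As a sketch of that scheme (illustrative, not the project's exact implementation): stream the response so the headers arrive first, and only consume the body when the cached copy is older than Last-Modified.

    import email.utils
    import pathlib
    import requests

    def download_if_stale(url: str, cached: pathlib.Path) -> bytes:
        with requests.get(url, stream=True, timeout=30) as response:
            response.raise_for_status()
            last_modified = response.headers.get("Last-Modified")
            if last_modified is not None and cached.exists():
                remote = email.utils.parsedate_to_datetime(last_modified)
                if cached.stat().st_mtime >= remote.timestamp():
                    return cached.read_bytes()  # fresh: body is never read
            data = response.content  # stale or uncached: download the body
        cached.write_bytes(data)
        return data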

Are you seeing redownloads of schemas even when there is a Last-Modified header provided? A reproducer would be key to tracking down this sort of bug.

sirosen commented 3 days ago

It hit me as soon as I posted that last comment that there is a bug, and I even know when and how it got introduced. https://github.com/python-jsonschema/check-jsonschema/issues/453 captures it: the caching got borked when I added a safety check that the remote content is parseable. I added the check incorrectly and didn't catch it during testing.

I can't work on a fix for it right now, during my workday, but I'll be circling back on it as soon as I have time.

alex1701c commented 3 days ago

Should I still work on the discussed changes here?

sirosen commented 3 days ago

I think it's best if you wait at least until I put together a fix for #453, which I will hopefully be able to start working on in a few hours. The fix will likely refactor some of the existing codepaths, and I wouldn't want you to lose extra time on the resulting merge conflicts.

Although normally I love to share authorship as much as possible, I have a lot of relevant context and some specific ideas about the requisite testing, so I think it will be most efficient for me to make that fix myself. I'll credit you in the changelog for raising the issue, however, since thanks are definitely deserved here!

alex1701c commented 17 hours ago

May I resume working on this, or do you expect other conflicting changes?