alex1701c opened this 4 days ago
Thanks for raising this issue and coming to the door with working code in a PR!
In this particular case, I don't think I can accept this as-is and wanted to step back and think about what my requirements are. I've written this feature up in an issue (#452). There are three mechanical issues I see with this as-is:

- `--no-cache` would not be respected for refs
- multiple refs with the same filename would conflict (`--cache-filename` works around this issue for the existing caching)
- we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

If you're interested in continuing to work on this, I'd really appreciate it if you took a look at that issue. Alternatively, I'm happy to pick this up and run with it, making changes to meet the requirements and adding tests. Cutting 30s runtimes to 2s is a laudable goal; we just need to make sure we don't create problematic scenarios as we do it.
> `--no-cache` would not be respected for refs

```python
def open(self) -> t.Iterator[t.IO[bytes]]:
    if (not self._cache_dir) or self._disable_cache:
        yield io.BytesIO(self._get_request().content)
```

We have a separate codepath for that.
> multiple refs with the same filename would conflict (`--cache-filename` works around this issue for the existing caching)

Hmm, I see. But that does not quite work when one wants to cache multiple files; one might need to somehow use the complete URL to avoid those conflicts.

I also wondered if one might just want to use some in-memory cache, so there should be no need to re-read the file from disk.
> we need to add tests for this feature, including testing for scenarios like two refs with the same filename but different URIs

Yep.
As a side note: the current caching does not prevent requests, it is only for writing the file to disk? Some webservers set cache headers that might be leveraged.
The `CacheDownloader` used inside of the ref resolver does not have `disable_cache=...` set on init. As a result, it uses the class default, i.e. `self._disable_cache` is always `False` in the new usage site, even if `--no-cache` was passed.
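To make the failure mode concrete, here is a minimal sketch (with hypothetical names, not the project's actual classes) of how a class-level default silently wins when a call site forgets to forward the flag:

```python
# Hypothetical downloader: the cache toggle defaults at init, so any call
# site that omits it gets caching enabled regardless of the CLI flag.
class DownloaderSketch:
    def __init__(self, url: str, disable_cache: bool = False):
        self.url = url
        self._disable_cache = disable_cache


# Existing codepath: the CLI flag is forwarded explicitly.
cli_no_cache = True  # as if --no-cache was passed
main_dl = DownloaderSketch("https://example.com/schema.json", disable_cache=cli_no_cache)

# New ref-resolver codepath: the flag is not forwarded, so the default wins.
ref_dl = DownloaderSketch("https://example.com/ref.json")

print(main_dl._disable_cache)  # True
print(ref_dl._disable_cache)   # False: still caching despite --no-cache
```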
> But that does not quite work when one wants to cache multiple files; one might need to somehow use the complete URL to avoid those conflicts

The issue in which I outlined a solution discusses this. By taking the full URI and using it to build a filename, I think we can solve this problem.
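One plausible shape for that (an illustrative sketch, not the actual design from the issue) is to hash the full URI and keep the basename for readability:

```python
import hashlib
from urllib.parse import urlparse

def cache_filename_for(uri: str) -> str:
    """Derive a collision-free cache filename from the complete ref URI:
    a short hash of the URI plus the original basename."""
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
    basename = urlparse(uri).path.rsplit("/", 1)[-1] or "schema.json"
    return f"{digest}-{basename}"

# Two refs with the same basename but different URIs no longer collide:
a = cache_filename_for("https://example.com/a/defs.json")
b = cache_filename_for("https://example.com/b/defs.json")
print(a != b)  # True
```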
> I also wondered if one might just want to use some in-memory cache. Like there should not be a need to re-read the file from disk.

I'm slightly confused by this comment, given that this PR is about adding on-disk caching. The ref resolution mechanism provided by `referencing` is already holding results in memory.
> As a sidenote: the current caching does not prevent requests, it is only for writing the file on disk? Like some webservers set cache headers that might be leveraged.
This seems inaccurate to me? Or, if we are doing downloads of the schema even on a cache hit, it would mean there is a bug. The caching usage today reads the `Last-Modified` header and compares it against the mtime on disk.
There are lots of other headers (e.g. ETags) which could be used, but this was the simple, 95%-effective solution I started with.
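Roughly, that freshness check looks like this (illustrative names, not the project's exact code):

```python
import os
from email.utils import parsedate_to_datetime

def cache_is_fresh(cache_path: str, last_modified_header: str) -> bool:
    """Treat the cached file as fresh if its mtime is at least as new as
    the remote's Last-Modified timestamp; only redownload otherwise."""
    if not os.path.exists(cache_path):
        return False
    remote_ts = parsedate_to_datetime(last_modified_header).timestamp()
    return os.path.getmtime(cache_path) >= remote_ts
```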
Are you seeing redownloads of schemas even when a `Last-Modified` header is provided? A reproducer would be key to tracking down this sort of bug.
It hit me as soon as I posted that last comment that there is a bug, and I even know when and how it got introduced. https://github.com/python-jsonschema/check-jsonschema/issues/453 captures it: the caching got borked when I added a safety check that the remote content is parseable -- I added it incorrectly and didn't catch it during testing.
I can't work on a fix for it right now, during my workday, but I'll be circling back on it as soon as I have time.
Should I still work on the discussed changes here?
I think it's best if you wait at least until I put together a fix for #453, which I will hopefully be able to start working on in a few hours. The fix will likely refactor some of the existing codepaths, and I wouldn't want you to lose extra time on the resulting merge conflicts.
Although normally I love to share authorship as much as possible, I have a lot of relevant context and some specific ideas about the requisite testing, so I think it will be most efficient for me to make that fix myself. I'll credit you in the changelog for raising the issue, however, since thanks are definitely deserved here!
May I resume working on this, or do you expect other conflicting changes?
Currently, we request the refs without any on-disk caching and make a network request for each file that is loaded. This change also caches the refs of the given schema.

To avoid additional HTTP overhead, cached schemas are not revalidated if they were already loaded or revalidated in the same process.
Before:
After:
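The "revalidate at most once per process" idea above can be sketched like this (a hedged illustration with invented names, not this PR's actual code): a process-wide set remembers which URIs were already checked, so repeated loads of the same ref skip the network entirely.

```python
# Process-wide memo: URIs already fetched/revalidated in this run.
_checked: set[str] = set()

def get_cached(uri: str, cache: dict[str, bytes], fetch) -> bytes:
    """Fetch (or revalidate) each URI at most once per process; afterwards,
    serve from the in-memory cache without any HTTP round-trip."""
    if uri not in _checked:
        cache[uri] = fetch(uri)  # network hit happens only here
        _checked.add(uri)
    return cache[uri]

# Demo with a counting stub instead of a real HTTP fetch:
calls: list[str] = []
def fake_fetch(u: str) -> bytes:
    calls.append(u)
    return b"{}"

store: dict[str, bytes] = {}
get_cached("https://example.com/defs.json", store, fake_fetch)
get_cached("https://example.com/defs.json", store, fake_fetch)
print(len(calls))  # 1
```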