spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions
Apache License 2.0
32 stars 33 forks source link

Enhancement: caching version of org.spdx.storage.listedlicense.SpdxListedLicenseWebStore and associated classes #193

Closed pmonks closed 9 months ago

pmonks commented 9 months ago

I'm using org.spdx.storage.listedlicense.SpdxListedLicenseWebStore as a way to access all of the listed license and listed exception information (including templates etc.), and am finding it quite slow to download everything each time it runs - over a minute on my laptop. This is obviously a problem for short-lived processes (such as unit tests) that use SPDX listed license and exception information.

In profiling the code I see there's an opportunity for a substantial performance saving by caching all of the files the SpdxListedLicenseWebStore and other supporting classes download locally, and subsequently reading them from the local cache (if present), instead of re-downloading them each time.

Proposed Solution

Implement a new opt-in version of org.spdx.storage.listedlicense.SpdxListedLicenseWebStore that utilises a local cache. This would involve:

  1. When files are downloaded, cache them locally on disk*, along with ETag metadata for the license and exception list JSON files.
  2. On subsequent invocations, check if the cache exists, and if so read the ETag metadata and make conditional If-None-Match HTTP requests for the license and exception list JSON files, using the ETags from the metadata files saved in step 1. Basically we use these two JSON files as a proxy for whether any of the online data has changed and therefore the local cache is stale.
  3. If either status code is 200, wipe the cache, and rebuild it (including ETag metadata files) as in step 1.
  4. If both status codes are 304, read the files from cache instead of downloading them.
Footnotes

* the XDG Base Directory Specification should be followed here, so the cache directory might be determined using code such as:

final String cacheDir = (System.getenv("XDG_CACHE_HOME") == null ? System.getProperty("user.home") + "/.cache" : System.getenv("XDG_CACHE_HOME")) + "/Spdx-Java-Library";
goneall commented 9 months ago

Thanks @pmonks for the suggested solution. I've also run into this performance problem and would welcome a solution.

If we could make the opt-in version a superclass of the existing webstore and just add caching it would be a pretty straight-forward implementation - we would need to investigate a bit more to see if this approach would work.

Once implemented, we could add an option or environment variable for tools-java to use the cache - it would speed up the command line utility quite a bit.

pmonks commented 9 months ago

we would need to investigate a bit more to see if this approach would work.

FWIW I implemented exactly this kind of caching before I discovered Spdx-Java-Library, and it did result in a pretty significant speed up. And that was despite employing some optimised download tricks (like using svn to get all of the license template files from GitHub - that's faster than using one-at-a-time HTTP requests). That said, I was actually caching downloaded-and-parsed JSON files, not just the raw downloaded JSON files themselves (so saving both download cost and parse cost for each cache hit) - in my testing parse cost was low compared to download cost, but then I was using completely different libraries to Spdx-Java-Library for parsing so I'm not sure if there's much to be learned from that.

Am I right in thinking that this enhancement might be possible solely via internal implementation changes within the org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream() method? It seems like all of the various file downloads ultimately get funneled through that method, so that might be the right place to add this kind of functionality.

goneall commented 9 months ago

Am I right in thinking that this enhancement might be possible solely via internal implementation changes within the org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream() method?

I think that should work. I do want to make this optional with the default being uncached - there are some scenerios where having a cache could introduce security vulnerabilities.

pmonks commented 9 months ago

Oh yeah I agree with making it optional - was more just trying to understand the effort involved in implementing this. If it's just internal implementation of a single method I can probably give it a go, despite my very rusty Java.

goneall commented 9 months ago

@pmonks - I think your implementation approach should work, so feel free to create a PR.

We should also add documentation to the README on the caching and how to enable / disable it.

pmonks commented 9 months ago

So before I launch into this, I wanted to discuss some micro-design questions with you @goneall:

  1. Are you ok if a Java property is used to enable/disable this cache? That's easier for downstream code to control than, say, an OS-level environment variable, while still being easy to adjust by a human operator at JVM startup time (via the -D command line argument provided by the java executable).
  2. Any preferences on a name for that Java property? Perhaps something like org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache that's a boolean value (and defaults to false if unset)?
  3. In order to preserve the signature of org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream(), cached objects will need to be stored with a "key" that is a 1:1 mapping of the URL parameter that's passed in. It can't be the URL itself however, as that may contain characters that are not allowed in filenames. String hashcodes are also insufficient for this (Java's implementation of String hashing is quite "collisiony"), so my thought would be a BASE64 representation of the URL. For example, a request to https://spdx.org/licenses/licenses.json would create a cached file on disk called aHR0cHM6Ly9zcGR4Lm9yZy9saWNlbnNlcy9saWNlbnNlcy5qc29u. Thoughts?
  4. To support ETag requests, there'll also need to be a separate "metadata" file alongside the content file itself. Are you ok if this file is in JSON format, and has the same name as the cached content file, but with the additional suffix .metadata.json? So for example when https://spdx.org/licenses/licenses.json is cached, there would also be a aHR0cHM6Ly9zcGR4Lm9yZy9saWNlbnNlcy9saWNlbnNlcy5qc29u.metadata.json file in the same directory, containing the ETag value, the date/time of the download, and the source URI (these latter two data elements are optional, but in my experience are very useful in troubleshooting cache bugs).
  5. Any thoughts on caching the results of parsing these downloaded files too, not just the raw downloaded content? I have no idea if the object graph created out of the downloaded JSON files is Serializable or not, but if it is, that might be an additional performance win. We could also punt this to a later PR and see how much improvement we get just from caching downloads alone (I expect it will be quite substantial). I suspect this is a more intrusive change than just caching downloads, from my rudimentary knowledge of the code paths, so I would probably lean towards punting.
pmonks commented 9 months ago

@goneall I went ahead with a first cut at this, based on my thoughts on each of the above 5 points - you can find it here. Would love any/all comments you might like to share!

You can test this branch out with the command:

$ mvn -Dorg.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache=true test -Ptest
goneall commented 9 months ago

@pmonks - sorry for the late reply - really busy day yesterday.

I'll take a look at the code later today - but I do want to point you to some existing code to handle properties you can leverage: https://github.com/spdx/Spdx-Java-Library/blob/ae14e246866472e7a356ba934f28efcda61b5226/src/main/java/org/spdx/library/model/license/ListedLicenses.java#L77

It solves a similar problem where we have a property to force using a local copy of the licenses - it was requested by someone who wanted to use the library totally offline (air gap security).

I forgot about that implementation until you raised the questions above.

pmonks commented 9 months ago

sorry for the late reply - really busy day yesterday.

@goneall absolutely no problem at all! None of this work is urgent on my end, so please don't feel compelled to even reply until you have an appropriate time to look at this!

It solves a similar problem where we have a property to force using a local copy of the licenses - it was requested by someone who wanted to use the library totally offline (air gap security).

I might be misunderstanding, but that sounds like a different use case to mine. I do want to use the online versions of the various files (rather than the ones baked into the library's JAR file), but I want them cached locally on disk so that the download cost is only incurred once per SPDX license list version (rather than once per JVM invocation, as is currently the case).

Separately (and unrelated to this branch), I've noticed that some URLs are downloaded repeatedly during the unit tests. Does that suggest there's a bug in the in-memory storage of this downloaded content (i.e. something doesn't realise that that content is already downloaded, and therefore downloads it again)? It seems to mostly (only?) be the JSON files for individual licenses and exception (e.g. https://spdx.org/licenses/Apache-2.0.json is downloaded several times, redundantly). While this new caching logic mitigates that bug, I'm pretty certain there would be a benefit to fixing it in the non-caching code paths too. Perhaps I should raise this in a separate issue?

goneall commented 9 months ago

I might be misunderstanding, but that sounds like a different use case to mine.

Agree - different use case, but I think you can use the same infrastructure for properties since they both involve how we store and access the license information - e.g., add an additional property to the property file.

Separately (and unrelated to this branch), I've noticed that some URLs are downloaded repeatedly during the unit tests.

I don't think the individual JSON files are cached, so it may re-fetch every time. I'm not sure if I could call it a bug - but it certainly could be optimized.

pmonks commented 9 months ago

Ah ok - I'll check that code out. That said, property files are difficult to manipulate from downstream code that wants to programmatically set these behaviours (rather than relying on a human operator to do so), so that rings little alarm bells for me.

A more typical way to do this is to use Java properties, or (better yet) provide "init" type methods (or even constructor args) that accept these options - those are also a lot more amenable for use in DI frameworks (such as Spring).

pmonks commented 9 months ago

Latest update adds a "Configuration options" section to the readme, and also enables control over how often each cache entry gets rechecked for staleness (via an ETag request).

goneall commented 9 months ago

@pmonks Thanks for pulling together the caching implementation.

I took a look at the proposed changes and the overall approach looks good.

I have a couple of suggestions:

I'm not sure if we need to add any locks around the caching - the class should support multi-threading and we are accessing some share metadata files, but I can't think of any scenarios where we would have a damaging race condition.

If we want to be safe and add locking, we can use the readLock and writeLock from the superclass SpdxListedLicenseModelStore.

goneall commented 9 months ago

BTW - I looked into the previously mentioned properties file implementation for the offline license model store. It turns out that it does accept system property SPDXParser.OnlyUseLocalLicenses which takes precedence over the property file. Here's the code that implements the system property: https://github.com/spdx/Spdx-Java-Library/blob/ae14e246866472e7a356ba934f28efcda61b5226/src/main/java/org/spdx/library/model/license/ListedLicenses.java#L65C3-L65C3

It's been a while, but I think the reason we used both approach is to allow someone to make a change in the local properties file and not have to always specify a system property when running.

I should probably document this in the README similar to how you documented the caching properties.

pmonks commented 9 months ago

If the XDG_CACHE property isn't there, have the default directory be more specific to SPDX to avoid possible conflicts with other files (e.g. change System.getProperty("user.home") + "/.cache" to System.getProperty("user.home") + "/.spdx-license-cache"

The behaviour I've implemented is faithful to the XDG Base Directory Specification, which says that if the XDG_CACHE_HOME environment variable is not set, applications should default to ${HOME}/.cache/<appropriate subdirectory name specific to that app>. I personally would rather we adhered to that specification (all of it), rather than surprise users by doing something bespoke. Wdyt?

If there is any failure in setting up the caches, log a warning and revert to the non-cached - e.g. in the exception handler for creating the directories, set cacheEnabled to false and log the warning)

✅ - I'll make that change now.

We can shorten the property name from org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache to org.spdx.storage.listedlicense.enableCache

✅ - I'll make that change now.

Also, interesting trivia from my testing this morning: enabling the cache avoids downloading 613 files totaling ~20MB on each use of the library, which (on my laptop) takes the initialisation time from ~4 minutes down to ~1 second.

pmonks commented 9 months ago

@goneall the second and third changes are ready for review.

pmonks commented 9 months ago

BTW - I looked into the previously mentioned properties file implementation for the offline license model store. It turns out that it does accept system property SPDXParser.OnlyUseLocalLicenses which takes precedence over the property file.

Ok cool - I'll go look into that now and try to replicate it. As long as downstream code can programmatically modify the behaviour, I'm good with whatever approach you decide we should go with.

It's been a while, but I think the reason we used both approach is to allow someone to make a change in the local properties file and not have to always specify a system property when running.

Yeah makes sense. Putting this in constructor args / setters and then punting the specifics of how those methods get called right out of the library is probably ideal going forward - that's the most DI-friendly approach, and lets DI users leverage all of the weird and wonderful mechanisms DI frameworks support for setting these kinds of configuration options (which go well beyond just properties files and Java system properties). That's probably a longer term / bigger refactor type of activity though, so I'd probably prefer not to tackle it as part of this cache work.

I should probably document this in the README similar to how you documented the caching properties.

I was wondering whether the GitHub project might benefit from a wiki for this kind of thing? While the README is great, it might become unwieldy as we start adding this kind of information, and a wiki lets us start to structure the information into separate categories / pages (e.g. "usage guide", "configuration guide", etc.).

goneall commented 9 months ago

The behaviour I've implemented is faithful to the XDG Base Directory Specification, which says that if the XDG_CACHE_HOME environment variable is not set, applications should default to ${HOME}/.cache/. I personally would rather we adhered to that specification (all of it), rather than surprise users by doing something bespoke. Wdyt?

Actually - the code there is fine. I missed the enclosing parenthesis around the conditional - I originally thought it was just defaulting to ~/.cache but it is actually defaulting to ~/.cache/Spdx-Java-Library.

I did notice one new issue when looking at this.

Your should change the hard coded file separators to system dependent separators - you can use Java Path or replace the / with File.separator.

pmonks commented 9 months ago

Your should change the hard coded file separators to system dependent separators - you can use Java Path or replace the / with File.separator.

Good catch, and now fixed! It's clearly been too long since I had to use Windows. ;-)

pmonks commented 9 months ago

@goneall I've taken a look at how the existing properties configuration logic functions, and think it will become pretty messy to extend it in place to support the WebStore. Instead, I think we should create a separate, dedicated configuration management class that any other classes (org.spdx.library.model.license.ListedLicenses, org.spdx.storage.listedlicense.SpdxListedLicenseWebStore, and anything else that might need configuration in the future) can interrogate to get their configuration. I'd propose org.spdx.Configuration.java for this - thoughts?

goneall commented 9 months ago

Completely agree with refactoring out the config into a separate class. Your proposed naming sounds good.

pmonks commented 9 months ago

Ok I think that change is now done too. Let me know if you have any feedback on it! I also merged in the latest changes to master, to make this easier to merge once I create a PR for it.

goneall commented 9 months ago

I'm thinking there may be a cleaner way to handle the confusing different prefixes on the system properties.

In the properties file, we could just use a simple unqualified string (e.g. OnlyUseLocalLicenses) which would be used in the properties file and add a parameter prefix which would be prepended just for the system property only - that way the properties file would be more concise and we would just use different prefixes for the two different calls.

Let me know what you think.

BTW - the old prefix SPDXParser was the name of the java package years ago - we just kept it in the property name for compatibility. It would be nice to change it to match the current code, but I'm afraid it may break some implementations.

pmonks commented 9 months ago

My sense is that having "smart" logic like that may actually be confusing for an implementer who's looking at someone's configuration i.e. it wouldn't be obvious that OnlyUseLocalLicenses in the properties file is the same as SPDXParser.OnlyUseLocalLicenses (or ideally eventually org.spdx.storage.listedlicense.OnlyUseLocalLicenses) specified on the command line.

Then there's also the possibility of possible conflicts in the file itself. If some new class elsewhere in the library also has a configuration property called OnlyUseLocalLicenses, what would we do then? At least requiring them to be fully qualified in the properties file provides a way to namespace them.

And then there's the bigger question of potentially ditching all of this anyway in favour of a more DI-compatible approach (which wouldn't use system properties or property files at all - that gets externalised out of the library and becomes the sole responsibility of the DI framework). IOW how much do we want to invest in code that might eventually go away anyway?

tl;dr - while I'm ok with keeping that one property special cased (to preserve backwards compatibility), I think going forward we should discourage that kind of thing

goneall commented 9 months ago

All good points - and I agree. If you can create a PR, I'll take one more review before merging.

pmonks commented 9 months ago

@goneall ok PR created, after some more refactoring and a lot of polishing of JavaDocs. 😉

Probably the biggest change since you last looked was that I took the liberty of refactoring out the entire DownloadCache into its own class (a bit like the Configuration class) - this makes both the cache logic and the WebStore logic a lot more self-explanatory, and also opens up the possibility of using the DownloadCache for any other classes elsewhere in the library that might retrieve content from URLs (though there aren't any other examples of that at this point, as best I can tell).

pmonks commented 9 months ago

This was fixed in PRs #205 and #206.