palantir / gradle-baseline

A set of Gradle plugins that configure default code quality tools for developers.
Apache License 2.0

cache key for JarClassHasher includes artifact classifier #2813

Closed bjlaub closed 2 months ago

bjlaub commented 3 months ago

Before this PR

Previously, we used the ModuleVersionIdentifier associated with a ResolvedArtifact as the key for this cache, which contains sha256 hashes of each class file within a jar. Unfortunately this is not sufficient to distinguish artifacts that share a maven coordinate but have different classifiers, whose jars may contain completely different content.

This leads to problems when two artifacts from the same maven coordinate but different classifiers are on the classpath, wherein checkClassUniqueness potentially reports incorrect results as it may rely on hashes cached from a different artifact altogether (but for the same ModuleVersionIdentifier).

As a practical example (and where this issue was discovered), the coordinate com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.19 provides both a "normal" jar artifact and one with the shaded classifier. The latter includes classes that may conflict with other dependencies on the classpath. If JarClassHasher hashes and subsequently caches the class files from the "normal" jar first, then when we encounter the "shaded" jar we erroneously use the cached values from the "normal" jar when considering classpath conflicts.
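The hashing behaviour can be illustrated with a minimal, self-contained sketch (the class and method names below are hypothetical, not the actual JarClassHasher code): two jars published under the same coordinate but different classifiers can contain an entry with the same name but different bytes, so their per-class sha256 hashes differ and must not share a cache slot.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical, simplified model of hashing each .class entry in a jar.
public class JarHashSketch {
    // Builds an in-memory jar containing a single entry.
    static byte[] jarWith(String entryName, byte[] content) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (JarOutputStream jos = new JarOutputStream(bos)) {
            jos.putNextEntry(new JarEntry(entryName));
            jos.write(content);
            jos.closeEntry();
        }
        return bos.toByteArray();
    }

    // Maps each .class entry name to the sha256 hex digest of its bytes.
    static Map<String, String> hashClasses(byte[] jarBytes) throws Exception {
        Map<String, String> hashes = new HashMap<>();
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (JarInputStream jis = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            for (JarEntry entry; (entry = jis.getNextJarEntry()) != null; ) {
                if (!entry.getName().endsWith(".class")) {
                    continue;
                }
                sha256.reset();
                byte[] buf = new byte[8192];
                for (int n; (n = jis.read(buf)) != -1; ) {
                    sha256.update(buf, 0, n);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : sha256.digest()) {
                    hex.append(String.format("%02x", b));
                }
                hashes.put(entry.getName(), hex.toString());
            }
        }
        return hashes;
    }

    public static void main(String[] args) throws Exception {
        // Same entry name, different bytes: a cache keyed only on the maven
        // coordinate would hand back whichever of these was hashed first.
        byte[] plain = jarWith("com/example/Foo.class", new byte[] {1, 2, 3});
        byte[] shaded = jarWith("com/example/Foo.class", new byte[] {4, 5, 6});
        System.out.println(hashClasses(plain).equals(hashClasses(shaded))); // false
    }
}
```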

After this PR

This change includes the artifact classifier in the cache key, to distinguish between artifacts with the same ModuleVersionIdentifier but different classifiers, which are actually different artifacts.
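The fix can be sketched as follows (hypothetical names; the real code keys on Gradle's ModuleVersionIdentifier plus the ResolvedArtifact classifier): modeling the cache key as a record that carries the classifier gives the plain and shaded artifacts distinct slots.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheKeySketch {
    // Hypothetical key: group/name/version mirror ModuleVersionIdentifier,
    // plus the classifier (null for the "normal" artifact). Records get
    // null-safe equals/hashCode for free, so this works directly as a map key.
    record HashCacheKey(String group, String name, String version, String classifier) {}

    public static void main(String[] args) {
        Map<HashCacheKey, String> cache = new HashMap<>();
        // Previously both artifacts collapsed onto one key (the coordinate
        // alone), so whichever jar was hashed first won the cache slot.
        cache.put(new HashCacheKey("com.google.cloud.bigdataoss", "gcs-connector", "hadoop3-2.2.19", null),
                "hashes-of-normal-jar");
        cache.put(new HashCacheKey("com.google.cloud.bigdataoss", "gcs-connector", "hadoop3-2.2.19", "shaded"),
                "hashes-of-shaded-jar");
        System.out.println(cache.size()); // 2: the two artifacts no longer collide
    }
}
```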

==COMMIT_MSG== cache key for JarClassHasher includes artifact classifier ==COMMIT_MSG==

Possible downsides?

This may cause checkClassUniqueness to fail for consumers that committed lockfiles under the old behavior, since the incorrect caching could have hidden real classpath conflicts.

changelog-app[bot] commented 3 months ago

Generate changelog in `changelog/@unreleased`

What do the change types mean?

- `feature`: A new feature of the service.
- `improvement`: An incremental improvement in the functionality or operation of the service.
- `fix`: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
- `break`: Has the potential to break consumers of this service's API, inclusive of both Palantir services and external consumers of the service's API (e.g. customer-written software or integrations).
- `deprecation`: Advertises the intention to remove service functionality without any change to the operation of the service itself.
- `manualTask`: Requires the possibility of manual intervention (running a script, eyeballing configuration, performing database surgery, ...) at the time of upgrade for it to succeed.
- `migration`: A fully automatic upgrade migration task with no engineer input required.

_Note: only one type should be chosen._

How are new versions calculated?

- ❗ The `break` and `manual task` changelog types will result in a major release!
- 🐛 The `fix` changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
- ✨ All others will result in a minor version release.

Type

- [ ] Feature
- [ ] Improvement
- [x] Fix
- [ ] Break
- [ ] Deprecation
- [ ] Manual task
- [ ] Migration

Description

cache key for JarClassHasher includes artifact classifier

**Check the box to generate changelog(s)**

- [x] Generate changelog entry

felixdesouza commented 3 months ago

I think this makes sense, but I haven't really looked inside this class at all.

I think a lot of the complexity of the 1-1 mapping of artifacts goes away with artifact transforms, which, as an added bonus, can be made incremental. In any case, this is a strict improvement.

Each artifact (which corresponds to a ResolvedArtifact) would be transformed into its classes, which you can then transform again to produce a file containing the hashes. Everything would operate at the unit of a ResolvedArtifact. That would be the safest way to do it.

I'm not sure how we get around the corrupted cache issue however as it's not something we can easily fix without automatically adding entries to the lock file (which is undesirable).

In a similar vein, do we not want it to also take into consideration ResolvedArtifact#extension?

bjlaub commented 3 months ago

FLUP: we should include the classifier in the conflicting maven coordinate names within the lockfile.

bjlaub commented 3 months ago

> In a similar vein, do we not want it to also take into consideration ResolvedArtifact#extension?

Maybe, but I have no evidence at the moment that we'd have e.g. two artifacts with the same ModuleVersionIdentifier but different extensions, both containing class files, that would cause a problem here. I'm inclined to address that in a separate PR if/when it comes up.

bjlaub commented 3 months ago

I've pushed a change to include the classifier in the lock state. Here's what that looks like:

[com.google.cloud.bigdataoss:gcs-connector (classifier=shaded), com.google.cloud.bigdataoss:gcsio]
  - com.google.cloud.hadoop.gcsio.authorization.AuthorizationHandler
[com.google.cloud.bigdataoss:gcs-connector (classifier=shaded), com.google.cloud.bigdataoss:util-hadoop]
  - com.google.cloud.hadoop.util.AccessTokenProvider
  - com.google.cloud.hadoop.util.AccessTokenProvider$AccessToken
  - com.google.cloud.hadoop.util.AccessTokenProvider$AccessTokenType
[com.google.cloud.bigdataoss:gcs-connector (classifier=shaded), com.google.cloud.bigdataoss:util]
  - com.google.cloud.hadoop.util.AccessBoundary
  - com.google.cloud.hadoop.util.AccessBoundary$Action
  - com.google.cloud.hadoop.util.AutoValue_AccessBoundary
[com.google.cloud.bigdataoss:gcs-connector, com.google.cloud.bigdataoss:gcs-connector (classifier=shaded)]
  - com.google.cloud.hadoop.fs.gcs.AutoValue_SyncableOutputStreamOptions
  - com.google.cloud.hadoop.fs.gcs.AutoValue_SyncableOutputStreamOptions$1
  - com.google.cloud.hadoop.fs.gcs.AutoValue_SyncableOutputStreamOptions$Builder
  - com.google.cloud.hadoop.fs.gcs.CoopLockFsck
  - com.google.cloud.hadoop.fs.gcs.CoopLockFsckRunner
  - com.google.cloud.hadoop.fs.gcs.CoopLockFsckRunner$1
  - com.google.cloud.hadoop.fs.gcs.CoopLockFsckRunner$2
  - com.google.cloud.hadoop.fs.gcs.FileSystemDescriptor
  - com.google.cloud.hadoop.fs.gcs.FsBenchmark
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$1
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$GcsFileChecksum
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$GcsFileChecksumType
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$GlobAlgorithm
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase$OutputStreamType
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemConfiguration
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream
  - com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream
  - com.google.cloud.hadoop.fs.gcs.HadoopConfigurationProperty
  - com.google.cloud.hadoop.fs.gcs.HadoopCredentialConfiguration
  - com.google.cloud.hadoop.fs.gcs.InMemoryGlobberFileSystem
  - com.google.cloud.hadoop.fs.gcs.SyncableOutputStreamOptions
  - com.google.cloud.hadoop.fs.gcs.SyncableOutputStreamOptions$Builder
  - com.google.cloud.hadoop.fs.gcs.auth.AbstractDelegationTokenBinding
  - com.google.cloud.hadoop.fs.gcs.auth.AbstractDelegationTokenBinding$TokenSecretManager
  - com.google.cloud.hadoop.fs.gcs.auth.DelegationTokenIOException
  - com.google.cloud.hadoop.fs.gcs.auth.GcsDelegationTokens
  - com.google.cloud.hadoop.fs.gcs.auth.GcsDtFetcher

(The above was taken from a project that included both the "normal" and shaded-classifier artifacts for gcs-connector; this conflict was not being caught by the task before the most recent change.)

bjlaub commented 3 months ago

I'm confident that this is an important fix for correctness reasons (we are currently not accurately capturing class name conflicts in some situations), but we might want to pause rolling it out until we can take stock of how it will affect some internal projects that I know currently have broken lockfiles, and figure out a strategy for managing that first.

CRogers commented 2 months ago

This looks fine but I'd add a test that checks this behaviour so we don't regress in the future.

bjlaub commented 2 months ago

> This looks fine but I'd add a test that checks this behaviour so we don't regress in the future.

@CRogers added a test here: https://github.com/palantir/gradle-baseline/pull/2813/commits/849b75a929813f7b7c1f97e906d57bedd60fd721

CRogers commented 2 months ago

:+1:

svc-autorelease commented 2 months ago

Released 5.59.0