sbt / io

IO module for sbt
Apache License 2.0
41 stars 47 forks source link

`Hash.apply` provides the same hash for all directories #339

Open Baccata opened 2 years ago

Baccata commented 2 years ago

See here.

When a file is a directory, you get a FileNotFoundException when instantiating a FileInputStream with it. This makes it harder to write SBT tasks that need to invalidate some cache upon a file being added/removed from a directory.

eed3si9n commented 2 years ago

Not sure if I have the full context on what situation would require directory hash, but as a reference Bazel also behaves in a similar way wherein if you use directory as input or output you get a warning saying "we're not gonna invalidate correctly for directory" or something along the lines.

See also https://www.scala-sbt.org/1.x/docs/Caching.html#Tracking+file+attributes and other example regarding caching.

Baccata commented 1 year ago

Sorry for the late reply.

Not sure if I have the full context on what situation would require directory hash.

There's various reasons why you'd want it. Smithy4s, a library I maintain, abstracts the business logic away from SBT, as we also provide a CLI and a Mill plugin. Therefore, we essentially have a CodegenArgs => Seq[GeneratedFiles] function that build tools can delegate to.

Now the thing is that we want to use whole CodegenArgs datatype as a a cache key, because a change in any of the data it contains may impact the code rendering, and therefore any change should invalidate the cache. This CodegenArgs datatype contains a bunch of things, including reference to files that may be indiscriminately files or folders. Because it doesn't JUST contain files, FileFunction doesn't cut it for us.

We naively thought files via the functions provided by SBT would hash folders, like it happens in Mill's PathRef, turns out it wasn't the case and we had to take action, after a couple hours of debugging what was very puzzling behaviour.

Now whether Mill's way of hashing folders is actually desirable in SBT is not my place to say, but silently accepting folders and recovering by hashing an empty string is definitely not great, as it very much gives a false-positive behaviour. I think throwing an exception would be better, as at least the implementor could be made aware (via testing) of the problem, and take action. Logging a warning would be also acceptable, though easier to miss.