jscancella / bagging

A clean and easy to use implementation of the BagIt specification

Hashing implementation is not thread-safe #52

Closed pwinckles closed 3 years ago

pwinckles commented 3 years ago

Bagging uses an enum to represent the hashers that compute the digests of files added to bags. The problem is that, because it is an enum, the same hasher instance is shared globally, which means it will not compute correct digests if multiple bags are created concurrently.

The following code demonstrates the problem:

    // Requires java.nio.file.*, java.util.List, java.util.concurrent.*,
    // and bagging's Bag and BagBuilder imports.
    @Test
    public void concurrentTest() throws IOException {
        // Two bag directories that will be written at the same time.
        var bags = List.of(
                Files.createDirectories(Paths.get("/var/tmp/bag-1")),
                Files.createDirectories(Paths.get("/var/tmp/bag-2"))
        );
        var file = Files.writeString(Paths.get("/var/tmp/test.txt"), "a".repeat(1_000));

        var executor = Executors.newFixedThreadPool(bags.size());
        // One party per worker plus one for the test thread.
        var phaser = new Phaser(bags.size() + 1);

        bags.forEach(bagDir -> {
            executor.execute(() -> {
                // Block until all workers are ready so the writes overlap.
                phaser.arriveAndAwaitAdvance();
                try {
                    new BagBuilder()
                            .addAlgorithm("sha256")
                            .bagLocation(bagDir)
                            .addPayloadFile(file)
                            .write();
                    phaser.arrive();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        });

        // Release the workers, then wait for both bags to be written.
        phaser.arriveAndAwaitAdvance();
        phaser.arriveAndAwaitAdvance();
        executor.shutdown();

        // Validation is expected to fail here: the interleaved update()
        // calls on the shared hasher produce manifests with incorrect digests.
        bags.forEach(bagDir -> {
            try {
                Bag.read(bagDir).justValidate();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
jscancella commented 3 years ago

Most of the time it doesn't matter: you don't get a performance increase from multi-threading because you are limited by disk I/O. Are you reading files from multiple disks?

pwinckles commented 3 years ago

It does matter, because it means an application cannot construct multiple different bags concurrently: all of the bytes are funneled through a single MessageDigest instance.
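
The core issue can be shown without bagging at all. Here is a minimal sketch of what happens when two threads share one MessageDigest, as the enum-based hashers do (SharedDigestDemo and the byte counts are illustrative, not library code):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Two threads feed different inputs into one shared MessageDigest.
    // Their update() calls interleave, so the final digest will almost
    // certainly match neither input's correct digest.
    public class SharedDigestDemo {
        public static void main(String[] args) throws Exception {
            MessageDigest shared = MessageDigest.getInstance("SHA-256");

            Runnable hashAs = () -> {
                for (int i = 0; i < 100_000; i++) {
                    shared.update((byte) 'a');
                }
            };
            Runnable hashBs = () -> {
                for (int i = 0; i < 100_000; i++) {
                    shared.update((byte) 'b');
                }
            };

            Thread t1 = new Thread(hashAs);
            Thread t2 = new Thread(hashBs);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            // Matches neither SHA-256 of 100,000 'a's nor of 100,000 'b's.
            System.out.printf("%064x%n", new BigInteger(1, shared.digest()));
        }
    }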

pwinckles commented 3 years ago

To be clear, this is not a performance consideration at all. As it stands, the library is not safe for creating or validating bags in a multi-threaded application, short of putting a synchronized wrapper around it.
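
For reference, such a wrapper would have to look something like the hypothetical sketch below (SynchronizedBagOps, LOCK, writeBag, and validate are illustrative names, not library API), and it only achieves correctness by giving up concurrency entirely:

    import java.nio.file.Path;
    // plus bagging's Bag and BagBuilder imports

    // Hypothetical workaround, not part of the library: funnel every bag
    // operation through a single lock so the shared hasher is never used
    // by two threads at once.
    public final class SynchronizedBagOps {
        private static final Object LOCK = new Object();

        public static void writeBag(Path bagDir, Path payload) throws Exception {
            synchronized (LOCK) {
                new BagBuilder()
                        .addAlgorithm("sha256")
                        .bagLocation(bagDir)
                        .addPayloadFile(payload)
                        .write();
            }
        }

        public static void validate(Path bagDir) throws Exception {
            synchronized (LOCK) {
                Bag.read(bagDir).justValidate();
            }
        }
    }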

I would be happy to submit a PR to address the problem, if you'd like.

jscancella commented 3 years ago

Pull requests are welcome, but they will not be merged in until they meet my quality requirements.

pwinckles commented 3 years ago

In my opinion, the following changes should be made:

  1. A new interface, HasherFactory, should be introduced with a single method, createHasher().
  2. The map in BagitChecksumNameMapping should be changed to hold HasherFactory instances.
  3. A new DefaultHasher class should be introduced that implements Hasher and contains the hashing logic currently in StandardHasher.
  4. The StandardHasher enum should be changed to implement HasherFactory and return new DefaultHasher instances.
  5. When BagitChecksumNameMapping.get() is called, the HasherFactory should be retrieved from the map, createHasher() called on it, and the result returned.

This will still allow a hasher to be reused the way it is now, but every caller of BagitChecksumNameMapping.get() will get its own instance, so there will be no concurrency issues. A sketch of this shape follows.
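
Here is a minimal sketch, using the names from the list above (the empty Hasher stands in for bagging's existing interface, and all signatures are assumptions rather than the library's actual API):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Stand-in for bagging's existing Hasher interface.
    interface Hasher { }

    // 1. Factory interface so every caller can obtain a fresh hasher.
    interface HasherFactory {
        Hasher createHasher();
    }

    // 3. Plain class holding the per-instance state (the MessageDigest)
    //    that currently lives in the StandardHasher enum.
    final class DefaultHasher implements Hasher {
        private final MessageDigest messageDigest;

        DefaultHasher(String javaAlgorithmName) {
            try {
                this.messageDigest = MessageDigest.getInstance(javaAlgorithmName);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException(e);
            }
        }
        // ... hashing logic moved here from StandardHasher ...
    }

    // 4. The enum remains the registry of known algorithms but now
    //    implements HasherFactory and hands out new instances.
    enum StandardHasher implements HasherFactory {
        SHA256("SHA-256");

        private final String javaAlgorithmName;

        StandardHasher(String javaAlgorithmName) {
            this.javaAlgorithmName = javaAlgorithmName;
        }

        @Override
        public Hasher createHasher() {
            // New DefaultHasher per call: no shared MessageDigest.
            return new DefaultHasher(javaAlgorithmName);
        }
    }

    // 2. and 5. BagitChecksumNameMapping would map algorithm names to
    // HasherFactory instances, and get() would call createHasher() on
    // the factory and return the result.

This keeps the enum immutable, since it only carries algorithm names, while moving the mutable digest state into per-caller instances.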

jscancella commented 3 years ago

I would also look at whether it makes more sense to stop having the hashers be singletons. The main idea was to keep them immutable, but it looks like that is causing more trouble than it is worth.

jscancella commented 3 years ago

Should be fixed in version 4.4