haniffalab / webatlas-pipeline

A data pipeline built in Nextflow to process spatial and single-cell experiment data for visualisation in WebAtlas
MIT License
25 stars · 4 forks

Bump bioformats2raw version from 0.2 to 0.4 #13

Closed · BioinfoTongLI closed this 1 year ago

BioinfoTongLI commented 2 years ago

Bump the TIFF-to-Zarr conversion (bioformats2raw) from 0.2 to the latest stable version, 0.4.0. Currently using this image: https://hub.docker.com/layers/bioformats2raw/openmicroscopy/bioformats2raw/0.4.0/images/sha256-29e650dca4610898d2c5d7639c350f172d3f4d0d0aea7078454b76e10245b0c7?context=explore

Vitessce works with this version as well.

Though, the conversion is currently only done locally. Use this option to write directly to S3: https://github.com/glencoesoftware/bioformats2raw/pull/89

BioinfoTongLI commented 2 years ago

Writing directly to S3 leads to a very long error log, which does not happen when saving locally. Here's the end of the error log:


2022-05-26 07:40:08,940 [pool-1-thread-1] ERROR c.g.bioformats2raw.Converter - Failure processing chunk; resolution=0 plane=1 xx=16384 yy=16384 zz=0 width=304 height=736 depth=1
java.lang.NullPointerException: null
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2022-05-26 07:40:09,207 [pool-1-thread-1] ERROR c.g.bioformats2raw.Converter - Failure processing chunk; resolution=0 plane=2 xx=16384 yy=16384 zz=0 width=304 height=736 depth=1
java.lang.NullPointerException: null
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2022-05-26 07:40:09,207 [main] ERROR c.g.bioformats2raw.Converter - Error while writing series 0
java.util.concurrent.CompletionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1298)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1321)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.allOf(CompletableFuture.java:2238)
    at com.glencoesoftware.bioformats2raw.Converter.saveResolutions(Converter.java:1314)
    at com.glencoesoftware.bioformats2raw.Converter.write(Converter.java:691)
    at com.glencoesoftware.bioformats2raw.Converter.convert(Converter.java:646)
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:477)
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:92)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at picocli.CommandLine.call(CommandLine.java:2761)
    at com.glencoesoftware.bioformats2raw.Converter.main(Converter.java:1808)
Caused by: java.lang.NullPointerException: null
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.glencoesoftware.bioformats2raw.Converter@6b67034): java.lang.NullPointerException
    at picocli.CommandLine.executeUserObject(CommandLine.java:1962)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at picocli.CommandLine.call(CommandLine.java:2761)
    at com.glencoesoftware.bioformats2raw.Converter.main(Converter.java:1808)
Caused by: java.lang.NullPointerException
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
joshmoore commented 2 years ago

@BioinfoTongLI: can you include the command you used? (i.e. is --endpoint-url involved?)

BioinfoTongLI commented 2 years ago

Not using --endpoint-url. The $image is a local image, and the conversion was correct when not writing to S3:

/opt/bioformats2raw/bin/bioformats2raw --output-options s3fs_path_style_access=true ${image} s3://${accessKey}:${secretKey}@webatlas.cog.sanger.ac.uk/deleteme/

joshmoore commented 2 years ago

If you were accessing this via aws, I think this would be:

aws --endpoint-url https://cog.sanger.ac.uk s3://webatlas/...

with webatlas being the bucket. Putting the bucket at the front of the endpoint ("webatlas.cog.sanger.ac.uk") is virtual-hosted-style access, as opposed to path-style:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access

So perhaps try setting s3fs_path_style_access=false (or just omitting it).
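The difference between the two addressing styles is easiest to see by constructing both URL forms. A minimal sketch, using the bucket and endpoint names from the example above purely for illustration:

```shell
# Path-style vs virtual-hosted-style S3 addressing.
# Bucket/endpoint names are the ones from the discussion, used only as examples.
BUCKET="webatlas"
ENDPOINT="cog.sanger.ac.uk"

# Path-style: the bucket is the first path segment after the endpoint host.
echo "path-style:           https://${ENDPOINT}/${BUCKET}/object-key"

# Virtual-hosted-style: the bucket is a subdomain of the endpoint host.
echo "virtual-hosted-style: https://${BUCKET}.${ENDPOINT}/object-key"
```

This is why `s3fs_path_style_access=true` and a bucket-prefixed hostname contradict each other: the same bucket name ends up encoded in two different places.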

BioinfoTongLI commented 2 years ago

Still seeing the same null pointer error.


  2022-05-27 08:36:54,083 [main] ERROR c.g.bioformats2raw.Converter - Error while writing series 0
  java.util.concurrent.CompletionException: java.lang.NullPointerException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1298)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1321)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.andTree(CompletableFuture.java:1317)
    at java.util.concurrent.CompletableFuture.allOf(CompletableFuture.java:2238)
    at com.glencoesoftware.bioformats2raw.Converter.saveResolutions(Converter.java:1314)
    at com.glencoesoftware.bioformats2raw.Converter.write(Converter.java:691)
    at com.glencoesoftware.bioformats2raw.Converter.convert(Converter.java:646)
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:477)
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:92)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at picocli.CommandLine.call(CommandLine.java:2761)
    at com.glencoesoftware.bioformats2raw.Converter.main(Converter.java:1808)
  Caused by: java.lang.NullPointerException: null
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Command error:
  OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp921420814146520776/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
  It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
  Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.glencoesoftware.bioformats2raw.Converter@6b67034): java.lang.NullPointerException
    at picocli.CommandLine.executeUserObject(CommandLine.java:1962)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at picocli.CommandLine.call(CommandLine.java:2761)
    at com.glencoesoftware.bioformats2raw.Converter.main(Converter.java:1808)
  Caused by: java.lang.NullPointerException
    at com.upplication.s3fs.S3AccessControlList.hasPermission(S3AccessControlList.java:39)
    at com.upplication.s3fs.S3AccessControlList.checkAccess(S3AccessControlList.java:50)
    at com.upplication.s3fs.S3FileSystemProvider.checkAccess(S3FileSystemProvider.java:470)
    at java.nio.file.Files.isAccessible(Files.java:2455)
    at java.nio.file.Files.isReadable(Files.java:2490)
    at com.bc.zarr.storage.FileSystemStore.getInputStream(FileSystemStore.java:61)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:103)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:96)
    at com.bc.zarr.ZarrArray.open(ZarrArray.java:92)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1039)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:1286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Afaik, this S3 is not on AWS at all; we use Ceph (https://www.redhat.com/en/technologies/storage/ceph). But it should be similar to what EBI is using at s3.embassy.ebi.ac.uk/idr-upload... @prete any ideas?

joshmoore commented 2 years ago

I assume then that we will need to start using our own S3 filesystem. See https://imagesc.zulipchat.com/#narrow/stream/212929-general/topic/ome-zarr.20basics.3A.20writing.20.20to.20s3/near/281819192 for a related conversation. It would be good to know how much time/space uploading directly will save you, to gauge how important it is to prioritize this.

BioinfoTongLI commented 2 years ago

I see; it is currently not the most urgent task. Nextflow can do the push and works fine with our Ceph storage. The cost is that we need to duplicate the data before the push. Though, this might become an issue when it comes to the real atlas dataset (100+ whole-embryo images). Let's prioritize this in the next milestone.
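That two-step workaround (convert locally, then push to S3-compatible storage) can be sketched with awscli. All names here are hypothetical, and DRY_RUN just prints the commands instead of executing them:

```shell
# Sketch of the two-step alternative: local conversion, then an S3 sync.
# Image name, bucket, and endpoint are placeholders, not real values.
# Set DRY_RUN= (empty) to actually run the commands.
DRY_RUN=echo
IMAGE="image.tif"
OUT="out.zarr"

# Step 1: convert the image to Zarr on local disk.
$DRY_RUN bioformats2raw "$IMAGE" "$OUT"

# Step 2: push the resulting Zarr to the S3-compatible endpoint.
$DRY_RUN aws s3 sync "$OUT" "s3://webatlas/deleteme/$OUT" \
    --endpoint-url https://cog.sanger.ac.uk
```

The trade-off is exactly the one mentioned above: this needs local disk space for the intermediate Zarr, but avoids the direct-write code path that triggers the NullPointerException.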

prete commented 2 years ago

Afaik, this s3 is not on aws at all. Instead, we used CEPH (https://www.redhat.com/en/technologies/storage/ceph). But should be similar to what EBI is using s3.embassy.ebi.ac.uk/idr-upload...

Indeed, it's Ceph's RADOS Gateway. Note: the aws Josh used is the awscli tool, which can also talk to S3-compatible storage (like our "Sanger S3"). Think of it as an s3cmd alternative.

Uploading from bioformats2raw should work like this for you:

bioformats2raw \
    --output-options "s3fs_access_key=${accessKey}|s3fs_secret_key=${secretKey}|s3fs_path_style_access=true" \
    ${image} \
    s3://cog.sanger.ac.uk/webatlas/deleteme/

Keep in mind that uploading straight to S3 will slow down the process, because uploading is slower than disk I/O. But, like you said, it won't duplicate the data and you won't have to copy it afterwards... so it's up to you!
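For reference, the --output-options value in the command above is a single pipe-separated string of key=value pairs, so quoting the whole thing matters. A quick way to sanity-check what you are actually passing (credential values are placeholders):

```shell
# --output-options takes one pipe-separated string of key=value pairs.
# KEY/SECRET below are placeholders, not real credentials.
OPTS='s3fs_access_key=KEY|s3fs_secret_key=SECRET|s3fs_path_style_access=true'

# Split on '|' to list each option on its own line for inspection.
echo "$OPTS" | tr '|' '\n'
```

If the string is left unquoted, the shell can mangle the `|` characters before bioformats2raw ever sees them, which is an easy way to end up with silently missing credentials.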

BioinfoTongLI commented 2 years ago

Thanks @prete! Interestingly, your syntax works... I am pretty sure the original authentication version works as well, since I do have files created by bioformats2raw. But it seems that passing the credentials through --output-options is the right way.

@joshmoore worth opening an issue with Glencoe?

joshmoore commented 2 years ago

My best guess is that the difference is s3://cog.sanger.ac.uk/webatlas vs. s3://webatlas.cog.sanger.ac.uk/. All of this comes down to the fact that the S3 "standard" is a far cry from POSIX. You can open an issue on bioformats2raw, but this is more a question of the underlying FileSystem implementation -- https://github.com/lasersonlab/Amazon-S3-FileSystem-NIO2 -- and if you look at the upstream repo's issues (https://github.com/Upplication/Amazon-S3-FileSystem-NIO2/issues) you'll see that the latest one is "is this dead?". I've brought this up a few times on image.sc. Ultimately, we will likely need to converge on a single implementation as a community.

BioinfoTongLI commented 1 year ago

All seems to be working fine. Closing this for now; reopen if needed.