saalfeldlab / n5-spark

Spark-driven processing utilities for N5 datasets.
BSD 2-Clause "Simplified" License

Add option to overwrite existing to N5DownsamplerSpark.downsample #12

Closed hanslovsky closed 4 years ago

igorpisarev commented 5 years ago

@hanslovsky Sorry for not commenting earlier, I somehow missed the notification about this PR. I'm afraid it has the same potential issue as #1: the downsampler saves only non-empty blocks, so if a block is non-empty in the old data but empty in the new data, the old block will be left untouched and the resulting dataset will be inconsistent. We can use the same workaround where we save all blocks, including empty ones, when requested to overwrite existing data. An alternative may be to delete the unneeded data block files, but I think the current N5 API doesn't support that.
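For illustration, a minimal sketch of that workaround, assuming the n5-imglib2 helpers N5Utils.saveBlock and N5Utils.saveNonEmptyBlock (the surrounding method and variable names are hypothetical):

```java
import java.io.IOException;

import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.NativeType;

class OverwriteWorkaroundSketch {

    // When overwriting, write every block -- even all-default ones -- so that
    // a non-empty block left over from the old data cannot survive at a grid
    // position that is empty in the new data.
    static <T extends NativeType<T>> void writeDownsampledBlock(
            final RandomAccessibleInterval<T> targetBlock,
            final N5Writer n5,
            final String outputDataset,
            final long[] gridOffset,
            final T defaultValue,
            final boolean overwriteExisting) throws IOException {

        if (overwriteExisting)
            N5Utils.saveBlock(targetBlock, n5, outputDataset, gridOffset);
        else
            N5Utils.saveNonEmptyBlock(targetBlock, n5, outputDataset, gridOffset, defaultValue);
    }
}
```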

The same consideration also applies to the case where the old dataset is larger than the new one: the overhanging old blocks will be left untouched. However, since readers only operate within the new dimensions, the leftover old blocks should simply be ignored, so in my understanding this is fine as it is, unless I'm missing something.

I would vote for adding a workaround for empty blocks similar to the one in #1, and it would also be great if you could add a command-line argument to expose the new overwrite functionality when running as a standalone package.

hanslovsky commented 5 years ago

> We can use the same workaround where we save all blocks, including empty ones, when requested to overwrite existing data.

That is a reasonable option, but I think I would prefer not to write blocks that are empty, e.g. in scenarios where a significant portion of the data is empty, as brought up in saalfeldlab/paintera-conversion-helper#33. What counts as empty in the case of N5DownsamplerSpark? I can imagine two options: (1) the entire output block is zero (or some other default value), or (2) none of the contributing input blocks exist.
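As a sketch of option (1), assuming imglib2's Type.valueEquals (the helper name isEmpty is hypothetical):

```java
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.Type;
import net.imglib2.view.Views;

class EmptyBlockCheckSketch {

    // Option (1): a block is "empty" if every pixel equals the default value.
    static <T extends Type<T>> boolean isEmpty(
            final RandomAccessibleInterval<T> block,
            final T defaultValue) {

        for (final T pixel : Views.iterable(block))
            if (!pixel.valueEquals(defaultValue))
                return false;
        return true;
    }
}
```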

I do prefer deleting instead of writing empty blocks, but that would require changes to N5Writer. I will look into this. And eventually, we should update the converter to remove empty blocks as well. I am OK with waiting for those changes and then returning to this PR, but if you prefer an immediate temporary solution, I can go ahead and write out empty blocks.

> However, since readers only operate within the new dimensions, the leftover old blocks should simply be ignored, so in my understanding this is fine as it is, unless I'm missing something.

Agreed.

> it would also be great if you could add a command-line argument to expose the new overwrite functionality

Will do, once the internals of overwrite are figured out.

igorpisarev commented 5 years ago

I use the default value of the imglib2 type obtained via createVariable(). The downsampler first checks whether the input interval is empty, and then n5-imglib2 checks whether the output block is empty or not: https://github.com/saalfeldlab/n5-spark/blob/d49c8492106abface1d1dc7a81cb2c25c8ec6443/src/main/java/org/janelia/saalfeldlab/n5/spark/downsample/N5DownsamplerSpark.java#L212-L227
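Conceptually, the pattern is the following (a hedged paraphrase of the linked code, not a verbatim copy):

```java
import java.io.IOException;

import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.NativeType;
import net.imglib2.util.Util;

class SaveNonEmptySketch {

    // The default value is a zero-initialized variable of the block's pixel
    // type; saveNonEmptyBlock skips the write when every pixel equals it.
    static <T extends NativeType<T>> void saveIfNonEmpty(
            final RandomAccessibleInterval<T> targetBlock,
            final N5Writer n5,
            final String outputDataset,
            final long[] gridOffset) throws IOException {

        final T defaultValue = Util.getTypeFromInterval(targetBlock).createVariable();
        N5Utils.saveNonEmptyBlock(targetBlock, n5, outputDataset, gridOffset, defaultValue);
    }
}
```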

I agree that it would be good to be able to delete blocks through the N5 API. Maybe it's even possible without modifying the interface, e.g. a block could be deleted by calling writeBlock() with null.

hanslovsky commented 5 years ago

Unfortunately, null does not work for writeBlock because the block position is provided by the DataBlock itself (DataBlock.getGridPosition()), but a special implementation of DataBlock could indicate deletion.

Anyway, I created saalfeldlab/n5#55 for discussing the details.

igorpisarev commented 4 years ago

Now that https://github.com/saalfeldlab/n5/pull/58 has been merged, this can be updated to delete blocks that were present in the original dataset but are not present in the new dataset.

hanslovsky commented 4 years ago

@igorpisarev I added an N5Writer.deleteBlock call before the N5Utils.saveNonEmptyBlock call. This depends on the latest saalfeldlab/n5 master and needs to wait for a release. Label data should be handled similarly in saalfeldlab/label-utilities-spark and saalfeldlab/paintera-conversion-helper.
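Roughly (a sketch, assuming the N5Writer.deleteBlock(String, long...) signature from the n5 PR mentioned above; the surrounding names are hypothetical):

```java
import java.io.IOException;

import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.NativeType;

class DeleteThenSaveSketch {

    // Delete any stale block at this grid position first, then write the new
    // block only if it is non-empty; together this keeps the overwritten
    // dataset consistent without having to write all-empty blocks.
    static <T extends NativeType<T>> void overwriteBlock(
            final RandomAccessibleInterval<T> targetBlock,
            final N5Writer n5,
            final String outputDataset,
            final long[] gridPosition,
            final T defaultValue) throws IOException {

        n5.deleteBlock(outputDataset, gridPosition);
        N5Utils.saveNonEmptyBlock(targetBlock, n5, outputDataset, gridPosition, defaultValue);
    }
}
```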

Update: Once there is a release of saalfeldlab/n5, this branch should probably be rebased onto master to resolve conflicts without introducing unnecessary merge commits.

Update2: I realized that we need something like N5Utils.saveNonEmptyBlock but for deleteBlock here. I copy-pasted it for now, but it would probably be a good idea to extract that functionality into n5-imglib2.

hanslovsky commented 4 years ago

This should also be addressed in N5LabelDownsamplerSpark and N5ConvertSpark.

igorpisarev commented 4 years ago

@hanslovsky Good point about deleting all contained blocks, since the parallelization may operate on a larger processing block that spans several storage blocks. I agree that it makes sense to move this method to n5-imglib2, as it follows the same logic and would probably be useful in several different places.

hanslovsky commented 4 years ago

I just realized that I had made the changes in the wrong file. I added the proposed changes to N5DownsamplerSpark as well as to N5ConvertSpark; overwriting is now possible in both.

Right now, both of these contain the same static deleteBlock method, which should be moved into n5-imglib2 (or maybe even n5 core, without the imglib2 Interval dependency).
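For reference, a minimal sketch of such an Interval-based helper (the name deleteContainedBlocks is hypothetical, and it assumes non-negative pixel coordinates):

```java
import java.io.IOException;

import org.janelia.saalfeldlab.n5.DatasetAttributes;
import org.janelia.saalfeldlab.n5.N5Writer;

import net.imglib2.Interval;

class DeleteBlocksSketch {

    // Delete every storage block that intersects the given interval (pixel
    // coordinates, assumed non-negative so that integer division floors).
    static void deleteContainedBlocks(
            final N5Writer n5,
            final String dataset,
            final Interval interval) throws IOException {

        final DatasetAttributes attributes = n5.getDatasetAttributes(dataset);
        final int[] blockSize = attributes.getBlockSize();
        final int n = interval.numDimensions();

        // grid-position range covered by the interval (inclusive)
        final long[] minBlock = new long[n];
        final long[] maxBlock = new long[n];
        for (int d = 0; d < n; ++d) {
            minBlock[d] = interval.min(d) / blockSize[d];
            maxBlock[d] = interval.max(d) / blockSize[d];
        }

        // odometer-style iteration over all grid positions in the range
        final long[] gridPosition = minBlock.clone();
        while (true) {
            n5.deleteBlock(dataset, gridPosition);
            int d = 0;
            for (; d < n; ++d) {
                if (++gridPosition[d] <= maxBlock[d])
                    break;
                gridPosition[d] = minBlock[d];
            }
            if (d == n)
                break;
        }
    }
}
```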

igorpisarev commented 4 years ago

I've updated the PR to use N5Utils.deleteBlock(), but it seems that this is not enough: we also need to make sure that the new block size is the same as the block size of the existing dataset. Otherwise there may be problems when the new block size is not a multiple of the old block size. In such cases, the best option would be to delete the old dataset with N5RemoveSpark and then simply write out the new dataset.
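Something along these lines (a sketch only; N5RemoveSpark.remove and the N5WriterSupplier names follow n5-spark's API as I understand it, so treat the exact signatures as assumptions):

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.janelia.saalfeldlab.n5.DatasetAttributes;
import org.janelia.saalfeldlab.n5.spark.N5RemoveSpark;
import org.janelia.saalfeldlab.n5.spark.supplier.N5WriterSupplier;

class OverwriteGuardSketch {

    // Per-block deletion only lines up when the block sizes match; if they
    // differ, remove the old dataset entirely before writing the new one.
    static void prepareForOverwrite(
            final JavaSparkContext sparkContext,
            final N5WriterSupplier n5Supplier,
            final String outputDataset,
            final int[] newBlockSize) throws IOException {

        final DatasetAttributes existing = n5Supplier.get().getDatasetAttributes(outputDataset);
        if (existing != null && !Arrays.equals(existing.getBlockSize(), newBlockSize))
            N5RemoveSpark.remove(sparkContext, n5Supplier, outputDataset);
    }
}
```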