Closed hanslovsky closed 4 years ago
@hanslovsky Sorry for not commenting earlier, I somehow missed the notification about this PR. I'm afraid that it has the same potential issue as in #1: the downsampler saves only non-empty blocks, so if a block is non-empty in the old data but empty in the new data, the old block will be left untouched, and the resulting dataset will be inconsistent. We can use the same workaround where we save all blocks, including empty ones, if requested to overwrite existing data. An alternative may be to delete the unneeded data block files, but I think the current N5 API doesn't support that.
The same consideration also applies when the old dataset is larger than the new one, and the overhanging old blocks are left untouched. Although, since the readers will work only within the new dimensions, the rest of the old blocks should simply be ignored. So in my understanding this should be fine as it is, if I'm not missing anything.
I would vote for adding a similar workaround as in #1 for empty blocks, and it would also be great if you could add a command line argument to support the new overwrite functionality when running as a standalone package.
> We can use the same workaround where we save all blocks, including empty ones, if requested to overwrite existing data.
That is a reasonable option, but I think I prefer not to write blocks that are empty, e.g. in scenarios where a significant portion of the data is empty, as brought up in saalfeldlab/paintera-conversion-helper#33. What is empty in the case of N5DownsamplerSpark? I can imagine two options: (1) the entire output block is zero (or some other default value), or (2) none of the contributing input blocks exist.
I do prefer deleting over writing empty blocks, but that would require changes to N5Writer. I will look into this. And eventually, we should update the converter to remove empty blocks as well. I am ok with waiting for those changes and then returning to this PR, but if you prefer an immediate temporary solution, I can go ahead and write out empty blocks.
> Although, since the readers will work only within the new dimensions, the rest of the old blocks should simply be ignored. So in my understanding this should be fine as it is, if I'm not missing anything.
Agreed.
> and it would also be great if you could add a command line argument to support the new overwrite functionality
Will do, once the internals of overwrite are figured out.
I use the default value of the imglib2 type via createVariable(): first I check whether the input interval is empty, and then n5-imglib2 checks whether the output block is empty or not:
https://github.com/saalfeldlab/n5-spark/blob/d49c8492106abface1d1dc7a81cb2c25c8ec6443/src/main/java/org/janelia/saalfeldlab/n5/spark/downsample/N5DownsamplerSpark.java#L212-L227
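For reference, here is a minimal sketch of that pattern (the class and method names are mine; the saveNonEmptyBlock overload that derives the grid offset from the source interval is assumed, and the linked code above is the authoritative version):

```java
import java.io.IOException;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.NativeType;
import net.imglib2.util.Util;
import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

class NonEmptyBlockWriter {

    // Write the target block only if it differs from the type's default value.
    static <T extends NativeType<T>> void writeIfNonEmpty(
            final RandomAccessibleInterval<T> targetBlock,
            final N5Writer n5,
            final String dataset) throws IOException {

        // default value of the imglib2 type (e.g. 0 for numeric types)
        final T defaultValue = Util.getTypeFromInterval(targetBlock).createVariable();

        // n5-imglib2 compares the block contents against defaultValue
        // and skips the write when every pixel matches it
        N5Utils.saveNonEmptyBlock(targetBlock, n5, dataset, defaultValue);
    }
}
```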
I agree that it's a good idea to be able to delete blocks through the N5 API. Maybe it's even possible without modifying the interface, e.g. a block could be deleted by calling writeBlock() with null.
Unfortunately, null does not work for writeBlock because the DataBlock also provides the block position (DataBlock.getGridPosition()), but a special implementation of DataBlock could indicate deletion.
Anyway, I created saalfeldlab/n5#55 for discussing the details.
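For illustration only, such a deletion marker could look roughly like this (hypothetical class, method set approximated from the DataBlock interface of the time; the actual design is what saalfeldlab/n5#55 is meant to settle):

```java
import java.nio.ByteBuffer;
import org.janelia.saalfeldlab.n5.DataBlock;

// Hypothetical marker: a writer could test `block instanceof DeletedDataBlock`
// and remove the block file instead of writing it.
class DeletedDataBlock implements DataBlock<Void> {

    private final long[] gridPosition;

    DeletedDataBlock(final long... gridPosition) {
        this.gridPosition = gridPosition;
    }

    @Override public long[] getGridPosition() { return gridPosition; }

    @Override public int[] getSize() { return new int[0]; }

    @Override public Void getData() { return null; }

    @Override public int getNumElements() { return 0; }

    // A deletion marker carries no payload, so serialization is undefined.
    @Override public ByteBuffer toByteBuffer() {
        throw new UnsupportedOperationException("deletion marker has no data");
    }

    @Override public void readData(final ByteBuffer buffer) {
        throw new UnsupportedOperationException("deletion marker has no data");
    }
}
```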
Now that https://github.com/saalfeldlab/n5/pull/58 has been merged, this can be updated to delete blocks that were present in the original dataset but are not present in the new dataset.
@igorpisarev I added an N5Writer.deleteBlock call before N5Utils.saveNonEmptyBlock. This depends on the latest saalfeldlab/n5 master and needs to wait for a release. The same should be done for label data in saalfeldlab/label-utilities-spark and saalfeldlab/paintera-conversion-helper.
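The change boils down to something like the following sketch (helper name is mine; deleteBlock is assumed to have the varargs signature from saalfeldlab/n5 master, and grid offset handling is simplified; the real code passes the offset explicitly):

```java
import java.io.IOException;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.NativeType;
import net.imglib2.util.Util;
import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

class OverwritingBlockWriter {

    // Overwrite semantics: always clear the block file first, then write it
    // back only if the new contents are non-empty. This prevents stale blocks
    // from an earlier run surviving where the new data is empty.
    static <T extends NativeType<T>> void overwriteBlock(
            final RandomAccessibleInterval<T> block,
            final N5Writer n5,
            final String dataset,
            final long[] gridPosition) throws IOException {

        // delete whatever a previous export left at this grid position
        n5.deleteBlock(dataset, gridPosition);

        // write the new block unless it only holds the default value
        final T defaultValue = Util.getTypeFromInterval(block).createVariable();
        N5Utils.saveNonEmptyBlock(block, n5, dataset, defaultValue);
    }
}
```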
Update: Once there is a release of saalfeldlab/n5, this branch should probably be rebased onto master to resolve conflicts without introducing unnecessary merge commits.
Update 2: I realized that we need something like N5Utils.saveNonEmptyBlock but for deleteBlock here. I copy-pasted the code for now, but it would probably be a good idea to extract that functionality into n5-imglib2.
This should also be addressed in N5LabelDownsamplerSpark and N5ConvertSpark.
@hanslovsky Good point about deleting all contained blocks, since the parallelization could be done over a larger block. I agree that it makes sense to move this method to n5-imglib2 as it follows the same logic and would probably be useful in several different places.
I just realized that I had made the changes in the wrong file. I added the proposed changes to N5DownsamplerSpark as well as to N5ConvertSpark; overwriting is now possible in:
N5DownsamplerSpark
N5LabelDownsamplerSpark
N5ConvertSpark
Right now, all of these contain the same static deleteBlock method that should be introduced in n5-imglib2 (or maybe even in n5 core, without the Interval interface); a sketch of that shared method follows below.
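A rough sketch of what that shared helper could look like (hypothetical standalone version; it assumes non-negative intervals, which holds for N5 datasets since they start at 0):

```java
import java.io.IOException;
import net.imglib2.Interval;
import org.janelia.saalfeldlab.n5.DatasetAttributes;
import org.janelia.saalfeldlab.n5.N5Writer;

class BlockDeleter {

    // Delete every grid block that the given interval touches.
    static void deleteBlock(
            final Interval interval,
            final N5Writer n5,
            final String dataset) throws IOException {

        final DatasetAttributes attributes = n5.getDatasetAttributes(dataset);
        final int[] blockSize = attributes.getBlockSize();
        final int n = interval.numDimensions();

        // grid coordinates of the first and last block covered by the interval
        final long[] minGrid = new long[n];
        final long[] maxGrid = new long[n];
        for (int d = 0; d < n; ++d) {
            minGrid[d] = interval.min(d) / blockSize[d];
            maxGrid[d] = interval.max(d) / blockSize[d];
        }

        // odometer-style iteration over all covered grid positions
        final long[] gridPosition = minGrid.clone();
        while (true) {
            n5.deleteBlock(dataset, gridPosition);
            int d = 0;
            while (d < n) {
                if (++gridPosition[d] <= maxGrid[d])
                    break;
                gridPosition[d] = minGrid[d];
                ++d;
            }
            if (d == n)
                break; // wrapped in every dimension: all blocks visited
        }
    }
}
```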
I've updated the PR with N5Utils.deleteBlock(), but it seems that it's not enough: we also need to make sure that the new block size is the same as the old block size (of the existing dataset). Otherwise, there may be problems when the new block size is not a multiple of the old block size. In such cases, the best option would be to delete the old dataset with N5RemoveSpark and then simply write out the new dataset.
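A sketch of that guard (hypothetical helper; when it returns false, the old dataset would be removed first, e.g. with N5RemoveSpark, and the new one written from scratch):

```java
import java.io.IOException;
import java.util.Arrays;
import org.janelia.saalfeldlab.n5.N5Writer;

class OverwriteGuard {

    // In-place overwriting (deleteBlock + saveNonEmptyBlock) is only safe
    // when old and new block sizes match; otherwise old and new block
    // boundaries do not line up and stale partial blocks can survive.
    static boolean canOverwriteInPlace(
            final N5Writer n5,
            final String dataset,
            final int[] newBlockSize) throws IOException {

        if (!n5.datasetExists(dataset))
            return true; // nothing to overwrite

        final int[] oldBlockSize = n5.getDatasetAttributes(dataset).getBlockSize();
        return Arrays.equals(oldBlockSize, newBlockSize);
    }
}
```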