hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 7 forks source link

Consolidate Metadata #142

Closed mavaylon1 closed 9 months ago

mavaylon1 commented 9 months ago

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

Addressing #139. By default, the metadata will be consolidated into one file, .zmetadata. This will check the existence of this file and will still open the file even if the metadata is not consolidated.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

mavaylon1 commented 9 months ago

Fix #139

codecov-commenter commented 9 months ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (b51e127) 86.10% compared to head (f23e3f3) 86.45%. Report is 5 commits behind head on dev.

Files Patch % Lines
tests/unit/utils.py 71.42% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #142 +/- ## ========================================== + Coverage 86.10% 86.45% +0.35% ========================================== Files 13 14 +1 Lines 3189 3279 +90 ========================================== + Hits 2746 2835 +89 - Misses 443 444 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mavaylon1 commented 9 months ago

Comments for review

  1. There are a few instances in backend.py that remain zarr.open() rather than __open_consolidated.
  1. Why are there only two tests? With the zarr.consolidate_metadata being called at the end of write_builder, it will generate test examples (including examples with references) that have zmetadata, meaning it's covered by existing tests. (As seen in coverage).

  2. Why is there no consolidate_metadata method anymore? This method was here briefly in this PR, but I removed it since it is literally calling the same method within zarr. I think wrapping it with the same name would just create confusion if something goes wrong, meaning the issue is not us but zarr. I don't want us to have issue tickets saying "it's broken" just because they are not using the zarr method correctly.

oruebel commented 9 months ago

Thanks for the additional comments.

  1. There are a few instances in backend.py that remain zarr.open() rather than __open_consolidated.

    • In Line 81, I left this as a simple test of whether it can be read.

    • In Line 199, the goal is to load the cached namespace from a file.

I agree, it makes sense to leave zarr.open in those two instances.

2. Why are there only two tests? With the zarr.consolidate_metadata being called at the end of write_builder, it will generate test examples (including examples with references) that have zmetadata, meaning it's covered by existing tests. (As seen in coverage).

Makes sense. This raises the question whether we need some flag in the tests to run them with and without consolidation of metadata. I.e., I assume now those tests always run with consolidated metadata but no longer test for the case where metadata is not consolidated.

3. Why is there no consolidate_metadata method anymore? This method was here briefly in this PR, but I removed it since it is literally calling the same method within zarr. I think wrapping it with the same name would just create confusion if something goes wrong, meaning the issue is not us but zarr. I don't want us to have issue tickets saying "it's broken" just because they are not using the zarr method correctly.

When would the user directly need to call consolidate_metadata? If there is a case where a user will need to call this method in zarr, then I think it would be useful to mention that somewhere in the docs.

mavaylon1 commented 9 months ago

Thanks for the additional comments.

  1. There are a few instances in backend.py that remain zarr.open() rather than __open_consolidated.

    • In Line 81, I left this as a simple test of whether it can be read.
    • In Line 199, the goal is to load the cached namespace from a file.

I agree, it makes sense to leave zarr.open in those two instances.

  1. Why are there only two tests? With the zarr.consolidate_metadata being called at the end of write_builder, it will generate test examples (including examples with references) that have zmetadata, meaning it's covered by existing tests. (As seen in coverage).

Makes sense. This raises the question whether we need some flag in the tests to run them with and without consolidation of metadata. I.e., I assume now those tests always run with consolidated metadata but no longer test for the case where metadata is not consolidated.

  1. Why is there no consolidate_metadata method anymore? This method was here briefly in this PR, but I removed it since it is literally calling the same method within zarr. I think wrapping it with the same name would just create confusion if something goes wrong, meaning the issue is not us but zarr. I don't want us to have issue tickets saying "it's broken" just because they are not using the zarr method correctly.

When would the user directly need to call consolidate_metadata? If there is a case where a user will need to call this method in zarr, then I think it would be useful to mention that somewhere in the docs.

  1. Well with this PR, it will always be consolidated. I think it raises a question for how many zarr files are in the wild already (assuming with the Allen Institute) that are from hdmf-zarr prior to this PR. Even so, I can use example.zarr for a test for this case.
  2. The user would need to call it again if they edit the file after it has been written. I can make a quick tutorial for sure.
oruebel commented 9 months ago

Can you also add a brief section on consolidation of metadata to https://github.com/hdmf-dev/hdmf-zarr/blob/dev/docs/source/storage.rst so that we have a record in the docs on where this is stored and why.

oruebel commented 9 months ago

2. Well with this PR, it will always be consolidated.

I don't think we want to make the hard assumption that metadata will always be consolidated.

mavaylon1 commented 9 months ago

This looks overall good to me. One quick question, are we still running any tests now with consolidate_metadata=False or all tests now run with consolidated metadata only?

There's one test to check that it runs with False, in relation to consolidated metadata. However, the rest is using the default True. I don't think it's a big enough change to test for the entire test suite for both but thoughts?

oruebel commented 9 months ago

I don't think it's a big enough change to test for the entire test suite for both but thoughts?

I think that would be nice, if only because it would help safeguard against potential issues as we make changes in the future. That being said, configuring the test suite to run in both modes is probably a non-trivial undertaking and is not worth holding up this PR for. Please just create an issue for future reference and we can decide separately on how to proceed on that question.