hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 6 forks source link

Broken Links when Exporting #194

Closed mavaylon1 closed 1 day ago

mavaylon1 commented 6 months ago

Motivation

This is a refactor of export to formally define mechanisms for references and internal/external links. This Pr is the first stage as it covers only Zarr to Zarr export and does not yet support compound cases.

This also removes TempStore from the export test suite in favor of files that will be created and removed during testing. This helps with debugging and also allows for behavior that better matches the real world.

Export Cases go as follows:

  1. Internal References/Links need to be preserved and not made into ExternalReferences
  2. External Links need to be preserved to the correct target.
  3. We do not support External References.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

codecov-commenter commented 6 months ago

Codecov Report

Attention: Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.52%. Comparing base (fedf38e) to head (8c481ce).

Files with missing lines Patch % Lines
src/hdmf_zarr/backend.py 92.72% 3 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #194 +/- ## ========================================== + Coverage 86.36% 86.52% +0.16% ========================================== Files 5 5 Lines 1188 1210 +22 Branches 297 304 +7 ========================================== + Hits 1026 1047 +21 - Misses 103 104 +1 Partials 59 59 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.


🚨 Try these New Features:

mavaylon1 commented 4 months ago

More Notes: FileA is a HDF5 file and is exported to FileB, a zarr file. FileA has both internal links, external links, and references (which are always internal for us). I remember you saying we don't do external links, but maybe my memory is off. I ask because the export doc talks about external links. Because the backends are different, everything is copied over. Does that mean during export, FileB will have copies of what is being externally linked inside FileB? That also means every internal link and every reference is also preserved. Now I export FileB to FileC, which is still zarr. However, I add a few containers and append to existing containers. What is happening right now in zarr is that new data is added correctly; however, we now create a link to the FileB from FileC if we don't specify copy. This is wrong in that this isn't what export is supposed to do as external links can break easily from moving files. (which is why I ask about external links and why is it talked about in the tutorial). What we need to do is copy everything to FileC, preserving internal links and references.

If FileA contains an external link to a dataset in FileX, then FileB should also contain an external link to the dataset in FileX. 6:23 Same for FileB and FileC. 6:24 If FileB is read, a new external link is added, and then the file is exported to FileC, then it is written as an external link 6:25 If FileB is read, a new external link is added, and then the file is exported to FileC with write_args={'link_data': False}, then the linked dataset is copied 6:25 This is a very specific, niche case

mavaylon1 commented 4 months ago

Goal for this PR:

  1. We want to make sure the external links are just copied when we export, not preserved as links.
    • When export Zarr File A to Zarr File B and add stuff to existing containers, they should all not be links. (this is what Alessio needs)
oruebel commented 3 months ago

Just in case this is relevant for this PR. The following test cases mirror tests from HDMF but were disabled in the hdfm_zarr test suite because links on export didn't fully work. If this PR fixes this, then we should also look at updating these tests.

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/base_tests_zarrio.py#L1329-L1521

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/test_io_convert.py#L993-L1069

mavaylon1 commented 3 months ago

Just in case this is relevant for this PR. The following test cases mirror tests from HDMF but were disabled in the hdfm_zarr test suite because links on export didn't fully work. If this PR fixes this, then we should also look at updating these tests.

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/base_tests_zarrio.py#L1329-L1521

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/test_io_convert.py#L993-L1069

Good to know. I believe my tests are similar if not the same ones. Thanks for pointing this out so we don't have duplicates.

mavaylon1 commented 3 months ago

Related Issues: #179 #205

mavaylon1 commented 3 weeks ago

Review Notes: This is the first stage PR. Some tests are still commented out, but the changes should be enough to unblock the Allen Institute as long as they do not have compound cases (TBD) and backend changes (they do not).

Things are subject to change during a refactor that will be done before the end of November. This is to really checkpoint the work and have the team be on the same page with the changes. That being said, optimizations will be made and the final product is what will lead to a large release of hdmf-zarr.

I don't have an issue ticket this directly fixes. However, we do have some that are related. #179

oruebel commented 3 weeks ago

I'm still making my way though the changes in backend.py. Just to double-check, are the changes here compatible with existing NWB Zarr files? As far as I can tell, the answer is probably yes, but wanted to check.

You may also want to check that the docs at https://github.com/hdmf-dev/hdmf-zarr/blob/dev/docs/source/storage.rst don't need to be updated for this.

mavaylon1 commented 3 weeks ago

I'm still making my way though the changes in backend.py. Just to double-check, are the changes here compatible with existing NWB Zarr files? As far as I can tell, the answer is probably yes, but wanted to check.

You may also want to check that the docs at https://github.com/hdmf-dev/hdmf-zarr/blob/dev/docs/source/storage.rst don't need to be updated for this.

So I believe any file with just internal links and references should be fine. Anything with External may not work.

oruebel commented 3 weeks ago

So I believe any file with just internal links and references should be fine. Anything with External may not work.

In that case, I would ping Alessio to see if he can test out the PR to make sure it works with their files

mavaylon1 commented 3 weeks ago

So I believe any file with just internal links and references should be fine. Anything with External may not work.

In that case, I would ping Alessio to see if he can test out the PR to make sure it works with their files

I will ping him once we are solid on this PR. Let me address your comments and we can shoot to ping him Monday!

rly commented 2 days ago

I made some minor comments regarding the tests. Otherwise looks good to me.