hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

Broken Links when Exporting #194

Open mavaylon1 opened 4 months ago

mavaylon1 commented 4 months ago

Motivation

What was the reasoning behind this change? Please explain the changes briefly. Export is not supposed to create links to the prior file; it is only meant to offer the option of preserving existing links. This means that if File A has links to some File C, then when we export File A to File B, File B will also have links to File C.

Problem 1: HDMF-Zarr is missing the logic that HDMF has in write_dataset, namely the conditionals that decide how links are handled during export.

Problem 2: When links are created (in the cases where they are supposed to be created), they use absolute paths. They are supposed to use relative paths. Both can break when you move things, but absolute paths will always break.
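
To illustrate Problem 2, here is a minimal sketch that uses only the Python standard library. The helper names `relative_link_path` and `resolve_link` and the directory layout are hypothetical and are not part of hdmf-zarr. The sketch only shows that a link stored relative to the linking file still resolves after both files are moved together, while the stored absolute path does not.

```python
import os
import shutil
import tempfile


def relative_link_path(link_file: str, target_file: str) -> str:
    """Path of target_file relative to the directory that holds link_file."""
    return os.path.relpath(target_file, start=os.path.dirname(link_file))


def resolve_link(link_file: str, stored_path: str) -> str:
    """Resolve a stored link path against the directory of the linking file.

    os.path.join returns stored_path unchanged when it is already absolute.
    """
    return os.path.normpath(os.path.join(os.path.dirname(link_file), stored_path))


# Hypothetical layout: two Zarr stores (directories) side by side.
root = tempfile.mkdtemp()
old_dir = os.path.join(root, "old")
new_dir = os.path.join(root, "new")
link_file = os.path.join(old_dir, "file_b.zarr")
target = os.path.join(old_dir, "file_c.zarr")
os.makedirs(link_file)
os.makedirs(target)

absolute = os.path.abspath(target)                # what an absolute link would store
relative = relative_link_path(link_file, target)  # what a relative link would store

# Move both stores together to a new location.
shutil.move(old_dir, new_dir)
moved_link_file = os.path.join(new_dir, "file_b.zarr")

absolute_ok = os.path.exists(resolve_link(moved_link_file, absolute))
relative_ok = os.path.exists(resolve_link(moved_link_file, relative))
shutil.rmtree(root)
```

A relative link still breaks if only one of the two files is moved, which matches the note above that both kinds of path can break.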

Problem 3: When we create a reference, the source path is written as the shorthand "." to represent the file the reference is currently in. We need to add logic in resolve_ref to handle links.
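
One possible reading of Problem 3, sketched with the standard library only: when a reference is reached through a link, a source of "." should resolve against the file in which the reference is stored, not against the file that was opened. The function `resolve_reference_source` is hypothetical and is not the actual resolve_ref in hdmf-zarr. Resolving other relative sources against the directory of the storing file is also an assumption.

```python
import posixpath


def resolve_reference_source(source: str, stored_in: str) -> str:
    """Return the file a reference points into.

    source:    the 'source' value recorded with the reference
    stored_in: path of the file in which the reference is physically stored
               (for a reference reached through a link, this is the link
               target, not the file that was opened)
    """
    if source == ".":
        # Shorthand for "the file this reference lives in".
        return stored_in
    if posixpath.isabs(source):
        return posixpath.normpath(source)
    # Assumption: other relative sources are relative to the storing file's directory.
    return posixpath.normpath(posixpath.join(posixpath.dirname(stored_in), source))


# A reference stored in file_c.zarr, reached while reading file_b.zarr via a link.
same_file = resolve_reference_source(".", "/data/file_c.zarr")
other_file = resolve_reference_source("../other/file_x.zarr", "/data/file_c.zarr")
```

The key design point is that the caller passes the path of the file that stores the reference, so "." never silently resolves to the file that merely links to it.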

What to do while this is being fixed:

Always use write_args={'link_data': False}.

I will divide the problem into stages:

Stage 1 (PR 1): Add updated export logic into write_dataset

Stage 2 (PR 2): Add logic into resolve_ref to resolve references in links

Stage 3 (PR 3): Edge case Test Suite
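
The workaround mentioned above could look roughly like the following sketch. It assumes the source file only uses `hdmf.common` types, so that `get_manager()` from `hdmf.common` is a suitable build manager; files with other type namespaces would need their own manager. The wrapper name `export_without_links` and the paths passed to it are hypothetical.

```python
def export_without_links(src_path: str, dst_path: str) -> None:
    """Export one Zarr file to another, copying data instead of linking to it."""
    # Imported here so the sketch can be read (and the function defined)
    # without hdmf and hdmf-zarr installed.
    from hdmf.common import get_manager
    from hdmf_zarr.backend import ZarrIO

    with ZarrIO(src_path, manager=get_manager(), mode="r") as read_io:
        with ZarrIO(dst_path, mode="w") as export_io:
            # link_data=False asks the writer to copy datasets rather than
            # create links back to the source file.
            export_io.export(src_io=read_io, write_args={"link_data": False})
```

Usage would be a call such as `export_without_links("file_b.zarr", "file_c.zarr")`, with both paths standing in for real files.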


codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.88%. Comparing base (8ca5787) to head (16876ca).

| Files | Patch % | Lines |
|---|---|---|
| src/hdmf_zarr/backend.py | 71.42% | 3 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev     #194      +/-   ##
==========================================
- Coverage   87.11%   86.88%   -0.23%
==========================================
  Files           5        5
  Lines        1172     1182      +10
  Branches      286      289       +3
==========================================
+ Hits         1021     1027       +6
- Misses        100      103       +3
- Partials       51       52       +1
```


mavaylon1 commented 2 months ago

More Notes: FileA is an HDF5 file and is exported to FileB, a Zarr file. FileA has internal links, external links, and references (which are always internal for us). I remember you saying we don't do external links, but maybe my memory is off. I ask because the export doc talks about external links.

Because the backends are different, everything is copied over. Does that mean that during export, FileB will contain copies of whatever is being externally linked? That also means every internal link and every reference is preserved.

Now I export FileB to FileC, which is still Zarr. However, I add a few containers and append to existing containers. What happens right now in Zarr is that the new data is added correctly, but we also create a link from FileC to FileB if we don't specify copy. This is wrong: it is not what export is supposed to do, since external links can break easily when files are moved (which is why I ask about external links and why they are discussed in the tutorial). What we need to do is copy everything to FileC, preserving internal links and references.

  • If FileA contains an external link to a dataset in FileX, then FileB should also contain an external link to the dataset in FileX.
  • Same for FileB and FileC.
  • If FileB is read, a new external link is added, and then the file is exported to FileC, then it is written as an external link.
  • If FileB is read, a new external link is added, and then the file is exported to FileC with write_args={'link_data': False}, then the linked dataset is copied.
  • This is a very specific, niche case.
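
The rules above, together with the earlier point that export should not create links back to the file being exported, can be summarized as a small decision function. This is only a model of the intended behavior as described in this thread, not code from HDMF or hdmf-zarr; the names `Origin` and `export_action` are hypothetical.

```python
from enum import Enum


class Origin(Enum):
    SOURCE_FILE = "data stored in the file being exported"
    EXTERNAL_LINK = "external link to a third file"


def export_action(origin: Origin, link_data: bool = True) -> str:
    """Return 'link' or 'copy' for a dataset encountered during export."""
    if origin is Origin.EXTERNAL_LINK and link_data:
        # An existing (or newly added) external link is preserved as a link.
        return "link"
    # Data that lives in the source file is always copied, and external
    # links are copied too when link_data is False.
    return "copy"


preserved = export_action(Origin.EXTERNAL_LINK)                  # default write_args
copied = export_action(Origin.EXTERNAL_LINK, link_data=False)    # link_data=False
own_data = export_action(Origin.SOURCE_FILE)                     # never link back
```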

mavaylon1 commented 2 months ago

Goal for this PR:

  1. We want to make sure the external links are just copied when we export, not preserved as links.
    • When exporting Zarr File A to Zarr File B and adding data to existing containers, none of the exported data should be links. (This is what Alessio needs.)

oruebel commented 2 months ago

Just in case this is relevant for this PR: the following test cases mirror tests from HDMF but were disabled in the hdmf_zarr test suite because links on export didn't fully work. If this PR fixes this, then we should also look at updating these tests.

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/base_tests_zarrio.py#L1329-L1521

https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/test_io_convert.py#L993-L1069

mavaylon1 commented 2 months ago

Good to know. I believe my tests are similar if not the same ones. Thanks for pointing this out so we don't have duplicates.

mavaylon1 commented 1 month ago

Related Issues: #179 #205