hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

"Copy" via "export" is "larger" (10x fold in this silly example) than original! #1187

Closed yarikoptic closed 1 month ago

yarikoptic commented 1 month ago

Follow up to

If we use the same script as provided in #1186 with a non-broken hdmf (3.14.3), we get

❯ /tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb
Copying /tmp/simple2.nwb /tmp/simple2-copy.nwb
Now reading /tmp/simple2-copy.nwb
/tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb  5.06s user 1.36s system 211% cpu 3.033 total
❯ ls -l /tmp/simple2.nwb /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh 189120 Sep  5 15:24 /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh  19664 Sep  5 15:18 /tmp/simple2.nwb

so you can see that the "copied" file is 189k while the original is just 19k. Is that expected/desired/unavoidable?

output of diff -Naur <(h5dump /tmp/simple2.nwb) <(h5dump /tmp/simple2-copy.nwb): http://www.oneukrainian.com/tmp/simple2-h5dump.diff

Original file is produced using this pytest fixture https://github.com/dandi/dandi-cli/blob/HEAD/dandi/tests/fixtures.py#L101

PS feel welcome to reassign to pynwb if the issue is there.

rly commented 1 month ago

@yarikoptic Your dandi pytest fixture writes NWB files without caching the spec. The export call caches the spec by default. I believe that explains all of the diff. If you want to export without caching the spec, you currently cannot do that using pynwb but we are going to remedy that in a quick bugfix to pynwb.

yarikoptic commented 1 month ago

coolio, thanks @rly for the quick response! And confirming on the above example that we now get the same size, with only the ids changed, as requested

❯ /tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb && ls -l /tmp/simple2.nwb /tmp/simple2-copy.nwb
Copying /tmp/simple2.nwb /tmp/simple2-copy.nwb using pywnb 2.5.0.post0.dev15
Now reading /tmp/simple2-copy.nwb
/tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb  3.32s user 2.43s system 229% cpu 2.510 total
-rw-rw-r-- 1 yoh yoh 19664 Sep  6 14:48 /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh 19664 Sep  5 15:18 /tmp/simple2.nwb
❯ diff -Naur <(h5dump /tmp/simple2.nwb) <(h5dump /tmp/simple2-copy.nwb)
--- /proc/self/fd/18    2024-09-06 14:48:31.938598041 -0400
+++ /proc/self/fd/19    2024-09-06 14:48:31.938598041 -0400
@@ -1,4 +1,4 @@
-HDF5 "/tmp/simple2.nwb" {
+HDF5 "/tmp/simple2-copy.nwb" {
 GROUP "/" {
    ATTRIBUTE "namespace" {
       DATATYPE  H5T_STRING {
@@ -45,7 +45,7 @@
       }
       DATASPACE  SCALAR
       DATA {
-      (0): "154bbc4f-4276-47db-bac9-f7cdc8880aa4"
+      (0): "c8b730fc-f3bf-4619-8069-c66f5ff0a9aa"
       }
    }
    GROUP "acquisition" {
@@ -183,7 +183,7 @@
             }
             DATASPACE  SCALAR
             DATA {
-            (0): "db410d65-a49a-4bd8-8ec9-ad6076d272e7"
+            (0): "eb09c10a-6ac9-461b-bb44-5bccd2551a3b"
             }
          }
          DATASET "date_of_birth" {

now I wonder: how do I discover whether the original file had the spec cached, so that I export without caching only if the original didn't have it cached?

rly commented 1 month ago

The spec is cached in the HDF5 NWB file if the root group of the HDF5 file contains an attribute named ".specloc" (the value of which is set to "/specifications" to indicate that the cached spec is in the specifications group).
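For anyone landing here later, the ".specloc" check described above could be sketched roughly like this. It is only a sketch: the function names are made up for illustration, h5py is assumed to be installed, and only the presence of the attribute is tested, not its value.

```python
from typing import Mapping

# Attribute name taken from the comment above.
SPECLOC_ATTR = ".specloc"


def attrs_indicate_cached_spec(root_attrs: Mapping) -> bool:
    """Return True if the root attributes contain the spec-location marker.

    Hypothetical helper: accepts any mapping-like object, e.g. the
    `attrs` of an open h5py.File, or a plain dict for testing.
    """
    return SPECLOC_ATTR in root_attrs


def file_has_cached_spec(path: str) -> bool:
    """Open the file read-only and check its root attributes.

    Hypothetical helper; h5py is imported lazily so that the check
    above stays usable without it.
    """
    import h5py  # third-party dependency, assumed to be available

    with h5py.File(path, "r") as f:
        return attrs_indicate_cached_spec(f.attrs)
```

With the files from this issue, `file_has_cached_spec("/tmp/simple2.nwb")` would be expected to be false for the fixture-written original, since it was written without caching the spec.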

rly commented 1 month ago

Alternatively, you can run pynwb.NWBHDF5IO.get_namespaces(path) which returns an empty dict if there are no cached namespaces.
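A small wrapper around that call might look like the sketch below. The helper name and the injectable `get_namespaces` argument are illustrative assumptions (the latter only exists to make the function testable without pynwb). By default it uses `pynwb.NWBHDF5IO.get_namespaces` as mentioned above.

```python
from typing import Callable, Optional


def source_has_cached_spec(
    path: str,
    get_namespaces: Optional[Callable[..., dict]] = None,
) -> bool:
    """Return True if the file at `path` reports any cached namespaces.

    Hypothetical helper. `get_namespaces` defaults to
    pynwb.NWBHDF5IO.get_namespaces, which returns an empty dict
    when no namespaces are cached in the file.
    """
    if get_namespaces is None:
        # Imported lazily; pynwb is assumed to be installed.
        from pynwb import NWBHDF5IO

        get_namespaces = NWBHDF5IO.get_namespaces
    # An empty dict is falsy, so bool() maps "no cached spec" to False.
    return bool(get_namespaces(path=path))
```

The result could then drive the export, i.e. skip spec caching only when the source did not have it cached. hdmf's `HDF5IO.export` takes a `cache_spec` argument; whether a given pynwb version forwards it is worth verifying before relying on it.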

rly commented 1 month ago

I believe this issue has been resolved. @yarikoptic please reopen if not.