tlambert03 / nd2

Full-featured nd2 (Nikon NIS Elements) file reader for Python. Outputs to numpy, dask, and xarray. Exhaustive metadata extraction.
https://tlambert03.github.io/nd2
BSD 3-Clause "New" or "Revised" License

Slow creation of ND2File object from large ND2 files #50

Closed · KaSaBe closed this 2 years ago

KaSaBe commented 2 years ago

Description

Thank you for creating this useful package. However, in my use case, creating an ND2File object from a ~30 GB ND2 file is quite slow, taking about 30 s, whereas creating an ND2Reader object with nd2reader takes ~1 s. Reading subsets of the data once the object exists takes a similar amount of time with either package. I would much prefer the ND2File version for its slicing interface and dask compatibility, but the initial overhead of creating a series of ND2File/dask objects is (surprisingly?) large. nd2.imread(filepath, dask=True) has similar overhead, while calling to_dask() on an already-created ND2File takes less than a second. File details are below. Both CPU and drive usage are very moderate while the object is being created, so there does not seem to be a hardware bottleneck.

What I Did

import nd2
arr = nd2.ND2File(filepath)
arr
--> <ND2File at ...: 'Time00010_Channel555 nm,635 nm_Seq0010.nd2' uint16: {'P': 30, 'Z': 61, 'C': 2, 'Y': 2304, 'X': 2304}>
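For reference, a minimal timing sketch of the overhead described above (filepath is assumed to already point at a large, multi-position ND2 file; the exact numbers will depend on hardware and storage):

```python
import time

import nd2

# `filepath` is assumed to point at a large multi-position ND2 file,
# e.g. the ~30 GB file described above.
t0 = time.perf_counter()
f = nd2.ND2File(filepath)  # opening the file is where the ~30 s is spent
print(f"open:    {time.perf_counter() - t0:.1f} s")

t0 = time.perf_counter()
darr = f.to_dask()  # wrapping the already-open file in dask is fast (<1 s)
print(f"to_dask: {time.perf_counter() - t0:.1f} s")

f.close()
```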
shenker commented 2 years ago

I have also run into this issue.

@tlambert03 I think the issue is that there needs to be a way to specify fixup=False (and search_window) in top-level init methods like ND2File() and have that be passed down to _chunkmap.read_chunkmap. Because it is unexpected that file reading hangs on large files (it sort of defeats the purpose of lazy-loading for dask), I'd also suggest that fixup be changed to be False by default (currently it is True by default in _chunkmap.read_chunkmap and _chunkmap.read_new_chunkmap).

tlambert03 commented 2 years ago

Yep, thanks both, I ran into this recently as well. We can definitely speed this up by not double-checking the chunkmap.

tlambert03 commented 2 years ago

(Actually, @jni... this is the reason for that big delay we observed last week: it was greedily performing the chunkmap validation that I originally added to "rescue" corrupt data.)

shenker commented 2 years ago

Awesome, thanks! (Also, happy to submit a quick PR if you're busy.)

(@tlambert03: for context, this week I'm hoping to finally migrate the paulsson lab codebase over to this reader from the hacked-together pickle/memmap-enabled nd2reader fork that I've been using for the past several years...)

tlambert03 commented 2 years ago

I'll take any chance I can get to hook you into the "contributors" column here :joy:

I agree, we should make fixup default to False, but then also create a path for ND2File.__init__ to pass the fixup argument (or we can name it something else, like validate?) down through ._util.get_reader, into _sdk/latest.ND2Reader.__init__, and finally to _chunkmap.read_new_chunkmap.
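A rough sketch of the kind of pass-through being proposed, using simplified stand-ins for the internal layers (the validate name and these signatures are illustrative only, not nd2's actual code):

```python
# Illustrative stand-ins only, not nd2's real internals.

def read_new_chunkmap(path: str, validate: bool = False):
    """Stand-in for _chunkmap.read_new_chunkmap; `validate` replaces `fixup`."""
    chunkmap = {}  # parse chunk offsets from the file here
    if validate:
        pass  # re-check every chunk offset (slow for large or remote files)
    return chunkmap

class ND2Reader:  # stand-in for _sdk/latest.ND2Reader
    def __init__(self, path: str, validate: bool = False):
        self._chunkmap = read_new_chunkmap(path, validate=validate)

def get_reader(path: str, validate: bool = False):  # stand-in for _util.get_reader
    return ND2Reader(path, validate=validate)

class ND2File:  # stand-in for the public nd2.ND2File
    def __init__(self, path: str, validate: bool = False):
        self._rdr = get_reader(path, validate=validate)
```

The key point is simply that the public constructor grows a keyword that defaults to the fast path and is forwarded unchanged down to the chunkmap reader.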

tlambert03 commented 2 years ago

@KaSaBe, @shenker's fix is on PyPI now (nd2 v0.2.3). Can you try it when you get a chance? If you need the conda-forge version, it will be coming a little later.
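For anyone following along, a generic way to confirm which version is installed after upgrading:

```python
# After upgrading (e.g. `pip install -U nd2`), confirm what is installed:
from importlib.metadata import version

print(version("nd2"))  # expect 0.2.3 or later
```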

KaSaBe commented 2 years ago

Thank you for the rapid responses and fixes!

In my hands, v0.2.3 both does and does not work.

With validate_frames=False, loading is indeed very fast, but in my case it unfortunately results in every single frame being "corrupted", i.e. frames offset or overlapped incorrectly. So I now appreciate why you implemented the validation in the first place. With validate_frames=True the frames do read correctly. I also discovered that the time it takes to validate the frames varies by an order of magnitude depending on the source (different network drives vs. local storage), even though very little bandwidth is actually used, so it is perhaps a latency issue. A workaround for me is to keep the files on a fast connection; then the overhead is not too severe.

P.S.: I realize now that the above also means that, with the changed default behavior, nd2.imread(file, dask=True) returns only corrupted frames (with my data) and offers no option to validate them, which could confuse users who encounter this.
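To make the trade-off concrete, here is roughly how the two paths look as of v0.2.3, based on the behaviour described in this thread (validate_frames is the keyword mentioned above; filepath is the same large file):

```python
import nd2

# Fast open: the chunkmap is trusted as-is. With the file discussed here,
# this produced shifted/overlapping ("corrupted") frames.
fast = nd2.ND2File(filepath, validate_frames=False)
fast_dask = fast.to_dask()

# Slower open: the chunkmap is re-validated, and the frames read correctly.
safe = nd2.ND2File(filepath, validate_frames=True)
safe_dask = safe.to_dask()

# nd2.imread(filepath, dask=True) follows the new default (no validation) and,
# as noted above, does not currently expose a validate_frames option.
lazy = nd2.imread(filepath, dask=True)
```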

tlambert03 commented 2 years ago

Ohhh, you're also reading over the network. That helps explain why I haven't seen this: I do have some local 30 GB files (some of which are partially corrupted) and I have never seen long delays with them.

Yeah, it's quite unfortunate that the ND2 format uses these massive single files; it makes it extremely easy for a tiny byte-offset error to propagate garbage frames throughout the dataset.

Would you be willing to share that big dataset via Dropbox or something, so I can have a look?

tlambert03 commented 2 years ago

> Would you be willing to share that big dataset via Dropbox or something, so I can have a look?

Thanks for the files @KaSaBe! In the end, your files were totally fine (they didn't need the validate_frames fix at all)... This turned out to be a latent bug here, which is fixed in #54.