hyperspy / hyperspy

Multidimensional data analysis
https://hyperspy.org
GNU General Public License v3.0

Zeiss format Date and Time missing in metadata #2057

Closed: sem-geologist closed this issue 5 years ago

sem-geologist commented 6 years ago

While working on #2056 I noticed that Date and Time are not mapped to hyperspy's metadata. My inspection showed that these parameters are not written to CZ tiff consistently, compared with the other metadata. The workaround needs to be included in the external libraries that hyperspy depends on for tiff reading.
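
(For anyone wanting to check their own files: a minimal sketch, assuming a recent tifffile and a hypothetical file name, of how the Zeiss-specific CZ_SEM tag can be inspected for date/time entries; the exact key names vary between files, which is part of the inconsistency described above.)

```python
import tifffile

# Hypothetical Zeiss SEM/FIB slice; in recent tifffile versions the CZ_SEM
# private tag holds the instrument metadata as a dict.
with tifffile.TiffFile('zeiss_slice.tif') as tif:
    tags = tif.pages[0].tags
    cz = tags.get('CZ_SEM')
    cz = cz.value if cz is not None else {}
    # Date and time are not stored under one fixed key in every file,
    # so just list anything that looks like them.
    for key, value in cz.items():
        if 'date' in str(key).lower() or 'time' in str(key).lower():
            print(key, value)
```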

The progress:

sem-geologist commented 6 years ago

One more consideration: as tifffile is available on PyPI, wouldn't it be a better idea to make it a dependency instead of skimage? (I think the same logic applies to skimage, i.e. just make it a dependency.) Particularly keeping in mind the split of hyperspy: skimage does reading, writing and processing of images, whereas tifffile only does reading and writing.

ericpre commented 6 years ago

Although this would be the proper way to do it, I suspect that scikit-image is packaging it in order to have the C code build correctly. On some platforms, installing tifffile from PyPI will not build the C code if no compiler is available. In any case, if scikit-image is doing this, there should be a good reason; maybe it is just legacy?!

sem-geologist commented 6 years ago

It is actually quite a huge problem: skimage has a rather old tifffile, and because of that hyperspy's io_plugins/tiff.py also has to do all kinds of workarounds for problems which were solved long ago in newer tifffile. For example, I could not work normally with a dataset of 1000 Zeiss FIB slices, as I was constantly running out of memory (16 GB) on a workstation. Loading took much more memory than it theoretically should, and then even simple actions such as aligning images took ages. Falling back to lazy methods because of the memory issues was even more painfully slow. Hyperspy, unfortunately, compares really badly with e.g. ImageJ for loading, aligning and presenting a similar FIB dataset on a 4 GB laptop, and that is a bit disappointing.

I am not going to learn ImageJ; I want hyperspy to work better (I am sure that a proper python project can outperform java). When I inspected what was wrong, I found that the old tifffile (used by skimage) was applying a 48-bit RGB colormap to the 8-bit data array... ghmhm... producing 3-channel 16-bit numpy arrays(!) – 6 times more unnecessary memory IO! Of course these inflated arrays are passed over to HyperSpy, where tiff.py has a workaround to strip away two of the channels. Anyway, with the old tifffile the 8-bit Zeiss image loaded in HyperSpy ends up as a 16-bit array. There are of course other hyperspy issues causing an insane waste of memory and CPU time (e.g. #1398, which I am preparing to rant more about).

Finally, compilers are a problem on windows, and that's why hyperspy comes with a bundled version, or Anaconda can be used (tifffile also exists as a conda package). The problem could be android, ios, or some other platform, but believe me, on those platforms you will have trouble compiling other libraries for hyperspy anyway (e.g. I tried to compile h5py on aarch64 android bionic: impossible). I have started, on a separate private branch, to adapt tiff.py to the newer tifffile. The result is a really nice speedup in loading of tiffs, and finally I can work without lazy, with sane memory usage.

jeinsle commented 6 years ago

quick question: are the FIB-tomography stack images 8-bit native? I know that much of my FIB-tomo work on Zeiss systems is quite a bit bigger due to the bigger bit depth of the images I capture.

sem-geologist commented 6 years ago

How old are the Zeiss systems we are talking about? The problem is that our 4-year-old system saves FIB slices as indexed images (from 0 to 255), where the colour palette table is 48-bit RGB. There is no additional information or resolution, only bullshit zeiss expansion of memory usage.

sem-geologist commented 6 years ago

actually on our system (from 2014) there is an option to save images as 16-bit grayscale, but it warns that no additional information will be saved. It is an artificial conversion of 8-bit to 16-bit... just in case, you know, you have too much RAM in your computer

jeinsle commented 6 years ago

ours is new, but when I started collaborating with Zeiss 3/4 years ago, they were surprised to hear that FEI systems were still 8-bit by default... so I would double check the settings, as I think some of the more recent systems have 'an ability' for 16-bit, and this might be part of your problem... just a thought; I do not have enough direct experience with older Zeiss microscopes.

sem-geologist commented 6 years ago

@jeinsle what is the size of a single slice (in MB) and what resolution (1k, 2k, 3k, 4k)?

jeinsle commented 6 years ago

@sem-geologist I can get that for you tomorrow. I am out of the office today and left the data on a workstation. I think it was roughly a 4k-wide image, with about a 500 MB slice...

sem-geologist commented 6 years ago

well, it is easy to check whether this 16-bit is real dynamic range or faked. Take an image saved as 16-bit, load it as a numpy array, check for unique values, and get the length of that array of unique values. If you get <= 256, you have files artificially bulged by the software; if it is much more than 256, then congratulations on a good microscope.
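
(Something along the following lines would do it; the file name is hypothetical.)

```python
import numpy as np
import tifffile

# Load the supposedly 16-bit slice and count its distinct grey levels.
data = tifffile.imread('slice_16bit.tif')
levels = np.unique(data).size
print(data.dtype, levels)

# <= 256 distinct values in a 16-bit file means the extra bit depth is
# artificial padding; substantially more suggests real dynamic range.
print('artificially bulged' if levels <= 256 else 'real >8-bit dynamic range')
```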

sem-geologist commented 6 years ago

500 MB is probably the whole dataset of the FIB slice collection; a 16-bit slice at 4k should take about 26 MB... unless your machine is taking 128-bit images, and I would not believe something like that exists or would be useful.
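
(The rough arithmetic behind that estimate, assuming a 4096 × 3072 frame; the exact height is a guess at what "4k wide" means here.)

```python
# Back-of-the-envelope size of one uncompressed 16-bit slice.
width, height, bytes_per_pixel = 4096, 3072, 2
size = width * height * bytes_per_pixel
print(size / 1e6, 'MB')  # ~25 MB: tens of MB per slice, nowhere near 500 MB
```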

jeinsle commented 6 years ago

@sem-geologist aye, as I said, I will check tomorrow when I am actually in front of the data. Now that I am thinking about it, I do not think the automated bit of the zeiss software (or fei for that matter... well, maybe the newer versions, but again I have not worked with those) allows for 16-bit saving, but my other, manually saved images are definitely 16-bit, as they are absurdly massive.

cgohlke commented 6 years ago

When I inspected what was wrong, I found that the old tifffile (used by skimage) was applying a 48-bit RGB colormap to the 8-bit data array... ghmhm... producing 3-channel 16-bit numpy arrays(!) – 6 times more unnecessary memory IO!

That's because those files claim to be palette-color images, and according to the TIFF specification the color palettes are 16 bit per color channel. I have removed color mapping from tifffile because almost all scientific formats use colormaps not according to the spec but for visualization.
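
(A small synthetic illustration of what that spec-compliant expansion costs in memory; nothing here is Zeiss-specific.)

```python
import numpy as np

# An 8-bit indexed image, e.g. a 2048 x 2048 FIB slice (4 MiB).
indices = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)

# TIFF palette-colour colormap: 256 entries, 3 channels, 16 bits per channel.
colormap = (np.arange(256, dtype=np.uint16) * 257)[:, None].repeat(3, axis=1)

# Applying the palette yields a 3-channel 16-bit array (24 MiB):
# six times the memory of the original indexed data.
rgb = colormap[indices]
print(indices.nbytes // 2**20, 'MiB ->', rgb.nbytes // 2**20, 'MiB')
```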

sem-geologist commented 6 years ago

@cgohlke, exactly, and I wish hyperspy would use a newer version of your library, so this problem would just go away.

Zeiss, after milling, produces a multipage tiff in addition to the single files. Using the old version, as used in hyperspy, I could not even open such multipage tiff datasets without the lazy=True method on a powerful workstation (16 GB RAM, xeon processors, really nice machine). After I enabled the newest tifffile version available through pip locally on my laptop, I can easily load it all into memory (initially it takes only about 0.5 GB). It is a cardinally different experience: it loads and works smoothly, as it is supposed to, and is faster than ImageJ.

There is however another performance brake, which can be experienced when loading the same dataset as a stack (multiple tiff files); I am 100% sure it is still due to #1398. It comes from original_metadata concatenation; a single multipage tiff does not suffer from that, as tiff.py ignores the metadata of all pages except the first. So after fighting a week with hyperspy, I can finally use it in production (not the latest stable, but a locally modified version). I think my work in cleaning/adjusting io_plugins/tiff.py will be valuable anyway, if/after I can convince skimage to update the version there.
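
(For reference, the two loading paths being compared; the file names are hypothetical, and hs.load with a glob pattern plus stack=True is the usual hyperspy way to build a stack from per-slice files.)

```python
import hyperspy.api as hs

# Single multipage tiff: tiff.py keeps the original_metadata of the
# first page only, so loading stays fast (add lazy=True if RAM is tight).
multipage = hs.load('fib_run_multipage.tif')

# Per-slice files stacked at load time: the original_metadata of every
# file is concatenated, which is where the #1398 slowdown comes from.
stack = hs.load('fib_run_slice_*.tif', stack=True)
```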

sem-geologist commented 5 years ago

@ericpre, how many tiff formats are there in microscopy which use internal compression? As far as I can see, the C code deals only with decompression. @cgohlke, could you elaborate on this a bit more? I see that in the latest versions of tifffile the _tifffile.c is gone / was transformed and expanded into imagecodecs. I see that some kind of windows media format is preventing the thing (imagecodecs) from being compiled on posix. I think I know why skimage still uses such an old tifffile, and I guess it will be impossible to convince them to update if it means losing part of the decompression functionality. It is a bit of a dead end, unless we don't need fancy decompression of tiffs. I also see that imagecodecs has c/pyx implementations as well as a partial pure-python implementation; maybe that would be enough for our needs?...

cgohlke commented 5 years ago

how many tiff formats are there in microscopy which use internal compression

Zeiss LSM optionally uses LZW compression (in a TIFF standard incompatible way). JPEG and J2K compression are common in tissue imaging.

some kind of windows media format is preventing the thing (imagecodecs) from being compiled on posix

There is no (Windows) platform specific code in imagecodecs. You'll have to install/build the documented 3rd party libraries and probably adjust the library names and locations in setup.py to build the Cython extension.

Tifffile.py can be used without the imagecodecs package, but then only zlib deflate compression is supported.

The imagecodecs package has a Python+numpy fallback that implements a limited subset of the functionality of the Cython/C extension.
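
(A quick round-trip to confirm the deflate-only path in a plain tifffile install, without imagecodecs; note that the compression keyword and the accepted values have shifted between tifffile versions, so treat this as a sketch for a recent release.)

```python
import numpy as np
import tifffile

data = np.random.randint(0, 256, (512, 512), dtype=np.uint8)

# zlib/deflate compression is handled by tifffile itself; any other
# codec would require the imagecodecs package to be installed.
tifffile.imwrite('deflate_test.tif', data, compression='zlib')
assert np.array_equal(tifffile.imread('deflate_test.tif'), data)
```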

sem-geologist commented 5 years ago

@cgohlke, by windows I actually meant Microsoft, and the format I am talking about is jpegxr. Are some TIFFs using that technology? Or is imagecodecs aiming for wider support of image formats than tiff?

ericpre commented 5 years ago

@ericpre, how many tiff formats are there in microscopy which use internal compression?

Sorry, I have no idea.

Scikit-image used to regularly update the tifffile.py version it packages, which is very convenient because scikit-image can be installed easily on most systems (if not all!), as far as I understand. If they continue to do so, this is great, and it is worth asking whether they can update their tifffile.py version.

cgohlke commented 5 years ago

... jpegxr. Are some TIFFs using that technology? Or is imagecodecs aiming for wider support of image formats than tiff?

jpegxr is commonly used in Zeiss CZI files.

I think one reason why skimage is slow at updating its vendored tifffile is that skimage.io is to be replaced by the imageio package.

sem-geologist commented 5 years ago

@cgohlke, thanks – that is very valuable information.

ericpre commented 5 years ago

xref https://github.com/scikit-image/scikit-image/issues/1605 and https://github.com/scikit-image/scikit-image/issues/2436#issuecomment-410977172.

For the future, we may want to use imageio directly instead of going through scikit-image, which uses (or will use from 0.15) imageio!

The problem at the moment is that imageio uses an old version of tifffile.py and doesn't package the C code, while scikit-image does...

sem-geologist commented 5 years ago

@ericpre, and it brings me back to the question I asked: how many electron microscope formats use fancy compression (anything other than zlib) in tiff? At least we have no such file in the tiff_test directory, so I guess there is no problem, or we have not yet run into a case where we actually need the C library; the C library is only used for decompression. I looked over the imageio code: they ship tifffile.py without the C code, but they check whether an external tifffile is available and then use that instead of the shipped one. The problem is that the pip-available tifffile is only a source package (at least on posix) and needs to be compiled, so that does not solve much. Furthermore, @cgohlke's updated versions of tifffile are available through pip as tiffile (mind the 2 'f' instead of 3 'f'), where the C code has been moved out to another package, imagecodecs... which I still have to figure out how to install (a simple pip install imagecodecs won't work).

The further problem is that the tifffile api is still not stable and keeps changing (if you switch versions, you won't get the expected results). Fortunately, I have sorted out the changes required in hyperspy to use the most recent version of tiffile.

so the versions:

I don't think that either imageio or skimage is going to keep track of the latest tiffile, as that is a bit of a pain while the api keeps changing. I think we could ship it inside hyperspy, particularly after the io_plugin split. I can commit to maintaining tifffile inside hyperspy, and later in the io_plugins library. I see the direct inclusion of tiffile.py (and, starting with the io_plugin library, of imagecodecs once its license changes to BSD) as the shortest and simplest path to get things done right.

ericpre commented 5 years ago

imageio has a stable API, which should be fine to use, and it seems to be the library for reading images in the scipy/scikit ecosystem, so it should be well supported and future-proof. If we include a tifffile.py in hyperspy, we are going to do the same thing as imageio... I think it would make more sense for you to commit yourself to maintaining tifffile.py in imageio, which should be more than welcome! :wink:

@ericpre, and it brings me back to the question I asked: how many electron microscope formats use fancy compression (anything other than zlib) in tiff?

Maybe we can leave the C code of tifffile.py aside for now, until someone complains about it! I was also a bit confused by tifffile versus tiffile on pypi!

sem-geologist commented 5 years ago

The imageio API may be stable, but tifffile's is not. And the imageio API returns a numpy array; we, however, also need the rich tag data, which we have to parse into metadata and axis parameters. This is one of the differences from imageio, and also where the upcoming Hyperspy io_plugin library has an advantage compared with other available libraries. Hyperspy already suffers from constant numpy and matplotlib API deprecations (which are generally depicted as having stable APIs), producing surprise test failures nearly every month. A tifffile sitting inside the hyperspy dir would be changed only when some feature needs support present in a newer tifffile, or some bug fix needs it. That would bring no surprises. I really hope this is only temporary: one day tifffile will get a stable API, I guess we will then also see wheels for the library, and it will pay off to remove it from hyperspy.external.

I can look further into imageio and try to figure out whether its API is helpful for us. At this moment I have a nearly ready PR with the new tifffile. If imageio would allow us to simplify our library more, then yes, we can consider imageio (maybe it provides even more than tiff that we could use). But at this moment I am not sure whether it would remove complexity, or whether we would just translate one complexity into another.

ericpre commented 5 years ago

The imageio API may be stable, but tifffile's is not. And the imageio API returns a numpy array; we, however, also need the rich tag data, which we have to parse into metadata and axis parameters.

The user API of imageio says that it should be possible to get good control over how metadata is accessed, either through the reader itself or through the metadata attached to the numpy array (imread seems to return a subclass of np.ndarray). This sounds to me like it should work nicely.

If imageio would allow us to simplify our library more, then yes, we can consider imageio (maybe it provides even more than tiff that we could use).

Indeed, I believe this will be the case. It is most likely that the imageio library will do a better job than what we do here for other image formats.

sem-geologist commented 5 years ago

@ericpre, @francisco-dlp, I have inspected the possibilities with imageio. In its current state it is worthless: too little information is exposed. (I tried both ways: using imread and inspecting the meta of the returned array, and using the get_reader way, which does not give anything more than returned_array.meta.) imageio does not return any sensible metadata (except some limited stuff for imagej-format tiffs). No information for scales, offsets, units, operator_name, bank accounts... Yes, at first impression it looks like an attractive kitchen sink with all those available iToster formats, using 3 back-ends... stuff which has nothing to do with electron microscopy. I think if it returned /dev/null instead of .meta, that could at least be useful in some cases. Another struggle is that attaching metadata to the same array would break the lazy methods. Actually, I could not find anything in imageio for lazy loading.
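
(The two ways tried, for the record; imageio v2 API, hypothetical file name.)

```python
import imageio

# Way 1: imread returns an ndarray subclass carrying a .meta dict –
# for these Zeiss tiffs it holds essentially nothing useful.
im = imageio.imread('zeiss_slice.tif')
print(im.meta)

# Way 2: the reader object exposes the same limited dictionary.
with imageio.get_reader('zeiss_slice.tif') as reader:
    print(reader.get_meta_data())
```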

I would go with embedding tiffile.py into hyperspy.external (and later into io_plugins_library.external) while tiffile's API is still unstable. I can commit some of my time to maintaining this.

sem-geologist commented 5 years ago

#2064 is ready.

ericpre commented 5 years ago

#2064 is merged.