py4dstem / py4DSTEM

GNU General Public License v3.0

Integration with hyperspy #409

Open CSSFrancis opened 1 year ago

CSSFrancis commented 1 year ago

Hi all,

I was just looking at some of the open issues and I wanted to open up a discussion about integrating with hyperspy.

Many of the issues currently open have already been solved including

  1. Parallel processing on all functions including custom ones using the map function.
    • This scales to thousands of cores in the case of peak finding, and we can operate on streaming data.
    • It handles both in-memory and out-of-memory operations using the same function.
    • It also handles rechunking (if necessary) and can be used pretty much out of the box, even in situations where you are limited by your RAM.
    • This issue has been long-standing in py4DSTEM, and using some well-established and tested code would be great.
  2. Metadata Handling
    • Hyperspy has a fairly developed system for handling metadata across many systems. This includes the ability to save axes units and use real units.
    • The Axes manager allows you to go higher than 4 dimensions. Currently py4DSTEM is limited to 4 dimensions, and expanding some of the methods to higher-dimensional datasets requires you to rethink the environment.
  3. More file loading options
    • While the work at https://github.com/hyperspy/rosettasciio does allow py4DSTEM to load a large number of file formats, there are a couple of implementations, such as zarr, that offer major speed advantages, and it would be easier to use hyperspy to handle those.
    • Support for saving and loading vectors.
  4. Interactive plotting
    • Something hyperspy does very well is interactive plotting, and that could easily be added by subclassing the Signal2D class.
    • Plotting of dask arrays was improved in https://github.com/hyperspy/hyperspy/pull/2568 so that the current chunk is cached.
  5. Matrix Machine Learning
    • There are lots of NNMF, PCA, and SVD tools already available that would be good to integrate into py4DSTEM.
  6. Better integration into the community
    • Currently there is a lot of duplicated effort and a lack of sharing, and this has really hindered the microscopy community. Pangeo and astropy are both great because they have the full force of their communities behind them. Currently there are multiple 4D-STEM packages in a very small field with a limited number of active developers. Many of the issues and challenges faced in one place are duplicated in others.
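To make point (5) above concrete: the matrix decompositions all act on the datacube unfolded into a (scan positions) x (detector pixels) matrix. Here is a minimal numpy-only sketch with made-up toy shapes; hyperspy exposes PCA, NMF, and friends behind its `decomposition()` method, but the underlying idea is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((6, 7, 16, 16))        # toy 4D-STEM data: (Rx, Ry, Qx, Qy)

# Unfold to a matrix: one row per scan position, one column per detector pixel
X = cube.reshape(6 * 7, 16 * 16)

# Truncated SVD: keep the 3 strongest components
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :3] * S[:3]                # per-position loadings, shape (42, 3)
factors = Vt[:3].reshape(3, 16, 16)      # detector-space component patterns

# A rank-3 reconstruction approximates the original datacube
approx = (scores @ Vt[:3]).reshape(cube.shape)
print(scores.shape, factors.shape)
```

The same unfolding serves NNMF or PCA; only the factorization call changes.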

Hyperspy is splitting out the EELS and EDS analysis, which makes the base package much lighter and more manageable as a dependency. It also has a very nice interface for adding signals and creating add-ons. It also makes interfacing with pyxem much easier, by just casting the signal type.

Integrating would be a fair bit of work, but not insurmountable - probably more on the order of a week or two than months. Replacing the Array class https://github.com/py4dstem/py4DSTEM/blob/3dfb79d4bd0f9184a23da571d953ee3a823117c4/py4DSTEM/io/datastructure/emd/array.py#L12 with hyperspy's Signal2D would be easy. All of the current attributes have 1:1 mappings to equivalents in hyperspy, and you could easily use the @property decorator to maintain the same syntax.
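As a rough sketch of that @property idea - this is a hypothetical adapter, not py4DSTEM's actual code; the `FakeSignal` stand-in and its axis attributes are made up here so the example is self-contained, with hyperspy's real `Signal2D`/`axes_manager` taking its place in practice:

```python
import numpy as np

class FakeSignal:
    """Toy stand-in for hyperspy's Signal2D: an ndarray plus
    per-axis metadata, loosely mimicking the axes_manager idea."""
    def __init__(self, data, axis_names, axis_units):
        self.data = data
        self.axis_names = axis_names
        self.axis_units = axis_units

class DataCubeAdapter(FakeSignal):
    """Keeps py4DSTEM-style attribute syntax (R_Nx, Q_Ny, ...) working
    on top of the signal object, via @property delegation."""
    @property
    def R_Nx(self):  # real-space scan rows
        return self.data.shape[0]
    @property
    def R_Ny(self):  # real-space scan columns
        return self.data.shape[1]
    @property
    def Q_Nx(self):  # diffraction-space rows
        return self.data.shape[2]
    @property
    def Q_Ny(self):  # diffraction-space columns
        return self.data.shape[3]

dc = DataCubeAdapter(
    np.zeros((4, 5, 16, 16)),
    axis_names=("Rx", "Ry", "Qx", "Qy"),
    axis_units=("nm", "nm", "1/nm", "1/nm"),
)
print(dc.R_Nx, dc.R_Ny, dc.Q_Nx, dc.Q_Ny)  # 4 5 16 16
```

Existing user code that reads `datacube.R_Nx` would keep working unchanged while the storage and axes handling move to the signal class underneath.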

CSSFrancis commented 1 year ago

@bsavitzky I saw you self-assigned this. If this is something you are actually interested in, I think it's important enough that I would spend a decent amount of time to make it a reality.

Just let me know. I'm also open to talking about merging functionality with pyxem but that might be a little more difficult.

bsavitzky commented 1 year ago

Hey @CSSFrancis! Thanks for opening the issue - I think it's definitely a worthwhile discussion to have.

I do have some reservations. In a number of cases, the tools or functionalities you've identified are things we've built but haven't been moved into the main branch / PyPI yet. That said - not all of them. My other / primary reservation is the added complexity involved - I know hyperspy has split off a number of its elements and that this has simplified its dependencies some, but it still is a pretty heavyweight package.

Ok, details. First off - py4DSTEM version 0.14, which is currently on the development branch and is scheduled to launch on main/PyPI in two weeks, moves basic data structure and IO tools and management out of py4DSTEM and into emdfile (https://github.com/py4dstem/emdfile), which in turn is now a dependency. emdfile is a standalone, lightweight Python / HDF5 interface, includes arrays with axes metadata of arbitrary dimensions, and handles arbitrary metadata dictionaries, among other things. I believe in v14 the issues you bring up in point (2) are handled with this change. This also means that if we were to integrate, something like modification of the Array class (to inherit from Signal2D?) would need to happen at the level of emdfile. The primary design principle for emdfile was aiming for a simple and lightweight interface for complex structured data. Its current dependencies are h5py, tqdm, numpy, and scipy. (And scipy isn't really needed and likely will get removed soon). Adding hyperspy would be a huge new dependency.

File IO is pretty important, and I think integrating with rosettasciio is very much worth discussion and perhaps/hopefully implementation - we do not currently support enough formats! Integrating with rosettasciio seems like it would be a lot simpler and easier than integrating with hyperspy, as rosetta appears to be much lighter / has many fewer dependencies, and would get us many of the benefits.

Both matrix decomposition and parallelization are things we've built and included at various points on various branches for various specific functionalities - but as you've correctly identified, we do still lack a unified framework clean and sensible enough to be easily applicable across functionalities and on the main branch. I completely agree that including a simple / accessible / unified interface for matrix decomposition methods (NNMF / SVD / PCA / etc) is a very good idea that's worth doing. Even more so, a single more sensible framework for parallelization, device config, and job distribution is very, very important. We have a number of pieces of our code which are now GPU compatible, and a number which can be distributed to a cluster/multinode system, but it's all pretty piecemeal now. So yes! This needs to be simplified/streamlined. The key thing for me is both a user interface / function call signatures and a source code structure that is clean and sensible. My feeling is that we have most of the pieces here, for both matrix methods and parallelization, and what's left is putting them together with an (as simple as possible) interface - and that adding another hefty package's methods or classes to do this may make that more, not less, complicated. I could be convinced otherwise, but this is my sense of things now.

Interactive plotting is also something we've gone back and forth on a number of times. @cophus has been suggesting we use napari, and there are quite a few good reasons to go this route - it's a fairly mature and well supported package, is surprisingly relatively lightweight, and would allow us to build on / with many of our existing (static) visualization methods. Whether or not we go this route, we will very likely roll in functionality of this sort in the near future. I don't know if we'll end up using napari or not... I am curious to discuss what including/building these tools through hyperspy integration instead would look like. Either way, broadly speaking my experience has been that interactive plotting can be a bit of a morass, and whichever solution we go with, I want to be careful about how limiting it is in terms of which systems and environments it's compatible with, and what kinds of additional setup/install burden are entailed. In my mind the worst case scenario here is that we build or integrate tools for this which enable cool/exciting plotting on some systems, and break the package on others. Or, which require a whole new set of install/debug/system-specific steps which may dissuade or be inaccessible to some users.

So these are my reservations. I'm not committed to any answer at this point, and truthfully don't know enough about hyperspy's internals to make any careful judgement, but did want to lay that all on the table.

Last: I'm appreciative of you opening the dialogue, and beyond any specifics, I think your point (6) is really important. I would love to support / foster / work towards better collective effort in the community of folks building open source tools in our space. I don't know whether or not that will look like new package dependency relationships or some integration of source codes or something else. I don't think doing that necessarily requires integration, per se, but it could. Even if we don't end up integrating there I think there is still room for more communication, collaboration, and collective effort.

CSSFrancis commented 1 year ago

@bsavitzky Thanks for getting back to me. There are a lot of good points here, and I would love some additional feedback.

Hey @CSSFrancis! Thanks for opening the issue - I think it's definitely a worthwhile discussion to have.

I do have some reservations. In a number of cases, the tools or functionalities you've identified are things we've built but haven't been moved into the main branch / PyPI yet. That said - not all of them. My other / primary reservation is the added complexity involved - I know hyperspy has split off a number of its elements and that this has simplified its dependencies some, but it still is a pretty heavyweight package.

I looked at the difference in dependencies and most of them are fairly lightweight, with the exception of a couple of packages that you also have as extra dependencies. Hyperspy loads decently quickly these days (see https://github.com/hyperspy/hyperspy/pull/2850), and a large portion of its import time is related to numpy. That being said, I don't see this as a huge problem.

'dask[array]>=2021.3.1',
'importlib-metadata>=3.6',
'jinja2',
'natsort',
# non-uniform axis requirement
'numba>=0.52',
'numexpr',
'packaging',
'pint>=0.10',
# prettytable and ptable are API compatible
# prettytable is maintained and ptable is an unmaintained fork
'prettytable',
'python-dateutil>=2.5.0',
'pyyaml',
'requests',
'sympy>=1.6',
# UPDATE BEFORE RELEASE
'rosettasciio @ git+https://github.com/hyperspy/rosettasciio#egg=rosettasciio',
'tqdm>=4.9.0',
'traits>=4.5.0',

Ok, details. First off - py4DSTEM version 0.14, which is currently on the development branch and is scheduled to launch on main/PyPI in two weeks, moves basic data structure and IO tools and management out of py4DSTEM and into emdfile (https://github.com/py4dstem/emdfile), which in turn is now a dependency. emdfile is a standalone, lightweight Python / HDF5 interface, includes arrays with axes metadata of arbitrary dimensions, and handles arbitrary metadata dictionaries, among other things.

Is there a good reason not to contribute this directly to rosettasciio rather than making your own package? This frankly seems like another standalone package which would be required as an extra dependency for other people if they want to use it.

I believe in v14 the issues you bring up in point (2) are handled with this change. This also means that if we were to integrate, something like modification of the Array class (to inherit from Signal2D?) would need to happen at the level of emdfile. The primary design principle for emdfile was aiming for a simple and lightweight interface for complex structured data. Its current dependencies are h5py, tqdm, numpy, and scipy. (And scipy isn't really needed and likely will get removed soon). Adding hyperspy would be a huge new dependency.

This isn't really the case. After all, rosettasciio has no hyperspy dependency. The Signal2D class is just a duck-typed numpy array that can handle dask or numpy arrays, much like the DataCube object.

File IO is pretty important, and I think integrating with rosettasciio is very much worth discussion and perhaps/hopefully implementation - we do not currently support enough formats! Integrating with rosettasciio seems like it would be a lot simpler and easier than integrating with hyperspy, as rosetta appears to be much lighter / has many fewer dependencies, and would get us many of the benefits.

Somewhat, but you don't handle dask arrays particularly consistently across the package. At least, every time I have tried to do something I end up having to load everything into RAM. This is part of the reason why I think you need at least dask as a dependency. Ultimately dask, not numpy, powers hyperspy - or at least dask acts as the director while numpy does the actual work.
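To make that division of labor concrete - a minimal sketch, assuming dask is installed and using arbitrary toy array sizes: dask only records a task graph over chunks, and the actual arithmetic is executed by ordinary numpy kernels one chunk at a time, so peak memory stays near the size of a single chunk rather than the whole array.

```python
import dask.array as da

# 16 lazy 250x250 chunks; no data is generated yet
x = da.random.random((1000, 1000), chunks=(250, 250))

col_means = x.mean(axis=0)    # still lazy: dask just extends the graph
result = col_means.compute()  # now numpy runs, chunk by chunk
print(result.shape)           # (1000,)
```

The same pattern works unchanged whether the source is random data, a memory-mapped file, or a multi-terabyte dataset on disk.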

Both matrix decomposition and parallelization are things we've built and included at various points on various branches for various specific functionalities - but as you've correctly identified, we do still lack a unified framework clean and sensible enough to be easily applicable across functionalities and on the main branch. I completely agree that including a simple / accessible / unified interface for matrix decomposition methods (NNMF / SVD / PCA / etc) is a very good idea that's worth doing. Even more so, a single more sensible framework for parallelization, device config, and job distribution is very, very important. We have a number of pieces of our code which are now GPU compatible, and a number which can be distributed to a cluster/multinode system, but it's all pretty piecemeal now. So yes! This needs to be simplified/streamlined. The key thing for me is both a user interface / function call signatures and a source code structure that is clean and sensible. My feeling is that we have most of the pieces here, for both matrix methods and parallelization, and what's left is putting them together with an (as simple as possible) interface - and that adding another hefty package's methods or classes to do this may make that more, not less, complicated. I could be convinced otherwise, but this is my sense of things now.

As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy, I can pretty confidently say this is not as easy as you think. I also think that the moment someone tries to put 5- or 6-dimensional data into py4DSTEM, everything is going to break.

Let's see if I can give some examples:

import numpy as np
from scipy.ndimage import gaussian_filter

def get_single_origin(dp, qxx, qyy, r=1, rscale=1.2):
    # Smooth the pattern, take the brightest pixel as an initial guess,
    # then refine with a masked center of mass around that guess.
    # qxx, qyy are coordinate grids matching dp's shape.
    _qx0, _qy0 = np.unravel_index(np.argmax(gaussian_filter(dp, r)), dp.shape)
    _mask = np.hypot(qxx - _qx0, qyy - _qy0) < r * rscale
    return get_CoM(dp * _mask)  # get_CoM: center-of-mass helper defined elsewhere

# Handles numpy, cupy and dask
class DataCube(Signal2D):
    def find_peaks(self, **kwargs):
        return self.map(_find_Bragg_disks_single_DP_FK, ragged=True, **kwargs)

    def fit_origin(self, **kwargs):
        return self.map(get_single_origin, **kwargs)

I can go on if you would like, but at the end of the day ~10 lines of code makes your origin-finding function parallel, converts your peak-finding function to run in parallel, and makes both dask-compatible. It is also much more legible, and it works when someone comes along with a 5- or 6-dimensional dataset.

The key is really using dask for everything.

Interactive plotting is also something we've gone back and forth on a number of times. @cophus has been suggesting we use napari, and there are quite a few good reasons to go this route - it's a fairly mature and well supported package, is surprisingly relatively lightweight, and would allow us to build on / with many of our existing (static) visualization methods. Whether or not we go this route, we will very likely roll in functionality of this sort in the near future. I don't know if we'll end up using napari or not...

So maybe try hyperspy's interactive plotting first - especially after https://github.com/hyperspy/hyperspy/pull/2568, hyperspy can plot out-of-memory data very effectively. It can also plot up to ~6 dimensions, and we are working on some pretty big speed and visualization improvements (https://github.com/hyperspy/hyperspy/pull/3148). I plot multi-terabyte datasets with hyperspy fairly easily.

I am curious to discuss what including/building these tools through hyperspy integration instead would look like.

It depends on what you want to do. If you extend the Signal2D class...

# plot static
datacube.plot()

# plot interactive
%matplotlib qt
# %matplotlib widget
datacube.plot()

# adding points from diffraction vectors
vectors = hs.markers.Point(vectors)  # this might take a little bit to get them correct
datacube.add_markers(vectors)
datacube.plot()  # plot interactive with markers overlaid

Either way, broadly speaking my experience has been that interactive plotting can be a bit of a morass, and whichever solution we go with, I want to be careful about how limiting it is in terms of which systems and environments it's compatible with, and what kinds of additional setup/install burden are entailed. In my mind the worst case scenario here is that we build or integrate tools for this which enable cool/exciting plotting on some systems, and break the package on others. Or, which require a whole new set of install/debug/system-specific steps which may dissuade or be inaccessible to some users.

So this is part of the reason that the hyperspy bundle exists. I doubt napari will be terribly much easier to work with. It does make running sessions a little more difficult, but having interactive plotting is a huge help. Besides, the different ways to do things exist for a reason (i.e. whether you are running remotely or on a local computer). Because napari is built on top of Qt you can't run things remotely, which for people using Jupyter on a computing cluster makes it pretty much useless.

So these are my reservations. I'm not committed to any answer at this point, and truthfully don't know enough about hyperspy's internals to make any careful judgement, but did want to lay that all on the table.

Yea, I realize I maybe don't fully understand the entirety of py4DSTEM either, but I can tell you that hyperspy is pretty fast across the board. There are some functions which could probably be 1-2x faster if they didn't use the map function, but that would require us to write custom code that is hard to read, hard to write, and hard to maintain.

Last: I'm appreciative of you opening the dialogue, and beyond any specifics, I think your point (6) is really important. I would love to support / foster / work towards better collective effort in the community of folks building open source tools in our space. I don't know whether or not that will look like new package dependency relationships or some integration of source codes or something else. I don't think doing that necessarily requires integration, per se, but it could. Even if we don't end up integrating there I think there is still room for more communication, collaboration, and collective effort.

So this is worth more discussion but:

  1. Think about using hyperspy.
     a. I really think that it is the right move, but I am a little biased. I know it takes away some of your agency, but there are only a handful of people in the EM community actively developing software, so splitting into different camps tends to hurt a fair bit.
     b. I will try to do a decent amount of the work getting things implemented (or at least seeing if I can get things implemented).
  2. At the very least, if you don't adopt hyperspy, you should make a generic map function similar to the one in hyperspy. This requires you to use dask, but I can't recommend it enough.
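A generic map of that sort, reduced to its eager numpy essence, might look like the toy below (illustrative only - the function name `map_patterns` and the shapes are made up, and hyperspy's real `map` builds a lazy dask graph over chunks instead of looping in Python):

```python
import numpy as np

def map_patterns(datacube, func, **kwargs):
    """Apply `func` to every 2-D diffraction pattern in a datacube,
    looping over however many navigation dimensions precede the last
    two (signal) dimensions - so 4D, 5D, and 6D data all work."""
    nav_shape = datacube.shape[:-2]
    out = np.empty(nav_shape, dtype=object)
    for idx in np.ndindex(*nav_shape):
        out[idx] = func(datacube[idx], **kwargs)
    return out

# Example: per-pattern position of the maximum intensity
cube = np.random.default_rng(0).random((3, 4, 8, 8))
argmaxes = map_patterns(
    cube, lambda dp: np.unravel_index(np.argmax(dp), dp.shape))
print(argmaxes.shape)  # (3, 4)
```

Swapping the Python loop for chunk-wise dask execution is what turns this from a toy into an out-of-core, parallel map.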

We could at the very least make a feedstock of 2-D methods for doing electron diffraction that is completely tested and only depends on numpy. In that case LiberTEM could also use them via its UDF interface. Maybe that is the easiest place to start.

CSSFrancis commented 1 year ago

Let me make a new issue for the last point.

ericpre commented 1 year ago

@CSSFrancis already covered most of the relevant points and I will just add some details.

Interactivity

Napari is very good, particularly for interactivity, and has a very strong community, but it is very much focused on life-science workflows where working with layers, labelling, and 3D volumes is very important. Even if it can be used for our application - and I wish that we could use it - it has several drawbacks:

Dependencies:

Large data and interactivity

Maybe one actionable outcome of this discussion is to identify more precisely what the blockers are, from the py4DSTEM perspective, for using rosettasciio and/or hyperspy, and to discuss if/how they can be solved?

CSSFrancis commented 1 year ago

@ericpre Thanks for articulating some of those points. I've always found Qt great but frustratingly complex, so it's good to put a finger on exactly why.

Matplotlib, I think, gets a lot of undue hate when compared to newer (fancier) plotting libraries, but there is a reason it remains a solid part of any scientific ecosystem. There are times when it is slow, but I've found there is always a way to plot things faster if you look into its documentation. The only thing that is a little lackluster is the 3D plotting tools, but I find 3D plotting, while cool to look at, pretty useless for doing any analysis. That's part of the reason I really like hyperspy's format of separating dimensions.

For interactive visualisation of large datasets, there is room for further speed-up by caching decoded chunks (https://github.com/zarr-developers/zarr-python/pull/1214). When I tried it 2 years ago, it was very impressive.

This is going to be really cool when it gets merged :)

Maybe one actionable outcome of this discussion is to identify more precisely what the blockers are, from the py4DSTEM perspective, for using rosettasciio and/or hyperspy, and to discuss if/how they can be solved?

+1

bsavitzky commented 1 year ago

Hey folks - thanks for the lively discussion! I'll try to take detail / specific topics one-by-one, and then summarize where our team's thinking is on the high level questions at the end.

Details

Dependencies

I looked at the difference in dependencies and most of them are fairly lightweight, with the exception of a couple of packages that you also have as extra dependencies.

HyperSpy dependency management lets the user decide what type of dependency is required

So this is part of the reason that the hyperspy bundle exists.

Our extra dependencies are extras intentionally, to keep install as light and simple as possible for users who don't need them (which I believe is a majority). The bundle installer is great - the more granular control this enables is clearly quite useful, and having this available is plainly very powerful for developers and power users. I admire the breadth and control of functionality HyperSpy's installation model makes possible. That said, my preference is for most users to not need to think or worry about dependencies or dependency management at all. For now the dev team discussed and agreed that we are not planning to add HyperSpy as a py4DSTEM dependency at this time.

Dask

Somewhat but you don't handle dask arrays particularly consistently across the package

As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy I can pretty confidently say this is not as easy as you think

The key is really using dask for everything

This is another area where we have different models - I think they both have advantages and disadvantages. Your fully dask-ified infrastructure is very cool and I'm excited to see what kinds of things you all do with it (perhaps at the M&M??). I did not mean to suggest that this is or will be easy - 2 years sounds like a pretty reasonable time frame for a task of this magnitude! - what I mean to say is that we have spent time thinking about and discussing distributed and parallelized infrastructure, have built code for this on some individual targeted applications, and are intentionally taking the step of scaling an underlying architecture pretty slowly. A single, uniform approach for distributed processing has advantages and disadvantages. I'd like the approach we land on to enable but not necessitate dask handling everything, for instance. We will definitely be building out infrastructure and a user-facing interface for this in the future, however the exact structure and substructure of such an interface remains TBD. While I think we will not do this with a simple import of HyperSpy, help from interested developers who have spent time building out these sorts of structures would certainly be very welcome :)

Interactive plotting

Napari is very good, particularly for interactivity, has a very strong community but it...

Matplotlib I think gets a lot of undue hate when it gets compared to newer (fancier) plotting libraries but...

These are helpful insights. I'm going to punt on this point for now - mostly because, as I said previously, this topic can be a bit of a morass, and I think there's enough discussion here for now to have addressed the main topic of this issue thread!

zarr

  • As Carter already mentioned, using zarr files (zspy is the counterpart of hspy for the zarr format) is currently by far the best choice for its performance and ease of use - same syntax as h5py

This is also a good point. Integrating emdfile with zarr might be the way to go here.... not sure!

emdfile and Rosetta?

Is there a good reason not to contribute this directly to rosettasciio rather than making your own package?

It seems to me they are fairly different pieces of software with fairly different scopes. I'm curious why you think this would be a good idea?

High Level

The issue is about integration. Integration might mean a few different things, so I'm going to try to break this down a little into various specific possibilities. Broadly, I'll consider (1) interoperability of file formats, and (2) dependency.

(1) Interoperability

py4DSTEM should support reading HyperSpy files. It would also be great to build in Rosetta for more broad format support - more on that below.

(2) Dependency

Let "x > y" mean "x is a dependency of y".

HyperSpy > py4DSTEM

The consensus of the dev team was that we're not going to add this dependency now.

py4DSTEM > HyperSpy

It does not sound like this is what you were proposing, but it was not addressed explicitly one way or the other, so I thought I would just add a note. If the HyperSpy team is interested in adding a py4DSTEM dependency and discussion or support from our team would be helpful, please don't hesitate to let us know.

RosettaSciIO > py4DSTEM

This seems like a good idea!

emdfile > RosettaSciIO

emdfile is very light, and if the rosetta team would like to support the new EMD format, I think importing emdfile.read would be the way to go.

CSSFrancis commented 1 year ago

Hey folks - thanks for the lively discussion! I'll try to take detail / specific topics one-by-one, and then summarize where our team's thinking is on the high level questions at the end.

Details

Dependencies

I looked at the difference in dependencies and most of them are fairly lightweight, with the exception of a couple of packages that you also have as extra dependencies.

HyperSpy dependency management lets the user decide what type of dependency is required

So this is part of the reason that the hyperspy bundle exists.

Our extra dependencies are extras intentionally, to keep install as light and simple as possible for users who don't need them (which I believe is a majority). The bundle installer is great - the more granular control this enables is clearly quite useful, and having this available is plainly very powerful for developers and power users. I admire the breadth and control of functionality HyperSpy's installation model makes possible. That said, my preference is for most users to not need to think or worry about dependencies or dependency management at all. For now the dev team discussed and agreed that we are not planning to add HyperSpy as a py4DSTEM dependency at this time.

Dask

Somewhat but you don't handle dask arrays particularly consistently across the package

As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy I can pretty confidently say this is not as easy as you think

The key is really using dask for everything

This is another area where we have different models - I think they both have advantages and disadvantages. Your fully dask-ified infrastructure is very cool and I'm excited to see what kinds of things you all do with it (perhaps at the M&M??). I did not mean to suggest that this is or will be easy - 2 years sounds like a pretty reasonable time frame for a task of this magnitude! - what I mean to say is that we have spent time thinking about and discussing distributed and parallelized infrastructure, have built code for this on some individual targeted applications, and are intentionally taking the step of scaling an underlying architecture pretty slowly. A single, uniform approach for distributed processing has advantages and disadvantages. I'd like the approach we land on to enable but not necessitate dask handling everything, for instance. We will definitely be building out infrastructure and a user-facing interface for this in the future, however the exact structure and substructure of such an interface remains TBD. While I think we will not do this with a simple import of HyperSpy, help from interested developers who have spent time building out these sorts of structures would certainly be very welcome :)

I'm giving a talk at M&M on using pyxem/hyperspy to process very large datasets. I can show some of the things that hyperspy/pyxem can do, and we can talk there in person as well. I'm willing to talk about how to do some of these things and to review code if you would like, but there isn't much motivation for me to duplicate these efforts in py4DSTEM; I have my hands full with pyxem most of the time.

Interactive plotting

Napari is very good, particularly for interactivity, has a very strong community but it...

Matplotlib I think gets a lot of undue hate when it gets compared to newer (fancier) plotting libraries but...

These are helpful insights. I'm going to punt on this point for now - mostly because, as I said previously, this topic can be a bit of a morass, and I think there's enough discussion here for now to have addressed the main topic of this issue thread!

Sounds good :)

zarr

  • As Carter already mentioned, using zarr files (zspy is the counterpart of hspy for the zarr format) is currently by far the best choice for its performance and ease of use - same syntax as h5py

This is also a good point. Integrating emdfile with zarr might be the way to go here.... not sure!

If you take one thing away from this discussion, just try to rewrite all of the emdfile code with zarr.

emdfile and Rosetta?

Is there a good reason not to contribute this directly to rosettasciio rather than making your own package?

It seems to me they are fairly different pieces of software with fairly different scopes. I'm curious why you think this would be a good idea?

Maybe I don't really understand the emdfile package, but my understanding is that it is for loading .emd files and the associated metadata. In both cases you have some file, and you map that file to a dictionary-like object with some associated data. In that sense it seems similar to how rosettasciio works. They seem like different ways of looking at the same problem.

More so, if you want to use rosettasciio to load other file types, it would be good to have emdfile use the same syntax so you don't end up with lots of different ways to handle different IO patterns. From a purely selfish point of view, I want to be able to open files from py4DSTEM in hyperspy or pyxem and vice versa without losing metadata.

High Level

The issue is about integration. Integration might mean a few different things, so I'm going to try to break this down a little into various specific possibilities. Broadly, I'll consider (1) interoperability of file formats, and (2) dependency.

(1) Interoperability

py4DSTEM should support reading HyperSpy files. It would also be great to build in Rosetta for more broad format support - more on that below.

(2) Dependency

Let "x > y" mean "x is a dependency of y".

HyperSpy > py4DSTEM

The consensus of the dev team was that we're not going to add this dependency now.

py4DSTEM > HyperSpy

It does not sound like this is what you were proposing, but it was not addressed explicitly one way or the other, so I thought I would add a note. If the HyperSpy team is interested in adding a py4DSTEM dependency and discussion or support from our team would be helpful, please don't hesitate to let us know.

I think I can pretty confidently say that this won't happen as we are actively trying to remove any "domain specific" code from the base package at the moment :)

RosettaSciIO > py4DSTEM

This seems like a good idea!

emdfile > RosettaSciIO

emdfile is very light, and if the rosetta team would like to support the new EMD format, I think importing emdfile.read would be the way to go.

Ehh... I don't think people will go for adding a dependency to rosettasciio. It would be good to add continuous integration testing and code-coverage information to emdfile; otherwise it is hard to use as a dependency. Then we could possibly make it an optional dependency? It might be good to open an issue over at rosettasciio to see what the best thing to do is.
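The optional-dependency route mentioned above is a common pattern: guard the import, and fail with a clear message only when the feature is actually used. A sketch of what that could look like on the rosettasciio side (the `emdfile.read` entry point is taken from this thread; check the emdfile docs before relying on it):

```python
import importlib

def optional_import(name):
    """Return the named module, or None if this optional dependency
    is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

def read_emd(path):
    """Read an EMD 1.0 file via emdfile, if available.

    Note: treating `emdfile.read` as the entry point is an assumption
    based on this discussion, not a documented rosettasciio API.
    """
    emdfile = optional_import("emdfile")
    if emdfile is None:
        raise ImportError(
            "Reading EMD 1.0 files requires the optional 'emdfile' package."
        )
    return emdfile.read(path)

print(optional_import("json") is not None)   # True: stdlib module found
```

This way rosettasciio users who never touch EMD files pay nothing, while the error message tells the others exactly what to install.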

bsavitzky commented 1 year ago

Cool - sounds like most of the main points here have been clarified. The only thing that seems outstanding to me is

Maybe I don't really understand the emdfile package but my understanding is that it is for loading .emd files and the associated metadata. In both cases you have some file and you map that file to a dictionary-like object with some associated data. In that case it seems similar to how rosettasci-io works. They seem like different ways of looking at the same problem.

From the emdfile README: "emdfile is a Python package defining write and read functions and a set of classes which together interface between EMD 1.0 files and Python runtime objects. The classes are designed to quickly build, save to, and read from filetree-like representations of data and metadata, and is built on HDF5+h5py." It is an interface between Python runtimes and a particular HDF5 specification called EMD1.0. The Python objects it creates are designed to mirror H5 tree hierarchies, allow building / manipulating such trees in the runtime rather than on file, and to be extensible to enable EMD-like read/write to any downstream code or packages which inherit from these classes.

From the rosetta README: "The Rosetta Scientific Input Output library aims at providing easy reading and writing capabilities in Python for a wide range of scientific data formats. Thus providing an entry point to the wide ecosystem of python packages for scientific data analysis and computation, as well as an interoperability between different file formats." This describes an interface between Python runtimes and a wide variety of scientific formats.

These are not the same thing!

In that case it seems similar to how rosettasci-io works.

Yup. Crucially, how is not the same as what. Both packages read and write data. And, these packages have rather different scopes, albeit within the same broad space of scientific data I/O. Rosetta supports I/O for many formats, reading many formats and writing a goodly number of them as well, allowing it to serve as an interchange between different programs with different data formats. That's great! emdfile does not do that. emdfile defines one specific format, and provides tools for working with this format including read/write. It is many, many times smaller than rosetta, and does the one specific thing that it does. Perhaps I'm off target, but it feels like there may be some underlying assumption here that all things that do broadly related things should be one thing. That seems quite incorrect (and perhaps even a little silly) to me!

They seem like different ways of looking at the same problem.

Sort of! I would agree with this in a very zoomed out sense - they both are for scientists who need to read and write data. That's pretty zoomed out! The subset of use cases these packages address under the umbrella of "data read/write for science" are quite different. emdfile doesn't read or write dozens of formats, and it isn't meant to. It defines a format and provides Pythonic tools for working with that format.
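To make the "filetree-like representations, extensible via inheritance" point from the emdfile README concrete, here is a toy sketch of the pattern being described. All names here are hypothetical stand-ins, not emdfile's actual API:

```python
class Node:
    """Toy tree node in the spirit of emdfile's runtime classes
    (hypothetical names; not emdfile's actual API)."""
    def __init__(self, name):
        self.name = name
        self.metadata = {}
        self._children = {}

    def add(self, child):
        self._children[child.name] = child
        return child

    def tree(self, indent=0):
        """Return the subtree as indented lines, mirroring an H5 hierarchy."""
        lines = [" " * indent + self.name]
        for child in self._children.values():
            lines += child.tree(indent + 2)
        return lines

# A downstream package inherits the tree behaviour and adds its own
# domain-specific payload - in emdfile, subclasses would also inherit
# EMD read/write.
class DiffractionNode(Node):
    def __init__(self, name, calibration_nm=None):
        super().__init__(name)
        self.metadata["calibration_nm"] = calibration_nm

root = Node("root")
root.add(DiffractionNode("datacube", calibration_nm=0.02))
print("\n".join(root.tree()))
```

The contrast with rosetta follows from this: rosetta maps many external formats into Python objects, while emdfile defines the objects themselves and one format they round-trip through.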

it would be good to have the emdfile use the syntax so you don't end up with lots of different ways to handle different i-o patterns. From a purely selfish point of view I want to be able to open files from py4dstem in hyperspy or pyxem and vice-versa without losing metadata.

This is where we agree! Making file interoperability simple and easy is the important thing here, imo. We should be able to modify our syntax to conform to your standard. I suspect such changes won't make their way into main/PyPI/conda until py4DSTEM v0.15.