Open CSSFrancis opened 1 year ago
@bsavitzky I saw you self assigned this. If this is something you are actually interested in I think that this is important enough that I would spend a decent amount of time to make it a reality.
Just let me know. I'm also open to talking about merging functionality with pyxem but that might be a little more difficult.
Hey @CSSFrancis! Thanks for opening the issue - I think it's definitely a worthwhile discussion to have.
I do have some reservations. In a number of cases, the tools or functionalities you've identified are things we've built but haven't been moved into the main branch / PyPI yet. That said - not all of them. My other / primary reservation is the added complexity involved - I know hyperspy has split off a number of its elements and that this has simplified its dependencies some, but it is still a pretty heavyweight package.
Ok, details. First off - py4DSTEM version 0.14, which is currently on the development branch and is scheduled to launch on main/PyPI in two weeks, moves basic data structure and IO tools and management out of py4DSTEM and into emdfile (https://github.com/py4dstem/emdfile), which in turn is now a dependency. emdfile is a standalone, lightweight Python / HDF5 interface, includes arrays with axes metadata of arbitrary dimensions, and handles arbitrary metadata dictionaries, among other things. I believe in v14 the issues you bring up in point (2) are handled with this change. This also means that if we were to integrate, something like modification of the Array class (to inherit from Signal2D?) would need to happen at the level of emdfile. The primary design principle for emdfile was aiming for a simple and lightweight interface for complex structured data. Its current dependencies are h5py, tqdm, numpy, and scipy. (And scipy isn't really needed and likely will get removed soon). Adding hyperspy would be a huge new dependency.
File IO is pretty important, and I think integrating with rosettasciio is very much worth discussion and perhaps/hopefully implementation - we do not currently support enough formats! Integrating with it seems like it would be a lot simpler and easier than integrating with hyperspy, as rosetta appears to be much lighter / has many fewer dependencies, and would get us many of the benefits.
Both matrix decomposition and parallelization are things we've built and included at various points on various branches for various specific functionalities - but as you've correctly identified, we do still lack a unified framework which is clean and sensible enough to be easily applicable across functionalities and on the main branch. I completely agree that including a simple / accessible / unified interface for matrix decomposition methods (NNMF / SVD / PCA / etc) is a very good idea that's worth doing. Even more so, a single more sensible framework for parallelization, device config, and job distribution is very, very important. We have a number of pieces of our code which are now GPU compatible, and a number of pieces which can be distributed to a cluster/multinode system, but it's all pretty piecemeal now. So yes! This needs to be simplified/streamlined. The key thing for me is both a user interface / function call signatures as well as a source code structure that is clean and sensible. My feeling is that we have most of the pieces here, for both matrix methods and parallelization, and what's left is putting them together with a(n as) simple (as possible) interface - and that adding another hefty package's methods or classes to do this may make that more and not less complicated. I could be convinced otherwise, but this is my sense of things now.
Interactive plotting is also something we've gone back and forth on a number of times. @cophus has been suggesting we use napari, and there are quite a few good reasons to go this route - it's a fairly mature and well supported package, is surprisingly lightweight, and would allow us to build on / with many of our existing (static) visualization methods. Whether or not we go this route, we will very likely roll in functionality of this sort in the near future. I don't know if we'll end up using napari or not... I am curious to discuss what including/building these tools through hyperspy integration instead would look like. Either way, broadly speaking, my experience has been that interactive plotting can be a bit of a morass, and whichever solution we go with, I want to be careful about how limiting it is in terms of which systems and environments it's compatible with, and what kinds of additional setup/install burden are entailed. In my mind the worst case scenario here is that we build or integrate tools which enable cool/exciting plotting on some systems and break the package on others. Or which require a whole new set of install/debug/system-specific steps that may dissuade or be inaccessible to some users.
So these are my reservations. I'm not committed to any answer at this point, and truthfully don't know enough about hyperspy's internals to make any careful judgement, but did want to lay that all on the table.
Last: I'm appreciative of you opening the dialogue, and beyond any specifics, I think your point (6) is really important. I would love to support / foster / work towards better collective effort in the community of folks building open source tools in our space. I don't know whether or not that will look like new package dependency relationships or some integration of source codes or something else. I don't think doing that necessarily requires integration, per se, but it could. Even if we don't end up integrating there I think there is still room for more communication, collaboration, and collective effort.
@bsavitzky Thanks for getting back to me. There are a lot of good points here and I would love some additional feedback.
> Hey @CSSFrancis! Thanks for opening the issue - I think it's definitely a worthwhile discussion to have.
>
> I do have some reservations. In a number of cases, the tools or functionalities you've identified are things we've built but haven't been moved into the main branch / PyPI yet. That said - not all of them. My other / primary reservation is the added complexity involved - I know hyperspy has split off a number of its elements and that this has simplified its dependencies some, but it is still a pretty heavyweight package.
I looked at the difference in dependencies and most of them are fairly lightweight, with the exception of a couple of packages that you also have as extra dependencies. Hyperspy loads decently quickly these days (see https://github.com/hyperspy/hyperspy/pull/2850), and a large portion of that load time is related to numpy. That being said, I don't see this as a huge problem.
'dask[array]>=2021.3.1',
'importlib-metadata>=3.6',
'jinja2',
'natsort',
# non-uniform axis requirement
'numba>=0.52',
'numexpr',
'packaging',
'pint>=0.10',
# prettytable and ptable are API compatible
# prettytable is maintained and ptable is an unmaintained fork
'prettytable',
'python-dateutil>=2.5.0',
'pyyaml',
'requests',
'sympy>=1.6',
# UPDATE BEFORE RELEASE
'rosettasciio @ git+https://github.com/hyperspy/rosettasciio#egg=rosettasciio',
'tqdm>=4.9.0',
'traits>=4.5.0',
> Ok, details. First off - py4DSTEM version 0.14, which is currently on the development branch and is scheduled to launch on main/PyPI in two weeks, moves basic data structure and IO tools and management out of py4DSTEM and into emdfile (https://github.com/py4dstem/emdfile), which in turn is now a dependency. emdfile is a standalone, lightweight Python / HDF5 interface, includes arrays with axes metadata of arbitrary dimensions, and handles arbitrary metadata dictionaries, among other things.
Is there a good reason not to contribute this directly to rosettasciio rather than making your own package? This frankly seems like another standalone package which would be required as an extra dependency for other people if they want to use it.
> I believe in v14 the issues you bring up in point (2) are handled with this change. This also means that if we were to integrate, something like modification of the Array class (to inherit from Signal2D?) would need to happen at the level of emdfile. The primary design principle for emdfile was aiming for a simple and lightweight interface for complex structured data. Its current dependencies are h5py, tqdm, numpy, and scipy. (And scipy isn't really needed and likely will get removed soon). Adding hyperspy would be a huge new dependency.
This isn't really the case. After all, rosettasciio has no hyperspy dependency. The Signal2D class is just a duck-typed numpy array that can handle dask or numpy arrays, much like the DataCube object.
> File IO is pretty important, and I think integrating with rosettasciio is very much worth discussion and perhaps/hopefully implementation - we do not currently support enough formats! Integrating with it seems like it would be a lot simpler and easier than integrating with hyperspy, as rosetta appears to be much lighter / has many fewer dependencies, and would get us many of the benefits.
Somewhat, but you don't handle dask arrays particularly consistently across the package. At least every time I have tried to do something, I end up having to load everything into RAM. This is part of the reason why I think you need at least dask as a dependency. Ultimately dask, and not numpy, powers hyperspy. Or at least dask acts as the director while numpy does the actual work.
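That "dask directs, numpy works" division of labor can be sketched with plain dask.array (no hyperspy involved; the array shapes and chunking here are made up purely for illustration):

```python
import dask.array as da

# A lazy 8x8 grid of scan positions, each holding a 64x64 pattern.
# Nothing is computed here - dask only records a task graph.
data = da.ones((8, 8, 64, 64), chunks=(4, 4, 64, 64))

# A virtual bright-field image: mean over the two detector axes.
# Still lazy - dask is "directing", no numpy work has happened yet.
vbf = data.mean(axis=(-2, -1))

# Only compute() dispatches the chunks to numpy, one chunk at a time.
result = vbf.compute()
print(result.shape)  # (8, 8)
```

Because the graph is built before anything runs, the same code works whether the chunks live in RAM, on disk, or across a cluster.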
> Both matrix decomposition and parallelization are things we've built and included at various points on various branches for various specific functionalities - but as you've correctly identified, we do still lack a unified framework which is clean and sensible enough to be easily applicable across functionalities and on the main branch. I completely agree that including a simple / accessible / unified interface for matrix decomposition methods (NNMF / SVD / PCA / etc) is a very good idea that's worth doing. Even more so, a single more sensible framework for parallelization, device config, and job distribution is very, very important. We have a number of pieces of our code which are now GPU compatible, and a number of pieces which can be distributed to a cluster/multinode system, but it's all pretty piecemeal now. So yes! This needs to be simplified/streamlined. The key thing for me is both a user interface / function call signatures as well as a source code structure that is clean and sensible. My feeling is that we have most of the pieces here, for both matrix methods and parallelization, and what's left is putting them together with a(n as) simple (as possible) interface - and that adding another hefty package's methods or classes to do this may make that more and not less complicated. I could be convinced otherwise, but this is my sense of things now.
As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy, I can pretty confidently say this is not as easy as you think. I also think that the moment someone tries to put 5- or 6-dimensional data into py4dstem, everything is going to break.
Let's see if I can give some examples:
def get_single_origin(dp, r=None, rscale=1.2):
    # coordinate grids over the diffraction pattern
    qxx, qyy = np.indices(dp.shape)
    _qx0, _qy0 = np.unravel_index(
        np.argmax(gaussian_filter(dp, r)), dp.shape)
    _mask = np.hypot(qxx - _qx0, qyy - _qy0) < r * rscale
    return get_CoM(dp * _mask)

# Handles numpy, cupy and dask
class DataCube(Signal2D):
    def find_peaks(self, **kwargs):
        return self.map(_find_Bragg_disks_single_DP_FK, ragged=True, **kwargs)

    def fit_origin(self, **kwargs):
        return self.map(get_single_origin, **kwargs)
I can go on if you would like, but at the end of the day, ~10 lines of code make your origin-finding function parallel, convert your find-peaks function to run in parallel, and make both dask compatible. It is also much more legible, and it still works when someone comes along with a 5- or 6-dimensional dataset.
The key is really using dask for everything.
> Interactive plotting is also something we've gone back and forth on a number of times. @cophus has been suggesting we use napari, and there are quite a few good reasons to go this route - it's a fairly mature and well supported package, is surprisingly lightweight, and would allow us to build on / with many of our existing (static) visualization methods. Whether or not we go this route, we will very likely roll in functionality of this sort in the near future. I don't know if we'll end up using napari or not...
So maybe try hyperspy's interactive plotting first; especially after https://github.com/hyperspy/hyperspy/pull/2568, hyperspy can plot data out of memory very effectively. It can also plot up to around 6 dimensions, and we are working on some pretty big speed and visualization improvements (https://github.com/hyperspy/hyperspy/pull/3148). I plot multi-terabyte datasets with hyperspy fairly easily.
> I am curious to discuss what including/building these tools through hyperspy integration instead would look like.
It depends on what you want to do. If you extend the Signal2D class...
# plot static
datacube.plot()

# plot interactive
%matplotlib qt
# %matplotlib widget
datacube.plot()

# adding points from diffraction vectors
markers = hs.plot.markers.point(vectors[:, 0], vectors[:, 1])  # this might take a little bit to get correct
datacube.add_marker(markers)
datacube.plot()  # plot interactive with markers overlaid
> Either way, broadly speaking, my experience has been that interactive plotting can be a bit of a morass, and whichever solution we go with, I want to be careful about how limiting it is in terms of which systems and environments it's compatible with, and what kinds of additional setup/install burden are entailed. In my mind the worst case scenario here is that we build or integrate tools which enable cool/exciting plotting on some systems and break the package on others. Or which require a whole new set of install/debug/system-specific steps that may dissuade or be inaccessible to some users.
So this is part of the reason that the hyperspy bundle exists. I doubt napari will be terribly much easier to work with. It does make running sessions a little more difficult, but having interactive plotting is a huge help. Besides, the different ways to do things exist for a reason (i.e. whether you are running remotely or on a local computer). Because napari is built on top of qt, you can't run things remotely, which makes it pretty much useless for people who are using jupyter on a computing cluster.
> So these are my reservations. I'm not committed to any answer at this point, and truthfully don't know enough about hyperspy's internals to make any careful judgement, but did want to lay that all on the table.
Yeah, I realize I maybe don't fully understand the entirety of py4dstem either, but I can tell you that hyperspy is pretty fast across the board. There are some functions which could probably be 1-2 times faster if they didn't use the map function, but that requires us to write custom code that is hard to read, hard to write, and hard to maintain.
> Last: I'm appreciative of you opening the dialogue, and beyond any specifics, I think your point (6) is really important. I would love to support / foster / work towards better collective effort in the community of folks building open source tools in our space. I don't know whether or not that will look like new package dependency relationships or some integration of source codes or something else. I don't think doing that necessarily requires integration, per se, but it could. Even if we don't end up integrating there I think there is still room for more communication, collaboration, and collective effort.
So this is worth more discussion, but:

- hyperspy:
  a. I really think that it is the right move, but I am a little biased. I know it takes away some of your agency, but there are only a handful of people in the EM community actively developing software, so splitting into different camps tends to hurt a fair bit.
  b. I will try to do a decent amount of the work getting things implemented (or at least seeing if I can get things implemented).
- You should make a generic map function similar to what is in hyperspy. This requires you to use dask, but I can't recommend it enough.
- We can at the very least make a feedstock of 2-D methods for doing electron diffraction which is completely tested and only depends on numpy. In that case LiberTEM can also use them with the UDF function as well. Maybe that is the easiest place to start.
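For what it's worth, a minimal pure-numpy sketch of the navigate-then-apply idea behind such a generic map function (hyperspy's real Signal.map also handles dask chunking, lazy signals, and ragged outputs, none of which is shown here; generic_map is a made-up name):

```python
import numpy as np

def generic_map(datacube, func, **kwargs):
    # Treat the last two axes as the signal (the diffraction pattern)
    # and everything before them as navigation axes.
    nav_shape = datacube.shape[:-2]
    out = np.empty(nav_shape, dtype=object)
    for idx in np.ndindex(nav_shape):
        out[idx] = func(datacube[idx], **kwargs)
    return out

# The same call works unchanged for 4-, 5-, or 6-dimensional data:
data = np.random.rand(3, 4, 16, 16)   # a toy 4D-STEM stack
maxima = generic_map(data, np.max)
print(maxima.shape)  # (3, 4)
```

The point of the pattern is that per-pattern functions like the origin-finding example above never need to know how many navigation dimensions exist.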
Let me make a new issue for the last point.
@CSSFrancis already covered most of the relevant points and I will just add some details.
Napari is very good, particularly for interactivity, and has a very strong community, but it is very much focused on life science workflows where working with layers, labelling, and 3D volumes is very important. Even if it can be used for our application, and I wish that we could use it, it has several drawbacks - for example it pulls in pyarrow and a number of the aws libraries, which I fixed in https://github.com/conda-forge/py4dstem-feedstock/pull/36

As Carter already mentioned, using zarr files (zspy is the counterpart of hspy for the zarr format) is currently by far the best choice for its performance and ease of use - same syntax as h5py.

Maybe one actionable of this discussion is to identify more precisely what the blockers are from py4DSTEM's perspective for using rosettasciio and/or hyperspy and discuss if/how they can be solved?
@ericpre Thanks for articulating some of those points. I've always found qt great but a little frustratingly complex so it's good to put a finger on exactly why.
Matplotlib I think gets a lot of undue hate when it gets compared to newer (fancier) plotting libraries, but I think there is a reason that it remains a solid part of any scientific ecosystem. There are times when it is slow, but I've found there is usually a way to plot things faster if you look into the documentation. The only thing that is a little lackluster is the 3-D plotting tools, but I find 3D plotting, while cool to look at, pretty useless for trying to do any analysis. That's part of the reason I really like hyperspy's format of separating dimensions.
For interactive visualisation of large datasets, there is room for further speed up by caching decoded chunks (https://github.com/zarr-developers/zarr-python/pull/1214). When I tried it 2 years ago, it was very impressive.
This is going to be really cool when it gets merged :)
> Maybe one actionable of this discussion is to identify more precisely what the blockers are from py4DSTEM's perspective for using rosettasciio and/or hyperspy and discuss if/how they can be solved?
+1
Hey folks - thanks for the lively discussion! I'll try to take detail / specific topics one-by-one, and then summarize where our team's thinking is on the high level questions at the end.
> I looked at the difference in dependencies and most of them are fairly lightweight with the exceptions of a couple of packages that you also have as extra dependencies.

> HyperSpy dependency management lets the user decide what type of dependency is required

> So this is part of the reason that the hyperspy bundle exists.
Our extra dependencies are extras intentionally, to keep install as light and simple as possible for users who don't need them (which I believe is a majority). The bundle installer is great - the more granular control this enables is clearly quite useful, and having this available is plainly very powerful for developers and power users. I admire the breadth and control of functionality HyperSpy's installation model makes possible. That said, my preference is for most users to not need to think or worry about dependencies or dependency management at all. For now the dev team discussed and agreed that we are not planning to add HyperSpy as a py4DSTEM dependency at this time.
> Somewhat but you don't handle dask arrays particularly consistently across the package

> As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy I can pretty confidently say this is not as easy as you think

> The key is really using dask for everything
This is another area where we have different models - I think they both have advantages and disadvantages. Your fully dask-ified infrastructure is very cool and I'm excited to see what kinds of things you all do with it (perhaps at the M&M??). I did not mean to suggest that this is or will be easy - 2 years sounds like a pretty reasonable time frame for a task of this magnitude! - what I mean to say is that we have spent time thinking about and discussing distributed and parallelized infrastructure, have built code for this on some individual targeted applications, and are intentionally taking the step of scaling an underlying architecture pretty slowly. A single, uniform approach for distributed processing has advantages and disadvantages. I'd like the approach we land on to enable but not necessitate dask handling everything, for instance. We will definitely be building out infrastructure and a user-facing interface for this in the future, however the exact structure and substructure of such an interface remains TBD. While I think we will not do this with a simple import of HyperSpy, help from interested developers who have spent time building out these sorts of structures would certainly be very welcome :)
> Napari is very good, particularly for interactivity, has a very strong community but it...

> Matplotlib I think gets a lot of undue hate when it gets compared to newer (fancier) plotting libraries but...
These are helpful insights. I'm going to punt on this point for now - mostly because, as I said previously, this topic can be a bit of a morass, and I think there's enough discussion here for now to have addressed the main topic of this issue thread!
> As Carter already mentioned, using zarr files (zspy is the counterpart of hspy for the zarr format) is currently by far the best choice for its performance and ease of use - same syntax as h5py
This is also a good point. Integrating emdfile with zarr might be the way to go here.... not sure!
> Is there a good reason not to contribute this directly to rosettasciio rather than making your own package?
It seems to me they are fairly different pieces of software with fairly different scopes. I'm curious why you think this would be a good idea?
The issue is about integration. Integration might mean a few different things, so I'm going to try to break this down a little into various specific possibilities. Broadly, I'll consider (1) interoperability of file formats, and (2) dependency.
py4DSTEM should support reading HyperSpy files. It would also be great to build in Rosetta for more broad format support - more on that below.
Let "x > y" mean "x is a dependency of y".
HyperSpy > py4DSTEM
The consensus of the dev team was that we're not going to add this dependency now.
py4DSTEM > HyperSpy
It does not sound like this is what you were proposing, but was not addressed explicitly one way or the other, so thought I would just add a note. If the HyperSpy team is interested in adding a py4DSTEM dependency and discussion or support from our team would be helpful, please don't hesitate to let us know.
RosettaSciIO > py4DSTEM
This seems like a good idea!
emdfile > RosettaSciIO
emdfile is very light, and if the rosetta team would like to support the new EMD format, I think importing emdfile.read would be the way to go.
> Hey folks - thanks for the lively discussion! I'll try to take detail / specific topics one-by-one, and then summarize where our team's thinking is on the high level questions at the end.
>
> Details
>
> Dependencies
>
> > I looked at the difference in dependencies and most of them are fairly lightweight with the exceptions of a couple of packages that you also have as extra dependencies.
>
> > HyperSpy dependency management lets the user decide what type of dependency is required
>
> > So this is part of the reason that the hyperspy bundle exists.
>
> Our extra dependencies are extras intentionally, to keep install as light and simple as possible for users who don't need them (which I believe is a majority). The bundle installer is great - the more granular control this enables is clearly quite useful, and having this available is plainly very powerful for developers and power users. I admire the breadth and control of functionality HyperSpy's installation model makes possible. That said, my preference is for most users to not need to think or worry about dependencies or dependency management at all. For now the dev team discussed and agreed that we are not planning to add HyperSpy as a py4DSTEM dependency at this time.
>
> Dask
>
> > Somewhat but you don't handle dask arrays particularly consistently across the package
>
> > As I have spent the last 2 years rewriting code to make this run smoothly in hyperspy I can pretty confidently say this is not as easy as you think
>
> > The key is really using dask for everything
>
> This is another area where we have different models - I think they both have advantages and disadvantages. Your fully dask-ified infrastructure is very cool and I'm excited to see what kinds of things you all do with it (perhaps at the M&M??). I did not mean to suggest that this is or will be easy - 2 years sounds like a pretty reasonable time frame for a task of this magnitude! - what I mean to say is that we have spent time thinking about and discussing distributed and parallelized infrastructure, have built code for this on some individual targeted applications, and are intentionally taking the step of scaling an underlying architecture pretty slowly. A single, uniform approach for distributed processing has advantages and disadvantages. I'd like the approach we land on to enable but not necessitate dask handling everything, for instance. We will definitely be building out infrastructure and a user-facing interface for this in the future, however the exact structure and substructure of such an interface remains TBD. While I think we will not do this with a simple import of HyperSpy, help from interested developers who have spent time building out these sorts of structures would certainly be very welcome :)
I'm giving a talk at M&M on using pyxem/hyperspy to process very large datasets. I can show some of the things that hyperspy/pyxem can do, and we can talk there in person as well. I'm willing to talk about how to do some of these things and review code if you would like, but there isn't terribly much motivation for me to duplicate these efforts in py4dstem; I have my hands full with pyxem most of the time.
> Interactive plotting
>
> > Napari is very good, particularly for interactivity, has a very strong community but it...
>
> > Matplotlib I think gets a lot of undue hate when it gets compared to newer (fancier) plotting libraries but...
>
> These are helpful insights. I'm going to punt on this point for now - mostly because, as I said previously, this topic can be a bit of a morass, and I think there's enough discussion here for now to have addressed the main topic of this issue thread!
Sounds good :)
> zarr
>
> > As Carter already mentioned, using zarr files (zspy is the counterpart of hspy for the zarr format) is currently by far the best choice for its performance and ease of use - same syntax as h5py
>
> This is also a good point. Integrating emdfile with zarr might be the way to go here.... not sure!
If you take one thing away from this discussion, just try to rewrite all of the emdfile code with zarr.
> emdfile and Rosetta?
>
> > Is there a good reason not to contribute this directly to rosettasciio rather than making your own package?
>
> It seems to me they are fairly different pieces of software with fairly different scopes. I'm curious why you think this would be a good idea?
Maybe I don't really understand the emdfile package, but my understanding is that it is for loading .emd files and the associated metadata. In both cases you have some file and you map that file to a dictionary-like object with some associated data. In that case it seems similar to how rosettasci-io works. They seem like different ways of looking at the same problem. More so, if you want to use rosettasci-io to load other file types, it would be good to have emdfile use the same syntax so you don't end up with lots of different ways to handle different i/o patterns. From a purely selfish point of view, I want to be able to open files from py4dstem in hyperspy or pyxem, and vice-versa, without losing metadata.
> High Level
>
> The issue is about integration. Integration might mean a few different things, so I'm going to try to break this down a little into various specific possibilities. Broadly, I'll consider (1) interoperability of file formats, and (2) dependency.
>
> (1) Interoperability
>
> py4DSTEM should support reading HyperSpy files. It would also be great to build in Rosetta for more broad format support - more on that below.
>
> (2) Dependency
>
> Let "x > y" mean "x is a dependency of y".
>
> HyperSpy > py4DSTEM
>
> The consensus of the dev team was that we're not going to add this dependency now.
>
> py4DSTEM > HyperSpy
>
> It does not sound like this is what you were proposing, but was not addressed explicitly one way or the other, so thought I would just add a note. If the HyperSpy team is interested in adding a py4DSTEM dependency and discussion or support from our team would be helpful, please don't hesitate to let us know.
I think I can pretty confidently say that this won't happen as we are actively trying to remove any "domain specific" code from the base package at the moment :)
> RosettaSciIO > py4DSTEM
>
> This seems like a good idea!
>
> emdfile > RosettaSciIO
>
> emdfile is very light, and if the rosetta team would like to support the new EMD format, I think importing emdfile.read would be the way to go.
Ehh... I don't think people will go for adding a dependency to rosettasciio. It would be good to add continuous integration testing and information about the code coverage to emdfile; otherwise it is hard to use as a dependency. Then we could possibly make it an optional dependency? It might be good to make an issue over at rosettasciio to see what the best thing to do is.
Cool - sounds like most of the main points here have been clarified. The only thing that seems outstanding to me is:
> Maybe I don't really understand the emdfile package but my understanding is that it is for loading .emd files and the associated metadata. In both cases you have some file and you map that file to a dictionary-like object with some associated data. In that case it seems similar to how rosettasci-io works. They seem like different ways of looking at the same problem.
From the emdfile README: "emdfile is a Python package defining write and read functions and a set of classes which together interface between EMD 1.0 files and Python runtime objects. The classes are designed to quickly build, save to, and read from filetree-like representations of data and metadata, and is built on HDF5+h5py." It is an interface between Python runtimes and a particular HDF5 specification called EMD1.0. The Python objects it creates are designed to mirror H5 tree hierarchies, allow building / manipulating such trees in the runtime rather than on file, and to be extensible to enable EMD-like read/write to any downstream code or packages which inherit from these classes.
From the rosetta README: "The Rosetta Scientific Input Output library aims at providing easy reading and writing capabilities in Python for a wide range of scientific data formats. Thus providing an entry point to the wide ecosystem of python packages for scientific data analysis and computation, as well as an interoperability between different file formats." This describes an interface between Python runtimes and a wide variety of scientific formats.
These are not the same thing!
> In that case it seems similar to how rosettasci-io works.
Yup. Crucially, how is not the same as what. Both packages read and write data. And, these packages have rather different scopes, albeit within the same broad space of scientific data I/O. Rosetta supports I/O for many formats, reading many formats and writing a goodly number of them as well, allowing it to serve as an interchange between different programs with different data formats. That's great! emdfile does not do that. emdfile defines one specific format, and provides tools for working with this format including read/write. It is many, many times smaller than rosetta, and does the one specific thing that it does. Perhaps I'm off target, but it feels like there may be some underlying assumption here that all things that do broadly related things should be one thing. That seems quite incorrect (and perhaps even a little silly) to me!
> They seem like different ways of looking at the same problem.
Sort of! I would agree with this in a very zoomed out sense - they both are for scientists who need to read and write data. That's pretty zoomed out! The subset of use cases these packages address under the umbrella of "data read/write for science" are quite different. emdfile doesn't read or write dozens of formats, and it isn't meant to. It defines a format and provides Pythonic tools for working with that format.
> it would be good to have emdfile use the same syntax so you don't end up with lots of different ways to handle different i/o patterns. From a purely selfish point of view I want to be able to open files from py4dstem in hyperspy or pyxem and vice-versa without losing metadata.
This is where we agree! Making file interoperability simple and easy is the important thing here, imo. We should be able to modify our syntax to conform to your standard. I suspect such changes won't make their way into main/PyPI/conda until py4DSTEM v0.15.
Hi all,
I was just looking at some of the open issues and I wanted to open up a discussion about integrating with hyperspy.
Many of the issues currently open have already been solved, including:

- a generic map function.
- zarr support, which brings major speed improvements - it would be easier to use hyperspy to handle those.
- Hyperspy is splitting out the EELS and EDS analysis, which makes the base package much lighter and more manageable as a dependency. It also has a very nice interface for adding signals and creating add-ons. It also makes interfacing with pyxem much easier by just casting the signal type.
Integrating would be a fair bit of work, but not insurmountable - probably more on the order of a week or two rather than months. Replacing the Array class https://github.com/py4dstem/py4DSTEM/blob/3dfb79d4bd0f9184a23da571d953ee3a823117c4/py4DSTEM/io/datastructure/emd/array.py#L12 with hyperspy's Signal2D would be easy. All of the current attributes have 1:1 mappings to equivalents in hyperspy, and you could easily use the @property decorator to maintain the same syntax.
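As a toy illustration of that last point, the @property pattern can keep py4DSTEM-style attribute names (R_Nx, R_Ny, Q_Nx, Q_Ny) working on top of another class's internals; a plain wrapper stands in for Signal2D here so the sketch runs without hyperspy:

```python
import numpy as np

class DataCube:
    """Stand-in for a Signal2D subclass; only the property
    mapping is the point here, not the base class."""

    def __init__(self, data):
        # axis order (R_Nx, R_Ny, Q_Nx, Q_Ny), following py4DSTEM convention
        self.data = np.asarray(data)

    @property
    def R_Nx(self):
        return self.data.shape[0]

    @property
    def R_Ny(self):
        return self.data.shape[1]

    @property
    def Q_Nx(self):
        return self.data.shape[2]

    @property
    def Q_Ny(self):
        return self.data.shape[3]

dc = DataCube(np.zeros((2, 3, 4, 5)))
print(dc.R_Nx, dc.Q_Ny)  # 2 5
```

In a real integration the properties would read from the base class's axes_manager rather than the raw array shape, but the calling code keeps the old attribute names either way.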