Splitting HyperSpy - Githubissues

francisco-dlp commented 8 years ago

We define HyperSpy as "HyperSpy is an open source Python library which provides tools to facilitate the interactive data analysis of multi-dimensional datasets that can be described as multi-dimensional arrays of a given signal". However, the current project does much more than that. For example, as pointed out in #804, we are maintaining an elements database, we provide specialised tools for EELS and EDX and, in doing so, we also implement things (e.g. EELS cross sections) that could well be part of an EELS library.

There is a chance that, if we split the project into multiple subprojects, those subprojects may develop communities around them that are wider than HyperSpy's current community because, to most, HyperSpy at present looks like an EM data analysis toolbox. Also, I think that it could make hyperspy more attractive to researchers from other fields, who could then lead their own subprojects in their field of interest.

So, what about splitting the project e.g. as follows:

hyperspy-core: basically Signal, Spectrum, Image
hyperspy-visual: plotting features
hyperspy-fitting: multi-dimensional curve fitting
hyperspy-learn: machine learning tools
hyperspy-tem: all of the current TEM features.
Other more specialised projects that have nothing to do with multi-dimensional data analysis but that other hyperspy-* projects depend on, e.g.:
- elements
- eels_xs
- tem_fileformats
- etc

vidartf commented 8 years ago

Just a minor comment to get the ball rolling: hyperspy-tem -> hyperspy-em, and include SEM :)

Other than that, most of these seem like just moving existing code around, except maybe "tem_fileformats". If we separate this one, wouldn't we have to create a much more rigorous interface? Or can we use the one we have now?

magnunor commented 8 years ago

I agree, especially since people want to add more diverse types of functionalities https://github.com/hyperspy/hyperspy/issues/787 . So this is a good way of reducing the "bloating".

I'm guessing this would also clean up the issue tracker a little bit, since each sub-project would have its own.

How would it work in practice? Each sub-project would have its own project in hyperspy, like hyperspy/hyperspy-core, hyperspy/hyperspy-visual, ...?

dnjohnstone commented 8 years ago

In #804 @francisco-dlp mentioned that there are "advantages and disadvantages" to splitting the code -- could someone briefly explain what the disadvantages are?

I guess the main thing is something to do with how you get the separate projects to be combined back together for the user who will likely want an "out of the box" solution as far as possible?

I'm also unclear as to what this would mean for a new contributor wanting to add a totally new signal class and associated functionalities - is it possible to do something like a pull-request that creates a totally new project within an organisation that you are not already a member of? e.g. if we had the projects Francisco suggests how would I add a new project hyperspy-raman?

jeinsle commented 8 years ago

I would echo Duncan's comments that splitting will possibly dilute the overall hyperspy product. Sorry if that is a bit marketing, but that is my original background before academia. Looking at the hyperspy environment, the strength is that all the tools are in one library. I would suggest that the stronger way to make hyperspy understood as a general tool and not just a tool for EM is to start demonstrating it working on non-EM datasets. Breaking it into smaller projects/libraries will, I think, just expose it as a tool using other better known python libraries and not maintain what is unique about it.

just my 2p on the matter.

j


vidartf commented 8 years ago

With regards to the con of splitting it up, the split can be made in such a way as to be mostly invisible to the end user. It's similar to how e.g. Microsoft Word and PowerPoint share some common code among themselves, which is invisible from the end users' point of view. The only practical difference I guess would be the location of imports (major API change), meaning most existing code will need to be updated. In most cases this will hopefully only be a single line of code that needs updating (e.g. import hyperspy.api as hs -> import hyperspy.em as hs, since most existing users will be electron microscopists).

Either way, I'm not sure this is something we want to include with v0.9.? Seems to me this change should be introduced early in a development cycle (directly after 0.9?) in order to give it time to mature (i.e. figure out a workable interface between any sub-packages).
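One way to soften the import change described above would be a small fallback helper that tries the new location first and then the old one. This is only a sketch under stated assumptions: hyperspy.em is the hypothetical module name from the comment, import_first is an illustrative helper that is not part of HyperSpy, and the demonstration line uses standard-library names so the snippet runs without HyperSpy installed.

```python
import importlib


def import_first(*module_names):
    """Return the first module in `module_names` that can be imported."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


# Demonstration with standard-library names only: the first candidate does
# not exist, so the helper falls back to the second one.
mod = import_first("no_such_module_for_demo", "json")
print(mod.__name__)

# Hypothetical use after a split ("hyperspy.em" is an assumed name):
# hs = import_first("hyperspy.em", "hyperspy.api")
```

User scripts would then keep working on either side of the split, at the cost of one extra helper line instead of a plain import.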

vidartf commented 8 years ago

With regards to package organization, since we want to keep the name hyperspy in everything, I would recommend creating a namespace package:

For Python 3.3 onwards, this means each project would have a folder hyperspy without __init__.py, in which the different packages are kept as sub-directories. See PEP 420.
For prior versions, each project would still contain a folder hyperspy with sub-dirs, but it needs to contain an __init__.py that declares it to be part of a namespace package using pkg_resources.declare_namespace(). This would also suggest to name the projects hyperspy.core, hyperspy.em etc, but is not strictly necessary. For an example see pyQode.

Again, see PEP 420 for an in-depth discussion of the issue.
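The PEP 420 layout described above can be tried out with a short script. This is a minimal sketch, not the project's actual layout: it builds two throwaway "projects" in a temporary directory, each contributing one sub-package to a shared namespace. The namespace is called hyperspy_demo (a made-up name, chosen so it cannot clash with an installed hyperspy), and the project and sub-package names are likewise illustrative.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_project(root, project, subpackage, body):
    """Create <root>/<project>/hyperspy_demo/<subpackage>/__init__.py.

    The hyperspy_demo folder itself gets NO __init__.py, which is what
    makes it a PEP 420 namespace portion.
    """
    project_dir = Path(root) / project
    package_dir = project_dir / "hyperspy_demo" / subpackage
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(body)
    return project_dir


root = tempfile.mkdtemp()
core_dir = make_project(root, "hyperspy-core", "core", "NAME = 'core'\n")
em_dir = make_project(root, "hyperspy-em", "em", "NAME = 'em'\n")

# Stand-in for installing both projects: put each project root on sys.path.
sys.path[:0] = [str(core_dir), str(em_dir)]
importlib.invalidate_caches()

import hyperspy_demo
import hyperspy_demo.core
import hyperspy_demo.em

# Both sub-packages are importable under the one top-level name, and the
# namespace's __path__ lists one directory per contributing project.
print(hyperspy_demo.core.NAME, hyperspy_demo.em.NAME)
print(len(list(hyperspy_demo.__path__)))

# For Python versions before 3.3, each hyperspy_demo folder would instead
# contain an __init__.py with the single line:
#   __import__("pkg_resources").declare_namespace(__name__)
```

One caveat worth keeping in mind: a regular package with the same name (a folder that does contain __init__.py) anywhere on sys.path takes precedence over the namespace portions, so every project sharing the name has to follow the same convention.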

dnjohnstone commented 8 years ago

Sorry if my message was a little unclear, I think this change affects different people differently, broadly as follows:

Users -- should see no difference, everything will still appear in one place

Developers -- will have to worry about what @vidartf describes, which is what I initially meant; I just didn't have any idea how that is usually done. From what you say, it doesn't look too bad to me.

Contributors -- I'm still unclear on whether the contribution scheme would change significantly (i.e. whether there's still a pull request scheme) if the creation of new signal classes and specialised tools required the creation of a totally new package?

dnjohnstone commented 8 years ago

I should also say that from a development point of view and a contribution point of view I think splitting would be very beneficial - provided the contribution scheme is still clear. The question being framed in the negative sense was more to make sure we try to think of problems before finding them by experience...

magnunor commented 8 years ago

The installers on the HyperSpy webpage should still contain the "full" install, with all the sub-projects seamlessly integrated. So for the average end user these changes shouldn't be noticeable.

For installing through pip I guess we can split it into several sub packages? core, visualization, em, ... . And have a hyperspy-full package? Is that doable?

Similarly, for the Debian package (when it gets added), we can have these separate packages. With a hyperspy-full for installing all the sub-projects.

Keeping a git directory within another git directory can be handled by using git-submodule. I haven't used it before, but it looks fairly straightforward. I'm guessing the main issue there is adding some good explanation in the contributors' guide, so as to avoid raising the barrier for contributing.

vidartf commented 8 years ago

@magnunor: What is git-in-git useful for? I guess it could be quite a hassle to set up a boilerplate hyperspy development environment (check out these 5 repositories, and install them all with pip install -e), and that the git-in-git could help with that? Would it also help with synchronizing branches across projects?

magnunor commented 8 years ago

@vidartf, thinking about it a little more, I guess the git-in-git is not really necessary.

I was thinking if we wanted to keep the current layout with for example all the drawing functions in a folder inside the main hyperspy folder, we would have to have the "visualization" sub-project inside the folder of the "core" sub-project.

But I'm guessing all this can be handled by having all the sub-projects in $PYTHONPATH or some other python magic.

I agree with @vidartf about the timescale, fairly quickly after 0.9 seems like a good time. Since it should allow us to iron out any issues before the next release.

francisco-dlp commented 8 years ago

@vidartf, yes, splitting the file formats will require changing the interface and splitting all hspy related parts. @mfm24, who has an interesting implementation of a dm3/4 reader/writer, suggested doing this a while ago in a private conversation. I think that it makes sense to split the file formats so that other projects can use them without installing hyperspy and all its dependencies. This may also encourage more contributions, hopefully also from manufacturers, as contributing a file format to a single repository would automatically add support for it to a number of projects. Drawback: it requires effort and there is little to be won in the short term. Regarding PEP 420, yes, that looks like the right tool for the job.

@dnjohnstone wrote:

is it possible to do something like a pull-request that creates a totally new project within an organisation that you are not already a member of?

As I see it, there are several possibilities:

The person requests to be a member of the organization to create the project
The person creates an independent project based on HyperSpy and
- Once it matures, asks for inclusion under the HyperSpy organization. This would give the project more visibility and the convenience of a wider community of developers.
- It stays independent. This may be suitable for people wanting to keep full control on who has write access to their repository.

All alternatives look reasonable to me. The goal is to expand the community. The more people use hyperspy, the more developers there will be polishing the core features, which should benefit all users, including the current TEM community.

@magnunor, yes I agree that it is important to offer the possibility to install everything in one go. That should be easy to achieve.

In what follows I'll summarize some advantages and disadvantages of splitting the code, including the ones that you all have mentioned:

Advantages

Reduce bloating
Clear up issues list
Reduce the number of dependencies of the core packages. For example, just for number crunching there'll be no need to install any plotting and gui packages.
Split the community:
- People can follow the development of the part of HyperSpy that interests them. E.g. someone from a potential remote sensing community may not be interested in the developments in the TEM team.
- Clearer leadership: the developers of a sub-project will manage their own PRs, issues etc. and have their own commits ranking. For example, while active, @pburdet was essentially leading the EDX feature development, but it is difficult to realise that he had this role with the current structure. Clearer leadership of sub-projects should encourage more people to add features to analyse other non-supported signals. This should also ease the task of those maintaining the packages, as they won't be distracted by issues affecting only specialised sub-packages.

Disadvantages

Risk of diluting the project (@jeinsle).
Cohesion. Different, inter-dependent projects will be difficult to manage. Currently, by running the tests we know when we've broken something somewhere. However, if we split the code, we'll need to update all the interdependent projects and run the tests on each of them individually to see if something got broken. I don't know whether Travis can be configured to automate this task, but chances are that the answer is yes.
Splitting the project is a major task with little reward in the short term and with the potential of increasing the burden once it is done. I think that, if the project is to grow beyond the electron microscopy community, splitting it is unavoidable. The question is, when is the right time to do it? Who can find the time to do it? Might this task be suitable for a Google Summer of Code project? (Deadline 19th of February)
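The cohesion concern above (running every interdependent project's tests after a change) could be scripted locally along these lines. This is a rough sketch, not an existing HyperSpy tool: run_test_suites and failed are illustrative helpers, the checkout folder names are hypothetical, and the default command assumes each project uses pytest, which the thread does not state.

```python
import subprocess
import sys
from pathlib import Path


def run_test_suites(checkouts, command=(sys.executable, "-m", "pytest", "-q")):
    """Run `command` inside each checkout and return {folder name: exit code}."""
    results = {}
    for checkout in checkouts:
        completed = subprocess.run(list(command), cwd=str(checkout))
        results[Path(checkout).name] = completed.returncode
    return results


def failed(results):
    """Return the sorted names of the projects whose command exited non-zero."""
    return sorted(name for name, code in results.items() if code != 0)
```

With local checkouts sitting side by side, usage would look like failed(run_test_suites(["hyperspy-core", "hyperspy-em"])), which returns an empty list when every suite passes. A CI service would need the same loop plus a step that installs the development versions of the sibling projects first.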

magnunor commented 8 years ago

I've been thinking about this a bit more, and I think it would make it much easier to expand HyperSpy's functionalities past the (mostly) EELS/EDX functionalities at the moment.

For example: I've made a fairly extensive python program to analyse atomic positions in STEM images, relying on HyperSpy's 2-D modelling for fitting the atomic columns. I feel like this would be very nice to have as part of the HyperSpy "package", but I also feel it would introduce quite a bit of complexity which would make it harder to maintain the core parts of HyperSpy.

Likewise, I'm also working with pixelated STEM at the moment. And I've also developed quite a lot of python programs, again relying on HyperSpy for fitting and modelling. I think this functionality would be very nice to have nicely integrated in HyperSpy.

I feel having these as separate sub-projects would allow us to expand the functionality of HyperSpy, without introducing too much extra complexity into the core part of the project.

tjof2 commented 8 years ago

I've also "used" HyperSpy in my work on time-resolved STEM, in that my independent C++ code has a HyperSpy wrapper (courtesy of @bm424) that means it will take a signal, denoise it using my algorithm, then return it to HyperSpy. Hence I can utilise the useful file I/O part of HyperSpy for different file formats, and also all the plotting/saving etc. I can also make use of align2D for my sequences before denoising.

I've also got some code in the works for atom identification in time sequences (a little different to @magnunor's application for reasons I won't go into), where again HyperSpy fits nicely into my workflow but I also use other packages.

HyperSpy thus becomes my "one stop shop" for EM import, processing and plotting, but when needed I can very straightforwardly send the data to another Python/C++/FORTRAN program for that step.

Perhaps a clearer explanation (beyond just how to create a signal) via an example/tutorial/docs/webpage on how to do this simply would help - marketing HyperSpy as something you can integrate into your workflow in the same way you might do from sklearn import * :-D

francisco-dlp commented 8 years ago

The EM features in HyperSpy keep on growing at an amazing rate. This is obviously very good, but it also solidifies the perception of HyperSpy as an EM library. We've discussed splitting the project extensively above and maybe this is the right time to start taking this seriously. Originally I suggested splitting it into multiple sub-projects and, although in time that may be the right path to take, the task looks overwhelming at present. So, what about the more modest goal of splitting the project into "hyperspy" and "hyperspy-em" to start with? This could be the main milestone of HyperSpy 2.0 and it shouldn't be too hard to achieve. Although less ambitious than the original proposal, it probably is a good compromise of benefits / (effort * drawbacks). Some benefits include:

Making (core) hyperspy more attractive to researchers in other fields.
Reducing the dependencies of (core) hyperspy
Better addressing the EM community with a User Guide, mailing list and website tailored for this community and a github repository which contains EM related code and issues only.

Drawbacks:

None that I can think of.

magnunor commented 8 years ago

Doing the "hyperspy" and "hyperspy-em" split first sounds like a good idea, since it reduces the complexity by not having to sort into many different sub-projects.

In addition, having the simpler split first allows us to sort out the method and "infrastructure" needed to do the split. This should make subsequent splits (if we opt for them) easier, since we'll have more experience with doing these kinds of things.

dnjohnstone commented 8 years ago

HyperSpy & HyperSpyEM sounds like a good idea to me

dnjohnstone commented 8 years ago

Continuing from #1209 -- I'm more and more wondering whether hyperspy-em might be too narrow.

Let's say I write tools to analyse diffraction data - if I apply them to electron diffraction data, OK, it's EM, but much is likely common to XRD. So for code reuse efficiency, do you insist that I make a hyperspy-diff and then have hyperspy-em depend on that? I'm sure there are other examples where code reuse could become possible if we keep things broader.

Also I think that it may make hyperspy less attractive than it could be to users who are most likely really interested in, say, materials characterisation. Most people interested in materials characterisation will use multiple techniques and so could find it particularly appealing to be able to download, install and apply analysis with a similar protocol across a wide range of techniques.

I think it could quite easily become too fragmented with many small packages and making them all compatible to give an overarching characterisation suite could be a very big job.

to266 commented 8 years ago

I'm for fragmenting the overall package, as all the installing issues have already been solved (e.g. extras in setuptools would allow you to do something like pip install hyperspy[all] if you want everything, and pip install hyperspy[core] if you want only the core, with as many sub-packages (and their levels) as we'd like)

From what I can tell it's generally advisable to split packages into independent sub-packages.
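The setuptools extras mechanism mentioned above could be wired up roughly as follows. This is a sketch only: the sub-package distribution names (hyperspy-visual, hyperspy-fitting, hyperspy-learn, hyperspy-em) are taken from the proposal at the top of the thread and do not refer to published packages, and the setup() call is shown as a comment so the snippet runs on its own.

```python
# Each extra maps to the (hypothetical) distribution it would pull in.
EXTRAS = {
    "visual": ["hyperspy-visual"],
    "fitting": ["hyperspy-fitting"],
    "learn": ["hyperspy-learn"],
    "em": ["hyperspy-em"],
}

# "all" is the union of every other extra, so it cannot drift out of sync.
EXTRAS["all"] = sorted({req for reqs in EXTRAS.values() for req in reqs})

print(EXTRAS["all"])

# In setup.py the mapping would be handed to setuptools:
#   from setuptools import setup
#   setup(name="hyperspy", extras_require=EXTRAS)
```

Under this arrangement a plain pip install hyperspy would give only the core, pip install hyperspy[em] would add the EM tools, and pip install hyperspy[all] would install everything in one go.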

magnunor commented 8 years ago

@dnjohnstone, agreed, hyperspy-em seems too narrow. I'm guessing doing a hyperspy-core + hyperspy-X, where X would be something related to material characterization?

francisco-dlp commented 8 years ago

+1 for hyperspy-X!

francisco-dlp commented 8 years ago

Now seriously. I think that these are all very good points. The main objective of the split is to have a hyperspy core package that people from disparate fields feel comfortable using as a base for their own packages. Currently, HyperSpy doesn't fit the bill because to people from e.g. remote sensing it must look bloated. On the other hand, I see that it won't feel that bloated to people from other material characterisation methods as pointed out above. So, it seems reasonable to split into hyperspy and hyperspy-X. What should X be? "Materials characterisation", McHyperSpy? Or maybe "Materials science", MsHyperSpy?

tjof2 commented 8 years ago

hyperspy-matsci?

Keeps the hyperspy-[] format which is useful for installation from repositories etc.
Pronunciation: mat-skee or mat-sigh? :-P

Or hyperspy-mater or similar.

ericpre commented 8 years ago

+1 for HyperSpy-X.

Something about materials characterisation would be a good move. Looking at material characterisation on Wikipedia, these techniques have a lot of similar data processing in common. It will also correspond to a scientific community.

HyperSpy-MatSci sounds very good, but it's also a very broad topic. With such a name, people may expect a library which is also about materials computation, such as pymatgen, MatMethods or others.

HyperSpy-MatCharac (or something similar) could also fit the bill. The name is not as good as HyperSpy-MatSci, though...

francisco-dlp commented 8 years ago

MatSpy, McSpy, MsSpy?

ericpre commented 8 years ago

Quite like MatSpy actually! Simple, short, the Spy part kind of contains the characterisation bit and it has some similarity with HyperSpy in the name construction. However, it will not fit as well as HyperSpy-X with potential future projects? But, as with most things, there will be a need for a compromise!

tjof2 commented 8 years ago

I think losing the "HyperSpy" part of the name runs the risk of dilution as referred to by @jeinsle above.

HyperSpy-MatSci sounds very good, but it's also a very broad topic. With such a name, people may expect a library which is also about materials computation, such as pymatgen, MatMethods or others.

Sure hyperspy-matsci might look pretty general, but that was the problem with hyperspy-em being too narrow!

Really it's a trade-off with it being a memorable, snappy name. Suitable documentation and a well-defined project scope (and README!!!) and the fact it has the hyperspy name as well should identify it as being separate from things like pymatgen.

francisco-dlp commented 8 years ago

I agree. hyperspy-mater or hyperspy-matsci have the advantage of easy discoverability of hyperspy subpackages. However, users will end up finding a nickname for it. MatSpy is something that people may not mind pronouncing. It's not as unique as HyperSpy (4900 entries in google), but it may be unique enough.

Another option is hyperspy-core and hyperspy. This would be the path of least disruption.

tjof2 commented 8 years ago

I think people would probably still call it hyperspy...which is what it is really :-)

ericpre commented 7 years ago

There is a bit of splitting discussion in #455.

francisco-dlp commented 7 years ago

I think that the time has come to split hyperspy for real. I think that the easiest splitting approach would be to split it into a core hyperspy package and specialised signals (e.g. EELS, EDS, etc.). What do others think? I have just drafted an issue to track progress: #1599.

jat255 commented 4 months ago

@ericpre I think this can be closed, no? ;) Or perhaps @francisco-dlp would like the honors

hyperspy / hyperspy

Splitting HyperSpy #821
