@grst closed this issue 6 months ago
Dear @mbuttner, @burtonrj, @quentinblampey, @sunnyosun, @whitews,
as the developers of packages mentioned above, please let me know what you think. This is only going to work if there's reasonable buy-in from you (i.e. depending on the core package once ready) otherwise we'll end up with "just another" package.
This is also only going to fly if someone takes a lead on maintaining that package (although it'd be great if everyone contributed). @quentinblampey already hinted that he might be willing to take that on. The current scverse core team members can provide guidance and help with code reviews, but we don't have capacity to work on the implementation.
Hi @grst,
First of all, I'm honoured by the mention in the list of fantastic authors above, and it's exciting to see interest in a cytometry solution within the scverse. @mbuttner and I had been discussing this before Christmas and we both agreed that pytometry is the best way forward given its integration with anndata and the scanpy workflow. CytoPy is a bit of a bloated mess and I would prefer to retire it and hand over the reins to a scverse-type package, something I've wanted to drive but life has got in the way. So TL;DR, my vote is for pytometry. I can't commit to solely maintaining that work if it goes ahead, but I can help in some of my spare time.
I did some work with @mbuttner to introduce FlowSOM to pytometry, and that is included in the latest release. It depends upon my consensusclustering package, which implements consensus clustering through resampling. I don't know if scverse already has an implementation of this that you might want to use instead. However, I've noticed that in the last few months the Saeys lab has released a Python version of the FlowSOM algorithm. It's probably best to use their implementation given they are the original authors (https://github.com/saeyslab/FlowSOM_Python).
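For readers unfamiliar with the technique, consensus clustering through resampling (the idea behind the consensusclustering package) can be sketched in a few lines of NumPy. Everything below is illustrative, not the package's actual API; a tiny 1-D k-means stands in for a real clustering algorithm:

```python
import numpy as np

def _kmeans_1d(x, k, rng, n_iter=10):
    """Tiny 1-D k-means, standing in for a real clustering algorithm."""
    centers = rng.choice(x, size=k, replace=False)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels

def consensus_matrix(x, k, n_resamples=30, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of points co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(x)
    together = np.zeros((n, n))   # times a pair landed in the same cluster
    counted = np.zeros((n, n))    # times a pair was co-sampled at all
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = _kmeans_1d(x[idx], k, rng)
        counted[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    # Pairs never co-sampled get consensus 0 rather than a division by zero.
    return np.divide(together, counted, out=np.zeros_like(together),
                     where=counted > 0)
```

Stable pairs (consensus near 1) co-cluster in nearly every resample; the consensus matrix itself can then be clustered or thresholded.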
Just expanding on some of the features you mentioned and adding a few more ideas:
To add to the feature requests: ideally there would be an efficient way to save results so that they can be opened in flow cytometry software (in the past I wrote results back to FCS).
Hi everyone,
I'd like to second @burtonrj's points. I have a similar impression that what @grst describes is largely covered by pytometry. I'm game for bringing everyone to the table to discuss where we can join forces and where it's less useful.
FlowKit focuses more on the pre-processing side, gating functionality and integration with FlowJo (including IO or transfer of gates). That's a great approach and mimics in Python most of the functionality that classic flow data analysis entails. We already had a longer discussion on the integration of FlowUtils and FlowKit functionality for normalization in pytometry (https://github.com/buettnerlab/pytometry/issues/47), which I'd like to refer to here. The bottom line is that we can easily integrate some of the functionality from FlowKit and FlowUtils into anndata/scverse, but not all of it is well-suited or should be re-implemented in scverse. Instead, I think that the existing compatibility functions to anndata and FlowKit are a great way to work with FlowKit.
On the idea of integrating gating functionality: I think it's probably best to implement this as a separate package and keep the core package rather slim. I want to note here that we have been cooking up a tool to infer gating strategies from clustered data, which loops the gating strategy back into the picture (https://github.com/buettnerlab/convexgating) - but this should stay a separate package, just like scyan.
On the note of the readfcs package: @sunnyosun and @falexwolf supported me a lot with re-structuring pytometry and we coordinated the integration with pytometry. Pytometry completely relies on the readfcs package as its data loader.
Overall, I would be happy to invest some more time into the strategy of the scverse for flow data.
Hello everyone,
I'm aligned with @mbuttner and @burtonrj's assessments and happy to help if needed. Just wanted to add a point that we initially decided to have readfcs as a separate upstream I/O package so it's lightweight and free from all analytics dependencies.
Hello all,
It's nice to see so much excitement about this! I agree with all that has been said, and like @canergen I also think that FCS-writing should be in the scope.
Out of what has been said, I see two main options:
1. As everyone said, Pytometry is the closest to the scope. Moving it to scverse can be a great solution, with minor adaptations to include some functionalities from the other packages.
2. Pytometry has quite a lot of dependencies, and I think it could also be great to have a very lightweight package. I was thinking about the following dependencies: `anndata`, `seaborn`, and eventually `fcsparser` (we could also have our own parser to not depend on `fcsparser`). Even with such minor dependencies, we could still do the first three goals listed in the above scope, while the fourth goal (i.e. clustering) could be inside an extra dependency.
If this new lightweight package is developed, we should all update our "larger-scope" packages to use this instead.
Hi everyone,
thank you for the active discussion. I think that @quentinblampey raises an important point on the number of dependencies in the current pytometry package, which we should debate.
> Pytometry has quite a lot of dependencies, and I think it could also be great to have a very lightweight package. I was thinking about the following dependencies: `anndata`, `seaborn`, and eventually `fcsparser` (we could also have our own parser to not depend on `fcsparser`). Even with such minor dependencies, we could still do the first three goals listed in the above scope, while the fourth goal (i.e. clustering) could be inside an extra dependency.
If anything, I think that the visualization capabilities of pytometry should be expanded rather than reduced, and adapted to the specific needs of displaying large numbers of cells and cell densities. The `datashader` dependency in the package is a big one (and does not come without its problems, agreed), but it was the best way we could come up with to visualize flow data without downsampling.
I can see the advantage of the new "lightweight package", which could be the pytometry package essentially without the visualization module. Happy to hear your thoughts.
From my experience, there is no advantage to having a package with very few dependencies by itself. Instead, it very much depends:
But the pure number of dependencies is meaningless. If you depend on 40 tiny, well-tested, well-behaved packages, there is no problem.
My recommendation: we should check if any dependencies meet any of the criteria above and make them optional. That way, everyone can install the package easily and quickly, but advanced functionality becomes available when specifying the extra. E.g. installing `scanpy[leiden]` is something most people want, but scanpy is still functional without the associated dependencies being installed.
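The extras pattern described above can be sketched as a guarded import in the package's code; the `plotting` extra name and `scatter_density` function below are hypothetical, not pytometry's actual API:

```python
import importlib

def optional_import(name):
    """Return the module if it is installed, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# The heavy plotting dependency lives behind an extra declared in
# pyproject.toml and installed via e.g. `pip install pytometry[plotting]`
# (extra name hypothetical).
datashader = optional_import("datashader")

def scatter_density(adata):
    # Fail with an actionable message only when the advanced feature is used,
    # so a base install stays small and always importable.
    if datashader is None:
        raise ImportError(
            "scatter_density needs datashader; install the 'plotting' extra"
        )
    # ... datashader-based rendering would go here ...
```

The key design point is that the import failure is deferred to the call site, so users who never plot pay no install cost.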
Just to add on the data visualisation side of things:
I think that plotting should be part of the package. I tend to side with @flying-sheep that dependencies we deem too heavy could be optional. Datashader is great - it would be great to have that in scanpy eventually (see https://github.com/scverse/scanpy/issues/2656) but we might still need it in the flow cytometry package as an interim solution.
> The datashader dependency in the package is a big one (and does not come without its problems, agreed), but it was the best way we could come up with to visualize flow data without downsampling.
FWIW, if all you need is the equivalent of `pytometry.plotting.scatter_density`, you could use a non-datashader solution here. Even `np.histogram2d` could be viable.
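A minimal sketch of such a datashader-free density scatter (function and parameter names are illustrative, not pytometry's actual implementation): bin the events on a 2-D grid and show log-scaled counts per bin instead of drawing millions of individual points.

```python
import numpy as np

def density_image(x, y, bins=256):
    """Bin events on a 2-D grid; log-scale so sparse bins stay visible."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    return np.log1p(counts), xedges, yedges

# A million synthetic events reduce to a fixed-size 256x256 image.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 1_000_000))
img, xe, ye = density_image(x, y)
# `img` can then be rendered with matplotlib's imshow/pcolormesh.
```

Memory and render time depend only on the grid size, not the number of events, which is the same property that makes datashader attractive.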
> But the pure number of dependencies is meaningless. If you depend on 40 tiny, well tested, well behaved packages, there is no problem.
I don't want to sidetrack too much, but I would push back on this. If you think people may want to use your package in a secure environment (e.g. a hospital), the number of dependencies and the publishers of those dependencies can absolutely be an issue, especially in light of recent supply-chain attacks on `xz` and PyPI.
If some of the dependencies do weird stuff like upper-bounding their own dependencies’ versions, that makes things complicated.
Weird behavior can also include: being aggressive on lower bounds, having a solo developer who disappears, or being pulled for some reason. And how many packages do you need to be evaluating and staying on top of?
Greetings all,
It's great to see this discussion on unifying the Python cytometry development efforts. Reading through the comments here gives me déjà vu from past brainstorming sessions on implementing cytometry data models and functions. @grst references FlowKit, but that package came later, building on previous work. Some of that history might be relevant to this discussion...
I currently maintain FlowIO, FlowUtils, and FlowKit. FlowIO & FlowUtils were originally split off from the `fcm` library I inherited here at Duke some 10 years ago now. I think `fcm` was the first Python package to read FCS files; however, it also bundled a heavy dependency stack (including PyCUDA) to support a hierarchical Dirichlet clustering algorithm. As a result, it was a bear to maintain and install, especially just to read FCS files, so we decided to split it up.
FlowIO didn't need to interact with event data, so no dependencies were required (not even NumPy). As the package name implies, it also writes FCS files. It is a simple package with one FlowData class and a few functions. I admit there are pros & cons to this lean approach, but it has proven very useful.
FlowUtils became home to utility functions related to compensation and transformation. Later it would expand, but it always focused on well-defined functions that would not likely change over time and/or would benefit from optimization. It has C extensions to make things as fast as possible, written in pure C to reduce dependencies. The C extensions mean users would need compilers, so pre-built binaries on PyPI are needed. Having well-defined functions that don't change allows for a stable package where pre-built binaries can be offered without too much of a maintenance burden across the combinations of operating systems & Python versions. It stayed a collection of functions rather than classes for ease of testing, and it can be easily wrapped by classes (see FlowKit.transforms).
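To give a flavor of the kind of stable, well-defined transformation function such a utility library collects, here is the common arcsinh transform in plain NumPy. This is an illustration, not FlowUtils's actual API, and the cofactor values are conventional choices rather than library defaults:

```python
import numpy as np

def asinh_transform(x, cofactor=150.0):
    """Arcsinh transform commonly used for cytometry data.

    Linear near zero, logarithmic for large values, so negative and
    near-zero events remain well-behaved. Cofactors around 150 (flow)
    or 5 (CyTOF) are conventional choices.
    """
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)
```

Such functions have a fixed mathematical definition, which is exactly what makes shipping them as pre-built, optimized binaries low-maintenance.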
Converting those FlowData lists to NumPy arrays, then applying compensation and transforms... it was too cumbersome. We wanted something more convenient, but we also had another idea: if we could get event data for just the live single cells, or, for another project, just the cytokine-positive populations, then we could offer better statistical analysis, and often a given clustering algorithm would perform much better on this cleaner subset of data. This was the prize for all the gating code: those algorithms just work so much better given clean data. Yes, there are other concerns (batch effects, etc.), but this is one step closer. This was the goal for FlowKit, with the caveat that we didn't want to bundle those algorithms directly into FlowKit (to avoid the volatility of chasing trends).
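The compensation step mentioned above essentially amounts to multiplying the raw event matrix by the inverse of the spillover matrix; a toy NumPy sketch with made-up values (real spillover matrices come from single-stain controls or the FCS metadata):

```python
import numpy as np

# Toy spillover matrix: rows = fluorophores, cols = detectors; off-diagonal
# entries are the fraction of signal bleeding into a neighboring detector.
spillover = np.array([[1.00, 0.10],
                      [0.05, 1.00]])

raw = np.array([[100.0, 20.0],     # events x channels, made-up values
                [ 50.0, 60.0]])

# Compensation undoes the mixing: each channel then reports (approximately)
# only its own fluorophore's signal.
compensated = raw @ np.linalg.inv(spillover)
```

Multiplying `compensated` by `spillover` recovers the raw data, which is a handy sanity check for any implementation.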
There's quite a lot more I could add about what I've learned, what worked and what didn't, but wanted to chime in with some of the history and add that I would also be willing to help out with a scverse flow project. I do have ongoing plans for the libraries I maintain, next up being FCS 3.2 support in FlowIO. Another idea is to add AnnData export to FlowKit, maybe as an optional dependency. Lots of possibilities! Looking forward to more discussions.
Kind regards, Scott
Hi @whitews,
thanks for your insights! How does `FlowIO` relate to `FCSParser`, which I believe is used by `readfcs`/`pytometry` at the moment?
I think there has been some consensus to move forward with `pytometry` within scverse, and of course any input from your side is welcome and interoperability with your set of packages highly appreciated :)
Hi Gregor,
There's no relation between FlowIO and FCSParser that I am aware of. FlowIO is actively maintained since it is used by FlowKit. The same for FlowUtils.
As mentioned above, since FlowIO is targeted as a reader/writer, I recommend using the `Sample` class from FlowKit for any analysis on event data, since preprocessing is taken care of and it has convenient methods for compensation, transformation, and basic visualization. Using FlowKit also provides support for the standard gates defined in the GatingML 2.0 specification, with a gating strategy implementation that a lot of thought and testing went into. And you'd get FlowJo wsp parsing. I know the traditional gating stuff isn't the aim of this proposed project, but for the reasons I stated above, my opinion is that these can be powerful tools when used together. Ok, that's my last sales pitch for FlowKit!
In any case, I'm happy to help in any way I can. -Scott
Pytometry has been moved to scverse/pytometry and @quentinblampey has been added as an additional maintainer 🎉
I encourage everyone here to open issues in the pytometry repo to discuss improvements and further directions, also with respect to interoperability with @whitews's set of tools!
Closing this issue, as a way forward has been decided and more technical discussions can happen in the pytometry repo.
Thanks @grst, I'll open a few issues in the next few days!
As discussed in the last governance meeting, I propose to add a new core package for dealing with cytometry data (e.g. flow cytometry, CyTOF).
Reasoning
There are already several packages for flow cytometry in Python:
Many functions are redundant between those packages. It would be great to have the developers maintain them jointly in a central place, to avoid duplicated effort and let everyone benefit from speed optimizations.
Scope
(details to be discussed)
The package should include
The package should NOT include
All functions should be scalable to tens of millions of cells (ideally hundreds of millions). Partly, this can already be achieved by AnnData's new out-of-core support and GPU support through rapids-singlecell. In other cases, subsampling approaches might be useful.
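Where out-of-core processing isn't enough, uniform random subsampling is the simplest fallback; a minimal sketch (function name and signature hypothetical, not part of any of the packages above):

```python
import numpy as np

def subsample_events(X, n_obs, seed=0):
    """Return a uniform random subset of rows, preserving row order.

    Keeps downstream plotting/clustering tractable when X holds tens of
    millions of events; a fixed seed makes the subsample reproducible.
    """
    if X.shape[0] <= n_obs:
        return X
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_obs, replace=False)
    return X[np.sort(idx)]
```

The same index-based approach applies unchanged to an AnnData object (subsetting by `adata[idx]`), since AnnData supports integer-array indexing on observations.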
Next steps
If we decide to move forward with this, we'd first create a repository in the scverse namespace where such a package can be developed. To that end, the best implementations of the respective functions could be "stolen" from the packages listed above. Once the package is reasonably mature, it can be promoted to a core package and advertised as such.
Alternatively, `pytometry` could be moved to the scverse organization and serve as a starting point (IMO it fits the scope outlined above best, although I think there are more efficient implementations in the other packages for some steps).