Core package for cytometry

scverse / governance

Governance docs for scverse

https://scverse.org/about

BSD 3-Clause "New" or "Revised" License

7 stars 0 forks source link

Core package for cytometry #64

Closed grst closed 6 months ago

grst commented 7 months ago

As discussed in the last governance meeting, I propose to add a new core package for dealing with cytometry data (e.g. flow cytometry, CyTOF).

Reasonsing

There's already several packages for flow cytometry in Python:

pytometry, led by @mbuttner, 39 stars
- based on scanpy/AnnData, essentially leveraging the scanpy workflow + some cytometry-specific functions
CytoPy, led by @burtonrj, 39 stars
- not based on scverse, provies a lot of functions for both supervised and unsupervised analysis
Scyan, led by @quentinblampey, 28 stars
- Integration and automated annotation method based on scvi-tools
- Main focus is that method, but provides also a bunch of preprocessing and visualization functions
readfcs, maintained by @sunnyosun,
- read FCS files into AnnData
FlowKit, led by @whitews, 146 stars
- focus on classical gating-based analysis
- tutorial on exporting data to AnnData/scanpy

Many functions are redundant between those packages. It would be great to have the developers maintain them jointly in a central place to avoid duplicated efforts and everyone benefitting from speed optimizations.

Scope

(details to be discussed)

The package should include

IO (FCS -> AnnData)
preprocessing (split markers into FSC/SSC and surface markers; compensation, different normalization strategies, scaling)
flow-cytometry-specific visualizations (e.g. marker scatterplot, marker expression plots)
Well-established clustering/dimensionality-reduction methods that are specific to flow cytometry, e.g. FlowSOM

The package should NOT include

more experimental methods, such as Scyan. These can be ecosystem packages.
anything that is already handled by Scanpy/rapids-singlecell
anything specific to ImagingCyTOF, it should rather interface with squidpy/spatialdata.

All functions should be scalable to 10s of millions of cells (better 100s of millions). Partly, this can already be achieved by AnnData's new out-of-core support and GPU support through rapids-singlecell. In other cases subsampling approaches might be useful.

Next steps

If we decide to move forward with this, we'd first create a repository in the scverse namespace where such a package can be developed. To that end, the best implementations of the respective functions could be "stolen" from the packages listed above. Once the package is reasonably mature, it can be promoted to a core package and advertised as such.

Alternativly, pytometry could be moved to the scverse organization and serve as a starting point (IMO it fits the scope outlined above best, although I think there are more efficient implementations in the other packages for some steps).

grst commented 7 months ago

Dear @mbuttner, @burtonrj, @quentinblampey, @sunnyosun, @whitews,

as the developers of packages mentioned above, please let me know what you think. This is only going to work if there's reasonable buy-in from you (i.e. depending on the core package once ready) otherwise we'll end up with "just another" package.

This is also only going to fly if someone takes a lead on maintaining that package (although it'd be great if everyone contributed). @quentinblampey already hinted that he might be willing to take that on. The current scverse core team members can provide guidance and help with code reviews, but we don't have capacity to work on the implementation.

burtonrj commented 7 months ago

Hi @grst,

First of all, I'm honoured for the mention in the list of fantastic authors above and its exciting to see interest in a cytometry solution within the scverse. @mbuttner and I had been discussing this before Christmas and we both agreed that pytometry is the best way forward given its integration with anndata and the scanpy workflow. CytoPy is a bit of a bloated mess and I would prefer to retire it and hand over the reins to a scverse type package - something I've wanted to drive but life has got in the way. So TLDR, my vote is for pytometry. I can't commit to solely maintaining that work if it goes ahead but I can help in some of my spare time.

I did some work with @mbuttner to introduce FlowSOM to pytometry and that is included in the latest release. It depends upon my consensusclustering package which implements consensus clustering through resampling. I don't know if scverse already has an implementation of this that you might want to use instead. However - I've noticed that in the last few months the Saeys lab have released a Python version of the FlowSOM algorithm. Probably best to use their implementation given they are the original authors (https://github.com/saeyslab/FlowSOM_Python)

Just expanding on some of the features you mentioned and adding a few more ideas:

Scaling/transformations in cytometry world is slightly different and you need to provide logicle and arcsine transformations at a minimum. We had some back and forth with @whitews about the best way to do this in the community but I'm not sure it progressed (my bad, again life got in the way). A while back I did build a small package called cytotransform as a standalone transformation tool. This might be helpful.
Gating (the elephant in the room): people in the cytometry world are married to the idea. Some would love to see it die and be replaced with purely data-driven methods, but I can't see immunologists as a whole letting that happen. It also maybe has its place for pre-processing and cleaning. I don't think the auto-gating tools I implemented in CytoPy are as relevant now given the numerous autonomous clustering/classification tools in this space but maybe some I/O tools and GatingML compliance to enable import/export to FlowJo or CytoBank should be considered - I know thats not open-source but building an open-source gating solution probably extends beyond the scope of this project.
Cytometry specific pre-processing steps such as compensation and removal of anomalies is a must. CytoPy required that the user provides a compensation matrix from a program like FlowJo but I think @whitews has a method for creating the comp matrix in FlowKit (I need to double check). As for cleaning out anomalies, there are a bunch of great tools for this like FlowAI, FlowClean, or PeacoQC. The latter is probably the best as it is new and supported by the Saeys lab, however I don't think there is currently a Python implementation.

canergen commented 7 months ago

To add to feature requests: Best case there is an efficient way to save results in a manner that results can be opened in flow cytometry software (in the past I was writing results back to FCS).

mbuttner commented 7 months ago

Hi everyone,

I'd like to second on @burtonrj points. I have a similar impression that what @grst describes is largely covered by pytometry. I'm game on bringing everyone to the table and discuss at which parts we can join forces and where it's less useful.

FlowKit focusses more on the pre-processing side, gating functionality and integration with FlowJo (including IO or transfer of gates). That's a great approach and mimics most of the functionality in Python that classic flow data analysis entails. We already had a longer discussion on the integration of FlowUtils and FlowKit functionality for normalization in pytometry (https://github.com/buettnerlab/pytometry/issues/47), which I'd like to refer to here. Bottom line is that we can easily integrate some of the functionality from FlowKit and FlowUtils to anndata/scverse, but not all of it is well-suited or should be re-implemented in scverse. Instead, I think that the existing compatibility functions to anndata and FlowKit are a great way work with FlowKit.

On the idea of integrating gating functionality. I think it's probably best to implement this as a separate package, and keep the core package rather slim. I want to note here that we have been cooking up a tool to infer gating strategies from clustered data, which loops the gating strategy back in to the picture (https://github.com/buettnerlab/convexgating) - but this should stay a separate package just like scyan.

On the note if the readfcs package: @sunnyosun and @falexwolf supported me a lot with re-structuring pytometry and we coordinated the integration with pytometry. Pytometry completely relies on the readfcs package as data loader.

Overall, I would be happy to invest some more time into the strategy of the scverse for flow data.

sunnyosun commented 7 months ago

Hello everyone,

I'm aligned with @mbuttner and @burtonrj's assessments and happy to help if needed. Just wanted to add a point that we initially decided to have readfcs as a separate upstream I/O package so it's lightweight and free from all analytics dependencies.

quentinblampey commented 7 months ago

Hello all,

It's nice to see so much excitement about this! I agree with all that has been said, and as @canergen I also think that FCS-writing should be in the scope.

Out of what has been said, I see two main options:

1. Pytometry moved to scverse

As everyone said, Pytometry is the closest to the scope. Moving it to scverse can be a great solution, with minor adaptations to include some functionalities from the other packages.

2. New low-dependency package

Pytometry has quite a lot of dependencies, and I think it could also be great to have a very lightweight package. I was thinking about the following dependencies: anndata, seaborn, and eventually fcsparser (we could also have our own parser to not depend on fcsparser). Even with such minor dependencies, we could still do the first three goals listed in the above scope, while the fourth goal (i.e. clustering) could be inside an extra dependency.

If this new lightweight package is developed, we should all update our "larger-scope" packages to use this instead.

mbuttner commented 7 months ago

Hi everyone,

thank you for the active discussion. I think that @quentinblampey raises an important point on the number of dependencies in the current pytometry package, which we should debate.

Pytometry has quite a lot of dependencies, and I think it could also be great to have a very lightweight package. I was thinking about the following dependencies: anndata, seaborn, and eventually fcsparser (we could also have our own parser to not depend on fcsparser). Even with such minor dependencies, we could still do the first three goals listed in the above scope, while the fourth goal (i.e. clustering) could be inside an extra dependency.

If any, I think that the capabilities of visualization in pytometry should be expanded rather than reduced and adapted to the specific needs of displaying large amounts of cells and cell densities. The datashader dependency in the package is a big one (and does not come without its problems, agreed), but it was the best way we could come up with to visualize flow data without downsampling.

I can see the advantage of the new "lightweight package", which could be the pytometry package essentially without the visualization module. Happy to hear your thoughts.

flying-sheep commented 7 months ago

From my experience, there is no advantage to having a package with very few dependencies by itself. Instead, it very much depends:

If many dependencies aren’t in conda-forge, you have to get them all in before your package can be on there.
If some of the dependencies link native libraries, the package is less newbie friendly as it’s harder to install. (e.g. GPU stuff, llvmlite, …)
If some of the dependencies are super large (e.g. pytorch), it’s unwieldy to develop in a normal way.
If some of the dependencies do weird stuff like upper-bounding their own dependencies’ versions, that makes things complicated.

But the pure number of dependencies is meaningless. If you depend on 40 tiny, well tested, well behaved packages, there is no problem.

My recommendation: we should check if any dependencies meet any the criteria above and make them optional. That way, everyone can install the package easily and quickly, but advanced functionality becomes available when specifying the extra. E.g. installing scanpy[leiden] is something most people want, but scanpy is still functional without the associated dependencies being installed.

burtonrj commented 7 months ago

Just to add on the data visualisation side of things:

For plotting n-dimensional datasets following some dim reduction like UMAP/tSNE I think the core scanpy package has that covered. As @mbuttner points out however, sample size can be an issue since FCS data tends to scale to millions/100 millions of events. Although once you have identified a cell subset of interest this tends to reduce dramatically. I think in most cases sampling works well enough and there are methods like density dependent downsampling (introduced in SPADE) that help ensure rare cell populations don't disappear - I have an implementation of this I can share from CytoPy.
For 2-dimensional plots such as your traditional gating strategies, there are some 'standards' that are expected from publications (set by software like flowjo), notably around transformations and the axis (logicle or arcsinh). Transformations is always going to be a dependency and I think an issue that should be solved very early on. I used flowutils from @whitews originally before making cytotransform (I'm still not sure what the best approach is). It would also be good to use custom matplotlib tick locators like in @bpteague's cytoflow, this is what inspired my methods in cytopy

grst commented 7 months ago

I think that plotting should be part of the package. I tend to side with @flying-sheep that dependencies we deem too heavy could be optional. Datashader is great - it would be great to have that in scanpy eventually (see https://github.com/scverse/scanpy/issues/2656) but we might still need it in the flow cytometry package as an interim solution.

ivirshup commented 7 months ago

The datashader dependency in the package is a big one (and does not come without its problems, agreed), but it was the best way we could come up with to visualize flow data without downsampling.

FWIW, if all you need is the equivalent of pytometry.plotting.scatter_density, you could use a non-datashader solution here. Even np.histogram2d could be viable.

Proof of concept

```python import scanpy as sc import numpy as np import matplotlib.pyplot as plt, matplotlib as mpl adata = sc.datasets.blobs(n_observations=10_000_000) adata.obs["blobs"] = adata.obs["blobs"].astype("category") sc.pp.pca(adata, n_comps=5) ``` ```python %%time res = np.histogram2d(adata.obsm["X_pca"][:, 0], adata.obsm["X_pca"][:, 1], bins=[500, 500]) ``` ``` CPU times: user 1.32 s, sys: 64.3 ms, total: 1.38 s Wall time: 1.38 s ``` ```python %%time plt.imshow(res[0], cmap="Blues") plt.colorbar() ``` ![Untitled 2](https://github.com/scverse/governance/assets/8238804/7a9b93ed-39c8-4371-b5ad-3c56ca356b78) ``` CPU times: user 171 ms, sys: 131 ms, total: 302 ms Wall time: 127 ms ``` vs ```python %time sc.pl.pca(adata) ``` ``` CPU times: user 35.3 s, sys: 697 ms, total: 36 s Wall time: 35.8 s ```

ivirshup commented 7 months ago

But the pure number of dependencies is meaningless. If you depend on 40 tiny, well tested, well behaved packages, there is no problem.

I don't want to sidetrack too much, but I would push back on this. If you think people may want to use your package in a secure environment (e.g. a hospital) number of dependencies and publishers of those dependencies can absolutely be an issue. Especially in light of recent supply chain attacks on xz and pypi.

If some of the dependencies do weird stuff like upper-bounding their own dependencies’ versions, that makes things complicated.

Weird behavior can also include: being aggressive on lower bounds, having a solo developer who disappears, being pulled for some reason. And for how many packages do you need to be evaluating/ on top of this for?

whitews commented 7 months ago

Greetings all,

It's great to see this discussion on unifying the Python cytometry development efforts. Reading through the comments here gives me deja vu from past brainstorming sessions on implementing cytometry data models and functions. @grst references FlowKit but that package came later, building on previous work. Some of that history might be relevant to this discussion...

I currently maintain FlowIO, FlowUtils, and FlowKit. FlowIO & FlowUtils were originally split off from the fcm library I inherited here at Duke some 10 years ago now. I think fcm was the first Python package to read FCS files, however it also bundled a heavy dependency stack (including PyCUDA) to support a hierarchical Dirichlet clustering algorithm. As a result, it was a bear to maintain and install, especially just to read FCS files, so we decided to split it up.

FlowIO didn't need to interact with event data, so no dependencies were required (not even NumPy). As the package name implies, it also writes FCS files. It is a simple package with one FlowData class and a few functions. I admit there are pros & cons to this lean approach, but it has proven very useful.

FlowUtils became home to utility functions related to compensation and transformation. Later it would expand, but always focused on well-defined functions that would not likely change over time and/or would benefit from optimization. It has C extensions to make things as fast as possible, and in pure C to reduce dependencies. The C extensions mean users would need compilers, so pre-built binaries in PyPI are needed. Having well-defined functions that don't change allows for a stable package where pre-built binaries can be offered without too much of a maintenance burden for the combination of operating systems & Python versions. It stayed as a collection of functions rather than classes for ease of testing. It can also be easily wrapped by classes (see FlowKit.transforms).

Converting those FlowData lists to NumPy arrays, then applying compensation and transforms...it was too cumbersome. We wanted something more convenient, but we also had another idea. If we could just get event data for live single cells, or for this other project, maybe get just the cytokine positive populations. Then we could offer better statistical analysis, and often a given clustering algo would perform much better on this cleaner subset of data. This was the prize for all the gating code...those algorithms just work so much better given clean data. Yes, there are other concerns (batch effects, etc.) but this is one step closer. This was the goal for FlowKit, but with the caveat that we didn't want to bundle those algorithms directly into FlowKit (to avoid the volatility of chasing trends).

There's quite a lot more I could add about what I've learned, what worked and what didn't, but wanted to chime in with some of the history and add that I would also be willing to help out with a scverse flow project. I do have ongoing plans for the libraries I maintain, next up being FCS 3.2 support in FlowIO. Another idea is to add AnnData export to FlowKit, maybe as an optional dependency. Lots of possibilities! Looking forward to more discussions.

Kind regards, Scott

grst commented 7 months ago

Hi @whitews,

thanks for your insights! How does FlowIO relate to FCSParser, which I believe is used by readfcs/pytometry at the moment?

I think there has been some consensus to move on with pytometry within scverse and of course every input from your side is welcome and interoperability with your set of packages highly appreciated :)

whitews commented 7 months ago

Hi Gregor,

There's no relation between FlowIO and FCSParser that I am aware of. FlowIO is actively maintained since it is used by FlowKit. The same for FlowUtils.

As mentioned above, since FlowIO is targeted as a reader/writer, I recommend using the Sample class from FlowKit for any analysis on event data since any preprocessing is taken care of and it has convenient methods for compensation, transformation, and basic visualization. Using FlowKit also provides support for the standard gates defined in the GatingML 2.0 specification with a gating strategy implementation that a lot of thought and testing went into. And you'd get FlowJo wsp parsing. I know the traditional gating stuff isn't the aim of this proposed project, but for reasons I stated above, my opinion is that they can be a powerful tool when used together. Ok, that's my last sales pitch for FlowKit!

In any case, I'm happy to help in any way I can. -Scott

grst commented 6 months ago

Pytometry has been moved to scverse/pytometry and @quentinblampey has been added as an additional maintainer 🎉

I encourage everyone here to open issues in the pytometry repo to discuss improvements and further directions, also with respect to interoperability with @whitews's set of tools!

Closing this issue, as a way forward has been decided and more technical discussions can happen in the pytometry repo.

quentinblampey commented 6 months ago

Thanks @grst, I'll open a few issues in the next few days!