Interest in HTO/ADT demultiplexing and analysis?

wflynny commented 5 years ago

I'm starting to regularly analyze samples multiplexed with HTOs a la Stoeckius et al and multimodal CITE-seq ADTs. I have some rough ports of Seurat's HTODemux and some other functionality.

Is anyone else analyzing these kinds of experiments using scanpy and is there interest to bring functionality to do so into the main scanpy branch?

falexwolf commented 5 years ago

This is very interesting! It would be awesome if you linked to a small example to learn what you do exactly! I guess, a PR would then be more than welcome! 🙂

wflynny commented 5 years ago

@falexwolf Mostly just following along with the Seurat vignettes here using scanpy:

Mostly I'm using CITE-seq-count to generate counts matrices then storing the ADT counts in adata.obsm["X_adt"] and HTO counts in adata.obsm["X_hto"]. From there, I generate additional metadata about the cells which I store in either adata.obs (e.g, for HTOs, adata.obs.global_classification stores ["singlet", "doublet", "negative"] and adata.obs.tag_class stores ["hashtag_1", "hashtag_2", ...]).

I'd be happy to contribute a PR in the next couple weeks. A quick question though: the workflow involves some IO, normalization, classification, and plotting, so where would be the best entry point for this? I was thinking about stashing all related functionality in a multimodal.py module, but then should they be part of the tools api or a separate api like sc.mm? Or should I spread out the functionality across the existing pp, tl, pl apis?

Leprechaun777 commented 5 years ago

Hi, I have lots of interest in adding ADT and HTO functionality as well.

Thank you!

njbernstein commented 5 years ago

I've got some code to do this. I'll try to submit a PR soon

gokceneraslan commented 4 years ago

I'm also interested in this since I'll be analyzing some HTO data soon.

As I wrote here, I think we should also discuss the I/O and storage procedures for ADT/HTOs.

@wflynny it makes a lot of sense to use adata.obsm["X_adt"] and adata.obsm["X_hto"] for ADT and HTO counts. One caveat is that we cannot store ADT/HTO barcode strings in adata.obsm but I don't know how important this is.

For I/O, we can define a sc.read_antibody_tags(filename) that reads HTO/ADTs into the adata.obsm['X_hto']. Then a simple sc.pp.classify_hashtags() method can determine classes and creates new fields like HTO_class in adata.obs.

@wflynny @njbernstein what do you think? @wflynny what else do you think is needed for a nice HTO/ADT pipeline?

wflynny commented 4 years ago

@gokceneraslan Hey, sorry for my long silence on this.

I've been using @Hoohm's https://github.com/Hoohm/CITE-seq-Count for ADT/HTO tag counting which produces (in recent versions) a 10X v3 style mtx directory for both reads and UMIs. In some cases, I'll load these in as their own AnnData object with reads and counts as different layers which is helpful in computing per-cell or per-tag "sequencing saturation" and other metrics involving both reads and counts. This is especially helpful for investigating some pilot experiments (lipid tags, cholesterol tags, etc.) we've been doing.

However, most of the time I'll just load the tags matrix in as a pandas dataframe and run them through a demuxing function that'll modify adata.obs.

A couple challenges/ideas to consider:

at our facility, we're typically building the same Illumina i7 index (ATTACTCG) into all tag libraries. This leads to some tricky situations when using a NovaSeq for sequencing since the multiple tag libraries (with disjoint sets of tags) may be run on the same sequencing flowcell lane. This results in a single set of FASTQ files and thus a single barcode-tag matrix for all tag libraries on that lane. Therefore, the mapping between transcriptome AnnData objects <-> tag library matrices is not always 1-to-1.
in my experience, HTO libraries have a large variance in quality, so for the most part I've been using the transcriptome as my "ground truth" as to what is a cell. However, I imagine others use HTOs to "rescue" cells that were not called by their pipeline of choice (and I hope to do this once I build enough trust in the data). In that case, one would want to intersect the HTO classifications with the raw cell-gene matrix.
not all tags are antibody based, so I'd vote for naming all related functions *hashtags().

I'd therefore vote for something like the following design:

# htos is a AnnData object
htos = sc.read_hashtags(filename) 

# classify_hashtags adds a classification to the hto AnnData object
# kwargs might involve things like `use_tags=["tag1", "tag2", "tag3"]`
sc.pp.classify_hashtags(htos, **kwargs)
print(htos.obs.classification) 

# demuxing cell-gene matrix(es) could then be done like
rna1 = sc.read_10x_h5(...)
rna2 = sc.read_10x_h5(...)
# sc.pp.demux_by_hashtag(adata_hto, *adata_rna, tag_groups=None, ...)
sc.pp.demux_by_hashtag(
    htos, 
    rna1, rna2, 
    tag_groups=[("tag1", "tag3", "tag5"), ("tag2", "tag4", "tag6")]
)

@gokceneraslan This is more complex than what you suggested, but I think is sufficiently general to cover my needs as listed above. Let me know what you think---I'll have some development time next week to possible contribute to this.

wflynny commented 4 years ago

Any consensus here on how HTO/ADT information should be stored within the AnnData object?

Relevant threads:

njbernstein commented 4 years ago

I'm happy to implement my method, when we have a consensus.

https://github.com/calico/solo/blob/master/solo/hashsolo.py

https://www.biorxiv.org/content/10.1101/841981v1

aditisk commented 4 years ago

Hello all,

Has any functionality discussed above and in #797 implemented already ? If yes, where can I find the documentation on the methods ?

Thanks.

njbernstein commented 4 years ago

@aditisk I'm the author of this method https://github.com/calico/solo. it should install relatively easily if you have any issues I'm happy to help. The main functionality it doesn't have is tag_groups so you'd have t manually create that if you have used multiple hashes per group.

aditisk commented 4 years ago

@njbernstein thank you for the suggestion, I'll look into it and reach out if I have any issues.

njbernstein commented 3 years ago

I'd like to tackle this. Can someone tell me how we want to store hashing data in an anndata object? @flying-sheep @fidelram

I'll take back up porting hashsolo to scanpy

wflynny commented 3 years ago

I've been using a rough python implementation of Chris McGinnis's MULTIseq demuxing code for all my multiplexed experiments. This algorithm has been incorporated into Seurat as an alternative to their default HTODemux function. This recent preprint suggests it's one of the better algorithms for sample demultiplexing. I recently put my implementation on GitHub here: https://github.com/wflynny/multiseq-py.

Is there interest in including this in scanpy.external in addition to solo? If so, I can invest effort into cleaning up the implementation, adding tests, etc.

fidelram commented 3 years ago

If your implementation already uses scanpy, the best is to keep it in your repository and we can link to it from scanpy (see https://scanpy.readthedocs.io/en/stable/ecosystem.html).

I did some work on HTOs in the past and for me what worked best was to fit a gaussian mixture but I had not followed the new methods. Something that helped was to visualize the results as follows (each row a different barcode, x axis = log HTO):

If you are interested I can share the code with you.

wflynny commented 3 years ago

@fidelram Sounds good---I'll update the code and then open a PR to get it added to the ecosystem docs.

In my experience with HTOs (and now LMOs & CMOs), GMMs and even poisson/negative binomial mixture models don't work particularly well for all experiments as they tend to only call 50-70% of cells as singlets/multiplets. The remaining "negatives" or uncalled cells can really hamper some experimental designs (like when tags correspond to different conditions/perturbations). Anecdotally, multiplexing seems substantially more difficult to get right for tissues rather than blood or cell/organoid lines.

That said, I'd be interested in any plotting code you could share :). I very much appreciate all the plotting functionality you've implemented in scanpy!

njbernstein commented 3 years ago

@wflynny hashsolo allows you to set a prior for your expected rate negatives, singlets, and doublets, which helps quite a bit with the issue you described despite modeling the log CMO counts as a normal distribution. Additionally, you can also add cell types if you've done cell-type annotation or even leiden clustering labels to help with cell type variability with CMO counts. This helped me quite a bit in kidney where NK cells had far fewer CMO counts than other cells despite being apparently live cells, e.g. good gene diversity and low mitochondrial gene percentage.

@fidelram I'd be happy to add a visualization tool like you suggested if you have the code laying around.

wflynny commented 3 years ago

@njbernstein Probably not the right place for this discussion, but a couple of follow-up questions for you:

Do you happen to have a benchmark of hashsolo vs the other demuxing algos? It's been on my todo list for ..a while.. but still haven't gotten around to doing it. I've seen the benchmarks of the doublet finding capabilities of solo and they look good. As a user of scrublet, it'd be nice to have one tool/codebase that handles both transcriptomic and tag multiplets.
Are you open to PRs? I'd at least like to have functionality to generate the initial h5ad object containing the tag counts from the output of CITE-seq-Count.
Regarding non-antibody tags, have you noticed celltype-specific preferential binding? I've had problems with LMO/CMOs where not tagging particular celltypes (like some epithelial subtypes where we had 2-3 orders of magnitude lower tag counts).

njbernstein commented 3 years ago

In supplementary figure 9 of our paper, I did a light comparison of tools using the demuxlet data as ground truth: https://www.cell.com/cms/10.1016/j.cels.2020.05.010/attachment/040c239d-1e70-42a4-8974-9fbd75c65551/mmc1.pdf Which I think is a fine first stab at getting at this comparison, but it could be better. Hashsolo performance was comparable with other methods but is able to recover cell types with lower CMO counts.

I think that sounds great.

That's an issue we had as well, but I noticed it occurring for NK cells in kidney Screen Shot 2021-01-13 at 9 18 30 AM

njbernstein commented 3 years ago

@wflynny on my to-do list is to add solo to scVI directly and then scanpy will be a nice play to do both types of doublet finding.

brianpenghe commented 3 years ago

Any updates on this thread?

njbernstein commented 3 years ago

@brianpenghe hashsolo is implementer in scanpy now

brianpenghe commented 3 years ago

@brianpenghe hashsolo is implementer in scanpy now

That would be awesome. Which version of Scanpy includes hashsolo? Any Scanpy tutorials?

Zethson commented 6 months ago

@brianpenghe solo is here at least: https://docs.scvi-tools.org/en/stable/api/reference/scvi.external.SOLO.html Don't think hashsolo is anywhere yet though

njbernstein commented 6 months ago

@Zethson hashsolo is in scanpy already in the external api. Let me know if you have any questions about it !

Lucas-Maciel commented 2 weeks ago

Hi @njbernstein

I'm using scanpy version 1.10.2 and when I tried to run hashsholo but it's not working and I'm not sure why

Any suggestions? Thank you

Edit: htos is just the list with he two hashtag names shown in the obs

njbernstein commented 2 weeks ago

@Lucas-Maciel Can you try setting the number_of_noise_barcodes = 1?

e.g. scanpy.external.pp.hashsolo(adata, cell_hashing_columns, *, priors=(0.01, 0.8, 0.19), pre_existing_clusters=None, number_of_noise_barcodes=1) https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.hashsolo.html

scverse / scanpy

Interest in HTO/ADT demultiplexing and analysis? #351