ArchR vs SnapATAC2 embeddings

zia1138 commented 1 year ago

Dear @kaizhang, We have a data set where we are seeing very different embedding results on the UMAP for ArchR as compared to SnapATAC2. Any suggestions on how to make the two approaches comparable? Thanks!

kaizhang commented 1 year ago

Can you share the result or the data with me? So that I can investigate the issues.

zia1138 commented 1 year ago

Thanks @kaizhang! Here is very similar data set: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171943 in particular the following example from mouse: GSM5238385_ME11_50um.fragments.tsv.gz It's a challenging data set but ArchR provides more sensible results.

kaizhang commented 1 year ago

Thanks for bringing up this issue! I found that for certain datasets the default bin size (500bp) doesn't work well, and it seems to be the case here. Changing the bin size to 5kb seems to give a better result.

I haven't run ArchR to make the comparison. But if ArchR doesn't have such an issue for 500 bp bins, then it is likely because of its feature selection procedure. ArchR uses a very aggressive feature selection procedure, while SnapATAC2 takes an extremely conservative approach. For a small dataset like this, ArchR's feature selection may make better sense. This is just my guess. I will investigate further to confirm when I have more time.

zia1138 commented 1 year ago

Hi @kaizhang Thanks for investigating this!!! I can't seem to replicate your UMAP. Here's how I generated mine:

snap.pp.add_tile_matrix(data, bin_size=5000)                                                                                                                                                                                                                                                                                                                                               
snap.pp.select_features(data)                                                                                                                                        
snap.tl.spectral(data)                                                                                                                                                                        
snap.pp.knn(data)                                                                                                                                                                             
snap.tl.umap(data)                                                                                                                                                                            
snap.tl.leiden(data)                                                                                                                                                                          
snap.pl.umap(data, color="leiden", out_file="umap.pdf")

Could you share yours? I must be missing something.

kaizhang commented 1 year ago

I'm using the developmental version. One of the key improvements in dev version is that you don't have to mannually select top eigenvectors, which is critical to getting the best result in older versions.

Untitled

kaizhang commented 1 year ago

If I use the features selected by ArchR, then I can produce a result similar to ArchR's. Here is the code to reproduce the plot below:

import snapatac2 as snap
data = snap.pp.import_data(
    "GSM5238385_ME11_50um.fragments.tsv.gz",
    genome=snap.genome.mm10,
    sorted_by_barcode=False,
)
snap.pp.filter_cells(data, min_tsse=4, min_counts=10000, max_counts=100000)
snap.pp.add_tile_matrix(data)
snap.pp.select_features(data, whitelist="features.bed") # 'features.bed' contains features selected by ArchR
snap.tl.spectral(data)
snap.tl.umap(data)
snap.pp.knn(data)
snap.tl.leiden(data)
snap.pl.umap(data)

I will implement an alternative feature selection algorithm similar to ArchR's in SnapATAC2. So users will be able to compare the pros and cons of different feature selection algorithms.

zia1138 commented 1 year ago

Sweet. Thank you @kaizhang This is fantastic!

kaizhang commented 1 year ago

I've done more comparisons. Here is the conclusion:

The differences you have observed mainly come from that ArchR and SnapATAC2 use different feature sets. When using the same features, they lead to comparable embedding, but SnapATAC2's dimension reduction algorithm detects finer structures, consistent with our observations in benchmarking datasets.

Untitled

zia1138 commented 1 year ago

Thanks @kaizhang! That is an interesting finding. Look forward to seeing more benchmarks when you publish this great work.

gokceneraslan commented 1 year ago

I will implement an alternative feature selection algorithm similar to ArchR's in SnapATAC2

Shall we keep this issue open until this is implemented? Or do you want to open a new one for this @kaizhang ?

kaizhang commented 1 year ago

@gokceneraslan I will open a new one. Thanks!

kaizhang commented 1 year ago

@gokceneraslan @zia1138 The new feature selection method is implemented and I opened a new issue here: #111

gokceneraslan commented 1 year ago

Thanks!

kaizhang / SnapATAC2

ArchR vs SnapATAC2 embeddings #109