Closed zia1138 closed 1 year ago
Can you share the result or the data with me? So that I can investigate the issues.
Thanks @kaizhang! Here is very similar data set: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171943 in particular the following example from mouse: GSM5238385_ME11_50um.fragments.tsv.gz It's a challenging data set but ArchR provides more sensible results.
Thanks for bringing up this issue! I found that for certain datasets the default bin size (500bp) doesn't work well, and it seems to be the case here. Changing the bin size to 5kb seems to give a better result.
I haven't run ArchR to make the comparison. But if ArchR doesn't have such an issue for 500 bp bins, then it is likely because of its feature selection procedure. ArchR uses a very aggressive feature selection procedure, while SnapATAC2 takes an extremely conservative approach. For a small dataset like this, ArchR's feature selection may make better sense. This is just my guess. I will investigate further to confirm when I have more time.
Hi @kaizhang Thanks for investigating this!!! I can't seem to replicate your UMAP. Here's how I generated mine:
snap.pp.add_tile_matrix(data, bin_size=5000)
snap.pp.select_features(data)
snap.tl.spectral(data)
snap.pp.knn(data)
snap.tl.umap(data)
snap.tl.leiden(data)
snap.pl.umap(data, color="leiden", out_file="umap.pdf")
Could you share yours? I must be missing something.
I'm using the developmental version. One of the key improvements in dev version is that you don't have to mannually select top eigenvectors, which is critical to getting the best result in older versions.
If I use the features selected by ArchR, then I can produce a result similar to ArchR's. Here is the code to reproduce the plot below:
import snapatac2 as snap
data = snap.pp.import_data(
"GSM5238385_ME11_50um.fragments.tsv.gz",
genome=snap.genome.mm10,
sorted_by_barcode=False,
)
snap.pp.filter_cells(data, min_tsse=4, min_counts=10000, max_counts=100000)
snap.pp.add_tile_matrix(data)
snap.pp.select_features(data, whitelist="features.bed") # 'features.bed' contains features selected by ArchR
snap.tl.spectral(data)
snap.tl.umap(data)
snap.pp.knn(data)
snap.tl.leiden(data)
snap.pl.umap(data)
I will implement an alternative feature selection algorithm similar to ArchR's in SnapATAC2. So users will be able to compare the pros and cons of different feature selection algorithms.
Sweet. Thank you @kaizhang This is fantastic!
I've done more comparisons. Here is the conclusion:
The differences you have observed mainly come from that ArchR and SnapATAC2 use different feature sets. When using the same features, they lead to comparable embedding, but SnapATAC2's dimension reduction algorithm detects finer structures, consistent with our observations in benchmarking datasets.
Thanks @kaizhang! That is an interesting finding. Look forward to seeing more benchmarks when you publish this great work.
I will implement an alternative feature selection algorithm similar to ArchR's in SnapATAC2
Shall we keep this issue open until this is implemented? Or do you want to open a new one for this @kaizhang ?
@gokceneraslan I will open a new one. Thanks!
@gokceneraslan @zia1138 The new feature selection method is implemented and I opened a new issue here: #111
Thanks!
Dear @kaizhang, We have a data set where we are seeing very different embedding results on the UMAP for ArchR as compared to SnapATAC2. Any suggestions on how to make the two approaches comparable? Thanks!