scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License
1.22k stars 346 forks source link

PeakVI batch effect correction problems #1573

Closed cflerin closed 2 years ago

cflerin commented 2 years ago

Hi, thanks for developing this nice package! I'm following the tutorial steps for PeakVI with scATAC-seq data, but using two samples to test the batch effect correction. When I complete all of the steps, it doesn't appear that any correction has happened, and my samples are still clearly separated in the UMAP. I used this same procedure previously with scRNA data using scVI model and it worked very well, so I'm not sure if I am doing something wrong. Is using the filtered_peak_bc_matrix possibly the cause of this? Perhaps there are not enough overlapping peaks between the two samples...

adata0 = scvi.data.read_10x_atac('sample0/outs/filtered_peak_bc_matrix/')
adata1 = scvi.data.read_10x_atac('sample1/outs/filtered_peak_bc_matrix/')
adata = anndata.concat([adata0, adata1], join='outer', index_unique='___', label='batch_key')
adata
# AnnData object with n_obs × n_vars = 20086 × 309096
#     obs: 'batch_id', 'batch_key'

### filter
min_cells = int(adata.shape[0] * 0.05)
sc.pp.filter_genes(adata, min_cells=min_cells)
print(adata.shape)
# (20086, 25362)

scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key='batch_key',
)
pvi = scvi.model.PEAKVI(adata)
pvi.train()
# Epoch 147/500:  29%|██▉       | 147/500 [07:02<16:54,  2.87s/it, loss=8.52e+07, v_num=1]
# Monitored metric reconstruction_loss_validation did not improve in the last 50 records. Best score: 5729.408. Signaling Trainer to stop.

adata.obsm['X_PeakVI'] = pvi.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata, min_dist=0.2)
sc.pl.umap(adata, color='batch_key')

image

There is no error but I expected the batch effect correction to work a bit better.

Versions:

0.16.4

adamgayoso commented 2 years ago

@talashuach any thoughts? perhaps you need to consider making sure both your datasets are using the same feature definitions in terms of regions?

cflerin commented 2 years ago

I think I can answer my own question here -- this definitely seems to have been caused by the regions that I used. When loading the Cell Ranger outputs, there is only a filtered subset of the peaks present in the filtered_peak_bc_matrix output (it's the same set of peaks for raw). So when merging these from two samples there is very little overlap. In the above example it was something like 100 overlapping peaks, so it's not surprising that the samples could not be integrated.

When I try again with a better set of common regions, I get a much better result: image This was done by starting from the fragments files, counting into a cell x peaks matrix with the Encode SCREEN regions (2M+ regions in human), then doing the merge and subsequent region filtering. Here I have more like 30k regions with peaks in both datasets (nearly 100% of the filtered regions).

So, just a caution to anyone starting from the Cell Ranger output, it is probably necessary to go back and start from the fragments files. Thanks again for this amazing tool!