kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/
197 stars 20 forks source link

obs_names resets to original barcodes when assigned via AnnDataset #278

Open newtonharry opened 3 months ago

newtonharry commented 3 months ago

Hi,

So I'm currently working on a multi-sample pipeline and I've followed your tutorial. I currently have to rename barcodes such that they're the same across snATAC-seq and snRNA-seq data. I'm appending the sample to the beginning of the barcode as such:

E14_data_directory = f"{data_directory}/snATAC-seq/E14"
E18_data_directory = f"{data_directory}/snATAC-seq/E18"

def generate_file_paths(file_names, dir):
    input_files = [f"{dir}/fragments/{file_name}.tsv.gz" for file_name in file_names]
    adata_samples = [f"{dir}/anndata/{file_name}.h5ad" for file_name in file_names]

    return input_files, adata_samples

E18_file_names = ["E18_Hom_1", "E18_Hom_2", "E18_Con_1", "E18_Con_2"]
E18_input_files, E18_adata_samples = generate_file_paths(
    E18_file_names, E18_data_directory
)

E14_file_names = ["E14_Hom_1", "E14_Hom_2", "E14_Con_1", "E14_Con_2"]
E14_input_files, E14_adata_samples = generate_file_paths(
    E14_file_names, E14_data_directory
)

combined_file_names = [*E14_file_names, *E18_file_names]
combined_input_files = [*E14_input_files, *E18_input_files]
combined_adata_samples = [*E14_adata_samples, *E18_adata_samples]

# NOTE: ONLY set this to true if you have RAM mounted as a file system (like /dev/shm)
use_ram = False

# Genome to use
chrom_sizes = snap.genome.mm39.chrom_sizes

if use_ram:
    os.mkdir("/dev/shm/snap_tmp_dir")
    adatas = snap.pp.import_data(
        combined_input_files,
        sorted_by_barcode=False,
        file=combined_adata_samples,
        tempdir="/dev/shm/snap_tmp_dir",
        chrom_sizes=snap.genome.mm10,
        min_num_fragments=1000,
    )
else:
    adatas = snap.pp.import_data(
        combined_input_files,
        sorted_by_barcode=False,
        file=combined_adata_samples,
        tempdir=f"{data_directory}/tmp",
        chrom_sizes=snap.genome.mm10,
        min_num_fragments=1000,
    )

for adata, sample_name in zip(adatas, combined_file_names):
    adata.obs_names = [f"{sample_name}:{barcode}" for barcode in adata.obs_names]

atac_data_adataset = snap.AnnDataSet(
        adatas=[(name, adata) for name, adata in zip(combined_file_names, adatas)],
        filename=f"{data_directory}/snATAC-seq/atac_brain_data.h5ads"
    )

This code successfully renames each barcode within each sample and the renamed barcodes persist when I load it into an AnnDataSet object. When I try to do that same to an AnnDataSet object, the indexes reset when I reload it. I think there must be no synchronisation happening between obs_names and obs.index in your pyanndata. Perhaps it's a good idea to be able to rename the obs_names from an AnnDataSet object instead of having to load each AnnData and assign it there.