kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/
197 stars 20 forks source link

AnnDataSet overwritten #261

Closed TingTingShao closed 4 months ago

TingTingShao commented 4 months ago

Dear,

Sorry to ask again on I/O question.

In order to fine tune the parameters, each time when I run the snakemake workflow, I had a different file name set up for the AnnDataSet which I will be used to post-process later on Jupyter notebook.

However, each time even with different file name, otehr AnnDataSets will get influenced. Could I please ask why this could be and how to solve this.

rule export_data:
    input: 
        processed_adatas=expand("014h5ad_processed_features/{name}.h5ad", name=sample_names)
    output:
        AnnDataSet=AnnDataSet
        # temp="temp/{name}.h5ad"
        # png_bfr_umap="020QC/umap.bfr.batch.png",
        # png_aft_umap_har="020QC/umap.aft.batch.har.png",
        # png_aft_umap_mnn="020QC/umap.aft.batch.mnn.png",
        # png_umap_mnn="040results/umap_mnn.png",
        # png_umap_har="030results/umap_har.png"
    run:
        # import snapatac2 as snap
        import numpy as np
        # print(sample_names)
        # print(input.processed_adatas) 
        filtered_adatas=[]
        for name, file in zip(sample_names, input.processed_adatas):
            # print(name)
            adata=snap.read(file, backed='r+', backend='hdf5')
            # adata_copy = adata.copy(filename=output.temp, backend=None)
            name = '014h5ad_processed_features/' + name
            # print(name)
            filtered_adatas.append((name, adata))
        # print(sample_names)
        # print(filtered_adatas)
        data=snap.AnnDataSet(
            adatas=filtered_adatas,
            filename=output.AnnDataSet
        )
        print(f'Number of cells: {data.n_obs}')
        print(f'Number of unique barcodes: {np.unique(data.obs_names).size}')

        unique_cell_ids = [sa + ':' + bc for sa, bc in zip(data.obs['sample'], data.obs_names)]
        data.obs_names = unique_cell_ids
        assert data.n_obs == np.unique(data.obs_names).size
        data.close()
kaizhang commented 4 months ago

Using SnapATAC2 with workflow management software like snakemake is a bit tricky. One issue with this is that SnapATAC2 will modify the input h5ad file when opening it in backed mode. To avoid this, you can (1) make a hard copy of the input h5ad file, or (2) read the h5ad file in in-memory mode (backed=None), or (3) in read-only mode (backed='r') if you don't need to modify it.

TingTingShao commented 4 months ago

Considering I am gonna have about 200 fragments files to process, do you have recommendations for snapatac2 workflow management?

Now I start to run with 92 fragments files to generate the anndataset, but some of the files have problem with the importing, only 74 have been successfully imported to .h5ad.

PanicException in file /vsc-hard-mounts/leuven-data/351/vsc35107/snap2_pipeline/Snakefile, line 54:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: "MissingStartPosition: c" }

I randomly selected three out of 18, and rerun the snakemake workflow, two of them have been successfully imported into .h5ad file, one of them showed the same error mentioned above. I used ArchR pacakage to convert that specific file into arrow file, and it was successful. The QC plot from archr is attached.

D19-13001-Fragment_Size_Distribution.pdf D19-13001-TSS_by_Unique_Frags.pdf

Any ideas on what is the potential issue?

Thanks in advance! tingting

TingTingShao commented 4 months ago

Using SnapATAC2 with workflow management software like snakemake is a bit tricky. One issue with this is that SnapATAC2 will modify the input h5ad file when opening it in backed mode. To avoid this, you can (1) make a hard copy of the input h5ad file, or (2) read the h5ad file in in-memory mode (backed=None), or (3) in read-only mode (backed='r') if you don't need to modify it.

rule add_tsse:
    input:
        h5ad="010h5ad/{name}.h5ad",
    output:
        tsse_h5ad="011h5ad_tsse/{name}.h5ad"
    run:
        # import snapatac2 as snap
        adata=snap.read(input.h5ad, backed="r+", backend=None)
        adata_copy = adata.copy(filename=output.tsse_h5ad, backend=None)
        snap.metrics.tsse(adata_copy, snap.genome.hg38)
        print(adata_copy)
        adata_copy.close()

I did like sth this, look like it is working.

thanks tingting

kaizhang commented 4 months ago

It's better to change backed='r+' to backed='r', since you don't need write permission.

TingTingShao commented 4 months ago

Thanks!

I thought I need to write to the copy, so I set as r+. But I understand what you meant.

TingTing