the description of read_mtx is not clear

kaizhang / SnapATAC2

Single-cell epigenomics analysis tools

https://kzhang.org/SnapATAC2/

222 stars 24 forks source link

the description of read_mtx is not clear #198

Open beyondpie opened 10 months ago

beyondpie commented 10 months ago

Hi Kai,

I found that the description of read_mtx is not clear.

obs_names: File that stores the observation names.
- obs_name_file would be better, otherwise, it's hard to know it's a file.
- it should be row / column of the read_mtx? From my usage, I think it should be row names. But this is usually not the order of 10x_mtx (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_10x_mtx.html). And I think people use read_mtx usually because people wants to read 10x_mtx.
same for var_names.

Maybe we can provide another function like scanpy to directly load mtx from a directory?

Sincerely, Songpeng

kaizhang commented 10 months ago

Good idea! Yes, we can add read_10x_mtx to the package. Do you want to give it a try?

beyondpie commented 10 months ago

Sure. I see scanpy is also our dependence. I can basically call scancp's function, which load data into memory, then create SnapATAC2's AnnData to store it in file. What do you think?

kaizhang commented 10 months ago

scanpy is not a mandatory dependency. It would be better not to rely on scanpy for this.

kaizhang commented 9 months ago

A function called "read_10x_mtx" was added with this commit 1a7f69269514fd9ba4b31c293ca68d0999642782 for reading mtx files generated by 10x genomics. I'm closing this issue now.

beyondpie commented 9 months ago

Thanks, Kai! Sorry for the late response, I am now in the job market...

beyondpie commented 6 months ago

Hi Kai @kaizhang ,

I would suggest we copy the behavior of getting gene symbols by default in scanpy.read_10x_mtx (https://github.com/scverse/scanpy/blob/214e05bdc54df61c520dc563ab39b7780e6d3358/scanpy/readwrite.py#L570):

    genes = pd.read_csv(
        path / f"{prefix}{'genes' if is_legacy else 'features'}.tsv{suffix}",
        header=None,
        sep="\t",
    )
    if var_names == "gene_symbols":
        var_names_idx = pd.Index(genes[1].values)
        if make_unique:
            var_names_idx = anndata.utils.make_index_unique(var_names_idx)
        adata.var_names = var_names_idx
        adata.var["gene_ids"] = genes[0].values
    elif var_names == "gene_ids":
        adata.var_names = genes[0].values
        adata.var["gene_symbols"] = genes[1].values
    else:
        raise ValueError("`var_names` needs to be 'gene_symbols' or 'gene_ids'")

I noticed that in our package, we will get ENSMUSG-like index, but most of the time, I think we need gene symbols.

Thanks! Songpeng