Usability issues - Githubissues

I was trying copyVAE on two different datasets to get an idea for its performance and noticed a few small things that I wanted to point out, as potential improvement suggestions or in case someone else runs into similar issues. I used a trisomy12 dataset and the DCIS dataset the authors used for benchmarking in the paper.

1. Access to the sorted anndata object as output Since the anndata object gets sorted by the absolute position in bin_genes_from_anndata it would be great to write this sorted adata object to file somewhere. After running copyVAE I want to be able to trace back the resulting matrices to the cells they were originally annotated with. Since they got sorted, the original input order is not maintained anymore and can't be used for labelling. One option would be to just write the "data" object to file, for example after the following fields are set:

data.obsm["latent"] = z
data.obsm["copy_number"] = copy_bin

Using my trisomy12 dataset, which I imported using the sc.read_10x_mtx() function (versus the DCIS dataset which I imported as sc.AnnData()from a pd dataframe) before saving it as .h5ad for copyVAE, I came across a runtime issue:

2. Datastructure missmatch The data imported with sc.read_10x_mtx comes in the format

type(data.X)
<class 'scipy.sparse._csc.csc_matrix'>

and after calling todense() on it stays a matrix:

type(x)
<class 'numpy.matrix'>

In comparison to the DCIS dataset, which is type numpy.ndarray. The code seems to work with the ndarray, but not matrix.

Because this further down the line in auto_corr() called from find_normal_cluster() leads to a ValueError in the line res += np.sum(cluster_data[i,:] * cluster_data[j,:]), because of the dimensions. ValueError: shapes (1,529) and (1,529) not aligned: 529 (dim 1) != 1 (dim 0)

This can be, for example, fixed by converting

try:
        x = data.X.todense()
    except AttributeError:
        x = data.X

    if(isinstance(x, np.matrix)):
        x = np.asarray(x)

Either this or defining the input has to be in array form and not matrix form.

4. Gene name mapping I have noticed that the gene_ids in adata are overwritten by the gene_name

  if 'gene_name' in adata.var:
        adata.var['gene_ids'] = adata.var['gene_name']

Only to later on match on the gene names when mapping, but from the (now overwritten with gene names) gene_ids variable in adata.

map_chr = dict(gene_map[['Gene name', 'chr']].values)
    adata.var['chr'] = adata.var.gene_ids.map(map_chr)

Why not match on gene_names in adata, instead of gene_ids and overwriting? Or alternatively matching gene_ids from adata and "Gene Stable ID" from the gene_map? I was trying to set the metadata correctly in the anndata object in my trisomy dataset, to match how the code expects it to be, but this part seemed unnecessarily confusing.

5. Output decimals and not integers ("whole copy numbers") I have noticed that the output from both datasets are fractions and not whole numbers. While the differences in CN, for example in the trisomy12 dataset, are there in a fractional state, they are not really big enough so they wouldn't disappear when rounding to whole copy numbers. Would it make sense to add a functionality to get whole copy numbers?

6. Nice to Have: Add a parameter for setting the output_path directory. Would just make things a bit more convenient: parser.add_argument("-o", "--output_path", help="output path prefix")

mandichen / copyVAE

Usability issues #3