Open melaniebalaz opened 5 months ago
Hello melaniebalaz, I have some questions about the DCIS1 dataset and would really appreciate your help. In copyVAE, only the RNA-seq data are provided for CNV detection. However, the bulk DNA sequencing and whole-exome sequencing analyses served as ground truth when calculating the index middle distance for this method. Where can I download this reference data? I only found the same scRNA data in the CopyKAT paper mentioned in the article. I am looking forward to your reply. Thank you very much; your reply is important to me.
I was trying copyVAE on two different datasets to get an idea of its performance and noticed a few small things that I wanted to point out, as potential improvement suggestions or in case someone else runs into similar issues. I used a trisomy12 dataset and the DCIS dataset the authors used for benchmarking in the paper.
1. Access to the sorted anndata object as output
Since the anndata object gets sorted by the absolute position in `bin_genes_from_anndata`, it would be great to write this sorted adata object to file somewhere. After running copyVAE I want to be able to trace the resulting matrices back to the cells they were originally annotated with. Since they got sorted, the original input order is no longer maintained and can't be used for labelling. One option would be to just write the "data" object to file, for example after the following fields are set:
Using my trisomy12 dataset, which I imported using the `sc.read_10x_mtx()` function (versus the DCIS dataset, which I imported as `sc.AnnData()` from a pd dataframe) before saving it as .h5ad for copyVAE, I came across a runtime issue:
2. Datastructure mismatch
The data imported with `sc.read_10x_mtx` comes in the format
and after calling `todense()` on it stays a matrix:
In comparison, the DCIS dataset is of type `numpy.ndarray`. The code seems to work with the ndarray, but not with the matrix. Further down the line, in `auto_corr()` called from `find_normal_cluster()`, this leads to a ValueError in the line `res += np.sum(cluster_data[i,:] * cluster_data[j,:])` because of the dimensions:
ValueError: shapes (1,529) and (1,529) not aligned: 529 (dim 1) != 1 (dim 0)
This can be fixed, for example, by converting
Either this, or specifying that the input has to be in array form and not matrix form.
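The mismatch described above can be reproduced without copyVAE. The following is a minimal sketch, assuming only NumPy: `pairwise_product_sum` is a hypothetical stand-in built around the single line quoted from `auto_corr()`, not the actual copyVAE function, and the small input values are invented. It shows that `*` on `numpy.matrix` rows is a matrix product (hence the "not aligned" error), while on `numpy.ndarray` rows it is element-wise. Converting with `np.asarray()` is one possible fix, not necessarily the one intended in the issue; for SciPy sparse input, calling `.toarray()` instead of `.todense()` also yields an ndarray.

```python
import numpy as np


def pairwise_product_sum(cluster_data):
    """Hypothetical stand-in for the loop around the line quoted from auto_corr()."""
    res = 0.0
    n_cells = cluster_data.shape[0]
    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            # ndarray rows: element-wise product; np.matrix rows: matrix product
            res += np.sum(cluster_data[i, :] * cluster_data[j, :])
    return res


# np.matrix is what todense() on a SciPy sparse matrix returns
dense_matrix = np.matrix([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

try:
    pairwise_product_sum(dense_matrix)
    matrix_failed = False
except ValueError:
    # (1,3) * (1,3) is not a valid matrix product
    matrix_failed = True

# np.asarray() returns a plain ndarray, so row products become element-wise
dense_array = np.asarray(dense_matrix)
result = pairwise_product_sum(dense_array)
```

With the conversion in place, the same loop runs on either input type, so the check could live at the point where the count matrix is first read.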
4. Gene name mapping
I have noticed that the `gene_ids` in adata are overwritten by the `gene_name`, only to later match on the gene names when mapping, but from the (now overwritten with gene names) `gene_ids` variable in adata.
Why not match on `gene_names` in adata, instead of overwriting `gene_ids` and matching on those? Or alternatively match `gene_ids` from adata against "Gene Stable ID" from the gene_map? I was trying to set the metadata correctly in the anndata object for my trisomy dataset, to match what the code expects, but this part seemed unnecessarily confusing.
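To make the second alternative concrete, here is a rough sketch of matching on stable IDs without overwriting anything. It uses plain dictionaries rather than the real anndata and gene_map objects; every ID, gene name, position, and the `abs_pos` key are invented placeholders, and `map_by_stable_id` is a hypothetical helper, not part of copyVAE.

```python
# Placeholder gene map keyed by "Gene Stable ID"; all values are invented
gene_map = {
    "ID_0001": {"gene_name": "GENE_A", "abs_pos": 3000},
    "ID_0002": {"gene_name": "GENE_B", "abs_pos": 1000},
    "ID_0003": {"gene_name": "GENE_C", "abs_pos": 2000},
}

# Placeholder per-gene annotation, as it might sit in adata.var
var_rows = [
    {"gene_ids": "ID_0001", "gene_names": "GENE_A"},
    {"gene_ids": "ID_0003", "gene_names": "GENE_C"},
    {"gene_ids": "ID_9999", "gene_names": "GENE_X"},  # absent from the map
]


def map_by_stable_id(var_rows, gene_map):
    """Attach positions by matching gene_ids to stable IDs; both columns stay intact."""
    mapped = []
    for row in var_rows:
        entry = gene_map.get(row["gene_ids"])
        if entry is None:
            continue  # genes missing from the map are dropped in this sketch
        mapped.append({**row, "abs_pos": entry["abs_pos"]})
    # Order genes by absolute position
    return sorted(mapped, key=lambda r: r["abs_pos"])


mapped = map_by_stable_id(var_rows, gene_map)
```

Because `gene_ids` and `gene_names` are both preserved, users can set up their metadata once and still see which identifier was used for the match.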
5. Output decimals and not integers ("whole copy numbers")
I have noticed that the output for both datasets consists of fractional values and not whole numbers. While the differences in CN, for example in the trisomy12 dataset, are visible in the fractional values, they are not large enough to survive rounding to whole copy numbers. Would it make sense to add functionality to get whole copy numbers?
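As an illustration of the rounding concern, here is a small sketch with invented per-bin values (not actual copyVAE output) and a hypothetical helper `to_whole_copy_numbers`. It shows how a modest fractional gain vanishes under naive rounding, which suggests that any such feature would likely need more than a plain `round()`.

```python
def to_whole_copy_numbers(profile):
    """Round each fractional copy-number estimate to the nearest integer."""
    # Python's round() rounds exact halves to the even neighbour;
    # the example values below avoid exact .5 cases
    return [int(round(value)) for value in profile]


# Invented estimates: diploid baseline, with a modest gain in the last two bins
fractional = [2.02, 1.97, 2.31, 2.28]
whole = to_whole_copy_numbers(fractional)
```

Here all four bins come out as 2, so the gain in the last two bins is no longer distinguishable from the baseline.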
6. Nice to have: add a parameter for setting the output_path directory
This would just make things a bit more convenient:
parser.add_argument("-o", "--output_path", help="output path prefix")
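A slightly fuller sketch of how this option might be wired in, using only the standard library. The positional `input` argument, the empty default, the `output_file` helper, and the file name `copy.npy` are assumptions made for illustration; they are not taken from the copyVAE command-line interface.

```python
import argparse
import os


def build_parser():
    """Build a hypothetical parser containing the suggested -o/--output_path option."""
    parser = argparse.ArgumentParser(description="illustrative CLI sketch")
    parser.add_argument("input", help="input .h5ad file")
    # An empty default keeps outputs in the working directory
    parser.add_argument("-o", "--output_path", default="", help="output path prefix")
    return parser


def output_file(prefix, name):
    """Join the user-supplied prefix with an output file name."""
    return os.path.join(prefix, name)


args = build_parser().parse_args(["data.h5ad", "-o", "results"])
copy_path = output_file(args.output_path, "copy.npy")
default_args = build_parser().parse_args(["data.h5ad"])
default_path = output_file(default_args.output_path, "copy.npy")
```

Routing every write through one helper keeps the behaviour unchanged when the option is omitted.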