Genes and samples switched in goodSamplesGenes

lorenzoamir commented 10 months ago

Hi, thanks a lot for the package,

I just noticed that good samples and good genes seem to be switched in goodSamplesGenes. This could be an issue with the documentation rather than the function itself.

I am working with an anndata object with shape n_obs × n_vars = 79 × 19013 (79 samples, 19013 genes). Following the documentation, goodSamplesGenes takes "A data frame in which columns are genes and rows are samples" and returns "A triple containing (goodGenes, goodSamples, allOK)". But when I run

good_genes, good_samples, all_ok = WGCNA.goodSamplesGenes(datExpr=pd.DataFrame(adata.X))

good_genes has length 79 and good_samples has length 19013. So the documentation should probably be changed to "A triple containing (goodSamples, goodGenes, allOK)". Or maybe it's just me mixing up rows and columns, but my dataframe looked like the ones shown in the tutorials.
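One quick way to tell which returned vector belongs to which axis is to compare its length against the shape of the input. The sketch below uses a hypothetical helper, match_masks_to_axes, and toy boolean masks standing in for the first two returned values. It does not call PyWGCNA, and it assumes the input is not square.

```python
import numpy as np
import pandas as pd


def match_masks_to_axes(df, mask_a, mask_b):
    """Hypothetical helper: decide which mask describes rows and which columns."""
    n_rows, n_cols = df.shape
    if n_rows == n_cols:
        raise ValueError("square input: lengths cannot disambiguate the axes")
    if len(mask_a) == n_rows and len(mask_b) == n_cols:
        return {"rows": mask_a, "columns": mask_b}
    if len(mask_a) == n_cols and len(mask_b) == n_rows:
        return {"rows": mask_b, "columns": mask_a}
    raise ValueError("mask lengths do not match the input shape")


# Toy stand-in for the real data: 4 samples (rows) x 6 genes (columns).
expr = pd.DataFrame(np.ones((4, 6)))

# Toy stand-ins for the first two returned values, mirroring the report above:
# the first has one entry per row, the second one entry per column.
first = np.ones(4, dtype=bool)
second = np.ones(6, dtype=bool)

axes = match_masks_to_axes(expr, first, second)
print(len(axes["rows"]), len(axes["columns"]))
```

With a samples × genes input, whichever vector ends up under "rows" is the per-sample one, regardless of the name it was unpacked into.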

EDIT: made anndata object name more clear and fixed typos

lorenzoamir commented 10 months ago

I looked into this issue, hoping I could fix it if it was just a matter of switching two variable names, and found this in the documentation of the WGCNA class:

class WGCNA(GeneExp):
    """
    A class used to do weighted gene co-expression network analysis.
    [ . . . ]
    :param anndata: if the expression data is in anndata format you should pass it through this parameter. X should be expression matrix. var is a sample information and obs is a gene information.
    :param anndata: anndata

So :param anndata: appears twice (the second occurrence should probably be :type anndata:). More importantly, the documentation states that genes should be stored in the obs field and samples in the var field. This is a bit counterintuitive because, as far as I know, it is usually the other way round (obs = samples and var = genes), but it could explain why they were switched in my previous example.

However, the PyWGCNA_object tutorial states that the data is stored "in AnnData format which rows/obs are samples/sample information and cols/var are genes/gene information", and it indeed shows that the .obs field contains samples and the .var field contains genes.

I'm a bit confused and no longer know whether I should transpose my data. This can easily be worked around by using dataframes instead of anndata objects, but since an anndata option is available, maybe the expected orientation should be documented more clearly?
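Until the convention is settled, one defensive option is to orient the matrix by its labels rather than by position. The following is a minimal sketch using plain pandas; to_samples_by_genes and the toy IDs are hypothetical, and it assumes the sample IDs are known and appear on exactly one axis of the dataframe.

```python
import numpy as np
import pandas as pd


def to_samples_by_genes(df, sample_ids):
    """Hypothetical helper: return df with samples as rows and genes as columns."""
    samples = set(sample_ids)
    if set(df.index) == samples:
        return df  # already samples x genes
    if set(df.columns) == samples:
        return df.T  # stored genes x samples, so flip it
    raise ValueError("sample IDs match neither the rows nor the columns")


sample_ids = ["s1", "s2", "s3"]
gene_ids = ["g1", "g2", "g3", "g4", "g5"]

# Toy matrix stored genes x samples, i.e. the opposite of obs = samples.
genes_by_samples = pd.DataFrame(
    np.zeros((5, 3)), index=gene_ids, columns=sample_ids
)

oriented = to_samples_by_genes(genes_by_samples, sample_ids)
print(oriented.shape)
```

Checking labels instead of shapes avoids guessing when the number of samples and genes happen to be similar.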

nargesr commented 10 months ago

Hi @lorenzoamir ,

Thank you for mentioning this!

I'll look into this matter in the next few days and update the related documents :)

Meanwhile, as you mentioned, you can pass the data in separate tsv/csv files.

nargesr commented 10 months ago

Hi @lorenzoamir,

Sorry for the confusion! I was trying to keep mostly the same functions and input formats as the original WGCNA in R.

You were right about the API documentation. I fixed it so that it now says X should be the expression matrix, var holds gene information, and obs holds sample information.

As for goodSamplesGenes, if you look at the wrapper function for the preprocessing steps (preprocess()), you can see that I transpose the expression matrix before passing it to goodSamplesGenes():

goodGenes, goodSamples, allOK = WGCNA.goodSamplesGenes(self.datExpr.to_df().T)
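To illustrate why the transpose changes which return value is which, here is a small sketch with a stand-in filter. toy_row_col_filter is hypothetical and is not PyWGCNA's implementation; the only assumption carried over from this thread is that the first returned value has one entry per row of the input and the second has one entry per column.

```python
import numpy as np
import pandas as pd


def toy_row_col_filter(df, max_missing=0.5):
    """Stand-in, NOT PyWGCNA code: flag rows and columns by missing fraction."""
    good_rows = (df.isna().mean(axis=1) <= max_missing).to_numpy()
    good_cols = (df.isna().mean(axis=0) <= max_missing).to_numpy()
    return good_rows, good_cols, bool(good_rows.all() and good_cols.all())


# Toy expression matrix: 3 samples (rows) x 5 genes (columns).
samples_by_genes = pd.DataFrame(np.arange(15, dtype=float).reshape(3, 5))

# Untransposed: the first value is per sample, the second is per gene.
per_row, per_col, _ = toy_row_col_filter(samples_by_genes)

# Transposed, as in preprocess(): the first value is per gene,
# the second is per sample, matching (goodGenes, goodSamples, allOK).
good_genes, good_samples, all_ok = toy_row_col_filter(samples_by_genes.T)

print(len(per_row), len(per_col))
print(len(good_genes), len(good_samples), all_ok)
```

Under that assumption, calling the function directly on a samples × genes dataframe yields the values in (samples, genes) order, which matches the lengths reported in the first comment.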

I just updated the documentation! It should be reflected on the website in the next few minutes.

Sorry again for the confusion, and thanks for pointing it out. Please feel free to reopen this issue if there is still a problem.

mortazavilab / PyWGCNA

Genes and samples switched in goodSamplesGenes #88