immunogenomics / starCAT

Implements *CellAnnotator (aka *CAT/starCAT), annotating scRNA-Seq with predefined gene expression programs
MIT License

input reference and score path #4

Closed. Bo-UT closed this issue 3 months ago

Bo-UT commented 4 months ago

Hi,

Thanks for the cool tool! I would like to use our own dataset as the reference. Could you remind me how to use the output of cNMF as the reference? It looks like it needs a .yaml file as the score data, but there isn't a .yaml file in the cNMF output folder. Thank you in advance.

dylkot commented 4 months ago

Hi @Bo-UT, thanks for trying it out!

There is an example of using cNMF in the build_reference_vignette.ipynb example. cNMF itself doesn't output any scores, so using its output as a reference just fits the usage of the programs learned by cNMF on new datasets. Any scores derived from this new reference would have to come from your own interpretation and analysis of the cNMF output, and you would need to make your own scores .yaml file.

Does that make sense?

Bo-UT commented 4 months ago

Great! Thanks a lot for your prompt response. I tried

tcat = starCAT(reference='/path/to/test.gene_spectra_score.k_10.dt_0_02.txt',
            cachedir='./cache')
usage, _ = tcat.fit_transform(adata)

However, I encountered another error: ValueError: Negative values in data passed to NMF (input H)

ValueError
File ~/miniconda3/envs/single-cell/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:83, in _check_init(A, shape, whom)
     78 if shape[1] != "auto" and A.shape[1] != shape[1]:
     79     raise ValueError(
     80         f"Array with wrong second dimension passed to {whom}. Expected {shape[1]}, "
     81         f"but got {A.shape[1]}."
     82     )
---> 83 check_non_negative(A, whom)
     84 if np.max(A) == 0:
     85     raise ValueError(f"Array passed to {whom} is full of zeros.")

File ~/miniconda3/envs/single-cell/lib/python3.10/site-packages/sklearn/utils/validation.py:1650, in check_non_negative(X, whom)
   1647 X_min = xp.min(X)
   1649 if X_min < 0:
-> 1650     raise ValueError("Negative values in data passed to %s" % whom)

I did check the input adata and the scaled adata (sc.pp.scale(adata, zero_center=False)), and all the values are non-negative. Do you have any insight into why this error occurs? It seems other people have also hit this kind of error. Thank you.

Bo-UT commented 4 months ago

Hi,

The error is caused by the reference "test.gene_spectra_score.k_10.dt_0_02.txt", which has negative values. It's fixed by using `spectra*.consensus.txt` from the cNMF output.

tcat = starCAT(reference='/path/to/test.spectra.k_10.dt_0_02.consensus.txt',
            cachedir='./cache')
usage, _ = tcat.fit_transform(adata)

spectra*consensus.txt is an intermediate file (#59). Is it safe to use, or would you recommend using gene_spectra_tpm*.txt instead? Thank you.
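
A quick way to confirm this kind of diagnosis is to load the candidate reference file and look for negative entries before handing it to starCAT. The sketch below is illustrative only: it assumes the cNMF text outputs are tab-delimited tables with programs as rows and genes as columns, and it runs on small in-memory toy tables rather than real cNMF files. The helper names `load_spectra` and `negative_summary` are hypothetical, not part of cNMF or starCAT.

```python
import io

import pandas as pd


def load_spectra(path_or_buffer):
    """Load a tab-delimited table, using the first column as the row index.

    Assumption: rows are programs and columns are genes.
    """
    return pd.read_csv(path_or_buffer, sep="\t", index_col=0)


def negative_summary(spectra):
    """Return (number of negative entries, minimum value) for a spectra table."""
    values = spectra.to_numpy(dtype=float)
    return int((values < 0).sum()), float(values.min())


# Toy stand-ins for a score-like file (signed values) and a
# consensus-spectra-like file (non-negative values).
toy_scores = io.StringIO(
    "\tGENE_A\tGENE_B\tGENE_C\n"
    "1\t0.5\t-0.2\t1.1\n"
    "2\t-0.7\t0.3\t0.0\n"
)
toy_consensus = io.StringIO(
    "\tGENE_A\tGENE_B\tGENE_C\n"
    "1\t0.5\t0.2\t1.1\n"
    "2\t0.7\t0.3\t0.0\n"
)

n_neg_scores, min_scores = negative_summary(load_spectra(toy_scores))
n_neg_consensus, min_consensus = negative_summary(load_spectra(toy_consensus))

print(f"score-like table: {n_neg_scores} negative entries, min={min_scores}")
print(f"consensus-like table: {n_neg_consensus} negative entries, min={min_consensus}")
```

To check a real file, pass its path to `load_spectra` instead of the in-memory buffer. Any negative entry in the reference would be expected to trigger the same scikit-learn check shown in the traceback.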

michelle-curtis commented 4 months ago

We actually found it is best to use a variance-normalized version of the gene_spectra_tpm file, subset to include only the highly variable genes. This will be non-negative, so it won't throw the error you previously got. We've implemented this type of output in the development branch of cNMF, which you can install using pip install git+https://github.com/dylkot/cNMF.git@development.

If you are able to rerun the consensus step of cNMF with the updated version, it will now output a starcat_spectra file which can be used as input into starcat.
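
For readers wondering what "variance-normalized and subset to highly variable genes" could look like in practice, here is a minimal sketch on toy data. It is one possible reading of that description, not the cNMF implementation: it assumes a programs-by-genes TPM spectra table, a per-gene standard deviation vector, and a list of highly variable genes, all of which are made up here. The function name `variance_normalize_spectra` is hypothetical.

```python
import pandas as pd


def variance_normalize_spectra(spectra_tpm, gene_std, hvgs):
    """Subset a programs-by-genes table to `hvgs` and divide each gene by its std.

    Assumptions (not confirmed by the thread): `spectra_tpm` has programs as
    rows and genes as columns, and `gene_std` is indexed by gene name.
    """
    kept = [g for g in hvgs if g in spectra_tpm.columns]
    std = gene_std.loc[kept]
    if (std <= 0).any():
        raise ValueError("Per-gene standard deviations must be positive.")
    # Dividing by a positive number keeps non-negative inputs non-negative.
    return spectra_tpm.loc[:, kept].div(std, axis=1)


# Toy inputs: two programs, four genes.
spectra_tpm = pd.DataFrame(
    [[10.0, 0.0, 4.0, 2.0], [0.0, 6.0, 8.0, 1.0]],
    index=["program_1", "program_2"],
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)
gene_std = pd.Series({"GENE_A": 2.0, "GENE_B": 3.0, "GENE_C": 4.0, "GENE_D": 1.0})
hvgs = ["GENE_A", "GENE_C"]

reference = variance_normalize_spectra(spectra_tpm, gene_std, hvgs)
print(reference)
```

Because TPM spectra are non-negative and the standard deviations are positive, the result stays non-negative, which is the property the NMF refit needs.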

Please let us know if you have any issues!

Bo-UT commented 4 months ago

Hi @dylkot, thank you! Will try and keep you posted. It seems the top genes from the gene_spectra_tpm file are different from the top genes from the gene_spectra_score file. Which top genes would you recommend using? Thanks.

dylkot commented 4 months ago

Top genes are typically best from the gene_spectra_score file. Cheers!
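
As an illustration of pulling top genes per program from a score table, here is a small sketch. It assumes the gene_spectra_score file has been loaded into a pandas DataFrame with programs as rows and genes as columns; the toy values and the helper name `top_genes` are hypothetical.

```python
import pandas as pd


def top_genes(gene_scores, n_top=2):
    """Return {program: [genes]} with the `n_top` highest-scoring genes per row."""
    return {
        program: row.nlargest(n_top).index.tolist()
        for program, row in gene_scores.iterrows()
    }


# Toy score table: scores can be negative, unlike the spectra used as a reference.
gene_scores = pd.DataFrame(
    [[0.1, 2.0, -0.5, 1.0], [3.0, -1.0, 0.5, 0.2]],
    index=["program_1", "program_2"],
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

markers = top_genes(gene_scores, n_top=2)
for program, genes in markers.items():
    print(program, genes)
```

For a real file, the table could be loaded with `pd.read_csv(path, sep="\t", index_col=0)`, assuming the tab-delimited layout described above.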

Bo-UT commented 4 months ago

Great. Btw, the development branch of cNMF works well. Thanks a lot!

dylkot commented 3 months ago

It is now pushed to the main branch. Thanks!