sankaranlab / SCAVENGE

SCAVENGE is a method to optimize the inference of functional and genetic associations to specific cells at single-cell resolution.
GNU General Public License v3.0

TF-IDF issue #4

Closed BenxiaHu closed 2 years ago

BenxiaHu commented 2 years ago

Hello Fulong, SCAVENGE is a good tool for deciphering the function of genetic variants at the single-cell level. I have two questions about the SCAVENGE algorithm. 1: What is the binarized sparse matrix? 2: You used TF-IDF to calculate the weight for each feature, but the IDF in your paper looks a little different from the standard form, log(N/(df_i + 1)).

fl-yu commented 2 years ago

Hi, thank you for your interest in our tool!

  1. The binarized sparse matrix is the feature (peak)-by-cell read count matrix in which every nonzero count is set to 1, so each entry in the matrix is either 1 or 0.

  2. Yes, it is slightly different from the standard IDF. This version of the IDF calculation is commonly used for dimension reduction of scATAC-seq data and performs well; see, for example, Cusanovich2018 and Stuart2021. There are actually many variants of the TF-IDF calculation (link). I expect performance to differ slightly between them, although I do not have a comprehensive benchmark.
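For readers following along, the two points above can be sketched in a few lines of plain Python. This is only an illustration, not the SCAVENGE implementation: the function names `binarize` and `tfidf`, the toy matrix, and the choice of `log(1 + N/df)` as the scATAC-seq-style IDF are all assumptions made here, and the exact formula in the paper may differ.

```python
import math


def binarize(counts):
    """Set every nonzero count to 1 in a peak-by-cell matrix (nested lists)."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def tfidf(bmat, textbook_idf=False):
    """TF-IDF weights for a binarized peak-by-cell matrix.

    TF  = entry / number of accessible peaks in that cell (column sum).
    IDF = log(1 + N / df)    assumed scATAC-seq-style variant (default)
          log(N / (df + 1))  the form quoted in the question
    where N is the number of cells and df is the number of cells
    in which the peak is accessible.
    """
    n_cells = len(bmat[0])
    col_sums = [sum(col) for col in zip(*bmat)]  # accessible peaks per cell
    weighted = []
    for row in bmat:
        df = sum(row)
        if df == 0:
            # A peak that is closed everywhere carries no weight.
            weighted.append([0.0] * n_cells)
            continue
        if textbook_idf:
            idf = math.log(n_cells / (df + 1))
        else:
            idf = math.log(1 + n_cells / df)
        weighted.append(
            [(x / col_sums[j]) * idf if col_sums[j] else 0.0
             for j, x in enumerate(row)]
        )
    return weighted


# Hypothetical toy data: 3 peaks (rows) by 3 cells (columns).
counts = [[2, 0, 1],
          [0, 0, 3],
          [1, 1, 1]]
bmat = binarize(counts)
scatac = tfidf(bmat)
textbook = tfidf(bmat, textbook_idf=True)
```

In this toy example the third peak is accessible in every cell. Under `log(N/(df + 1))` its IDF is `log(3/4)`, which is negative, whereas `log(1 + N/df)` stays positive for any peak that is accessible in at least one cell.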

Hope this is helpful. Please let me know if you have any other questions. Thanks!

BenxiaHu commented 2 years ago

Thanks a lot for your explanation. Could you explain a little about how you build the nearest-neighbor graph from the LSI matrix of N cells and d leading LSI components (d = 30)? Your paper does not seem to mention how the LSI matrix is obtained; maybe I missed an important step. Best,