Suggestions on train data construction

bitcometz commented 3 years ago

Hello, ItClust is a powerful tool. I think it needs a sufficiently large training database to work well. Do you have any suggestions for building the training data? For example:

1. Do I have to keep all the genes (vars), or just the highly variable genes?
2. How fine-grained should the cell type labels (celltype) be?
3. How do I integrate data obtained with different experimental methods?

Thanks!!!

Best

jianhuupenn commented 3 years ago

Thank you for your interest in ItClust. Regarding your questions:

1. ItClust has its own method for determining how many HVGs to use. Briefly, the more cells in the data, the more HVGs are used. There is a detailed description in the Methods section of the paper.
2. I think this depends on your goal. ItClust learns how many cell types are present in the training dataset, and then tries to identify these cell types in the target data. Of course, you can modify the cell type labels in the training data (for example, split T cells into CD8+ T, CD4+ T, and T helper) to make ItClust learn different information.
3. You can combine multiple datasets into one training dataset to cover a comprehensive set of cell types. In our paper, we tested ItClust in this scenario and the performance was pretty good. One thing to note is that, for cell types present in multiple datasets, I would prefer taking them from only one dataset to avoid batch effects. For example, if dataset1 has cell types A, B and C, and dataset2 has cell types C and D, I would exclude cell type C from dataset2 before combining.
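
As a rough illustration of the last two points, here is a minimal sketch in plain Python. It is not ItClust code: the toy records, the `relabel` helper and the `combine_without_overlap` helper are all hypothetical names made up for this example, and real data would carry expression values alongside the labels (for instance in an AnnData object). It shows merging fine labels into a coarser one, and keeping each cell type only from the first dataset that contains it.

```python
from collections import Counter

# Hypothetical toy records: (cell_id, cell_type). Expression values are omitted.
dataset1 = [("d1_c1", "A"), ("d1_c2", "B"), ("d1_c3", "C"), ("d1_c4", "C")]
dataset2 = [("d2_c1", "C"), ("d2_c2", "D"), ("d2_c3", "D")]


def relabel(cells, mapping):
    """Change label granularity; labels missing from `mapping` are kept as-is."""
    return [(cell_id, mapping.get(cell_type, cell_type)) for cell_id, cell_type in cells]


def combine_without_overlap(datasets):
    """Concatenate datasets, keeping each cell type only from the first dataset that has it."""
    seen_types = set()
    combined = []
    for cells in datasets:
        types_here = {cell_type for _, cell_type in cells}
        new_types = types_here - seen_types  # types not provided by an earlier dataset
        combined.extend(cell for cell in cells if cell[1] in new_types)
        seen_types |= new_types
    return combined


# Cell type C from dataset2 is dropped because dataset1 already provides it.
training = combine_without_overlap([dataset1, dataset2])
type_counts = Counter(cell_type for _, cell_type in training)
print(type_counts)

# Merging subtypes into one parent label (going the other way needs finer annotations).
coarse = relabel(
    [("t1", "CD8+ T"), ("t2", "CD4+ T"), ("b1", "B")],
    {"CD8+ T": "T", "CD4+ T": "T"},
)
print(coarse)
```

The order of the input list decides which dataset "wins" for a shared cell type, so in this sketch you would put the dataset you trust most for that type first.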

pigraul commented 3 years ago

Hi all, if you don't mind, I'd like to join the discussion. Regarding the third point, I saw a related description in your article, but I have a question: if you exclude the other datasets' cells of the same cell type, won't that limit the size and richness of the training set, and so weaken the model's performance? Would it be possible to use a published method (such as Seurat) to remove the batch effect and then train on a large amount of data? Generally speaking, the larger the training set, the stronger the model's predictive ability.

These are just my thoughts; comments are welcome.

Thanks !!!

jianhuupenn commented 3 years ago

Hi pigraul,

I think data size is not a big issue here. Hundreds of cells for each cell type are enough for training.

Regarding batch effect removal, we have 2 steps to do so:

1. Normalize within each dataset before combining.
2. ItClust has an autoencoder structure, which is able to remove batch effects to some extent.
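
To make the first step concrete, here is a minimal sketch of normalizing each dataset on its own before concatenating. The specific recipe is an assumption, not something stated in this thread: per-cell library-size scaling, a log1p transform, then per-gene z-scoring using only that dataset's own mean and standard deviation. The function name `normalize_within_dataset`, the `target_sum` parameter and the toy count matrices are hypothetical.

```python
import math
from statistics import mean, pstdev


def normalize_within_dataset(counts, target_sum=10_000.0):
    """Normalize one dataset; `counts` is a list of per-cell lists of raw gene counts."""
    # Step 1: scale every cell to the same total count, then log-transform.
    logged = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        logged.append([math.log1p(value * scale) for value in cell])

    # Step 2: z-score each gene with this dataset's own mean and standard deviation.
    n_genes = len(logged[0])
    columns = [[cell[g] for cell in logged] for g in range(n_genes)]
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [
        [(cell[g] - means[g]) / stds[g] if stds[g] > 0 else 0.0 for g in range(n_genes)]
        for cell in logged
    ]


# Hypothetical raw counts: rows are cells, columns are the same genes in both datasets.
dataset1_counts = [[10, 0, 90], [5, 5, 0], [20, 30, 50]]
dataset2_counts = [[200, 300, 500], [100, 0, 900]]

# Each dataset is normalized separately, and only then are the cells concatenated.
norm1 = normalize_within_dataset(dataset1_counts)
norm2 = normalize_within_dataset(dataset2_counts)
combined = norm1 + norm2
```

Because the gene-wise statistics come from each dataset separately, a dataset with systematically higher counts does not shift the scale of the others after combining.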

You can definitely use other methods to integrate data, but I do not recommend Seurat. Seurat uses CCA to remove batch effects between datasets, and in my experience it always over-corrects. When removing batch effects, CCA also removes differences between cell types, which hurts downstream analysis, e.g. clustering.

bitcometz commented 3 years ago

Hi @jianhuupenn, thanks for your reply!!!

From the third point of your reply above, I understand that, unlike other machine learning methods, this software does not need a large amount of data to build a training set.

Best

jianhuupenn / ItClust

Suggestions on train data construction #10