satijalab / scChromHMM

🧬 🦀 A fast and efficient tool for genome-wide single-cell chromatin state analysis using multimodal histone modification data.
BSD 3-Clause "New" or "Revised" License
24 stars 5 forks

About scChromHMM for the interpolated multiple histone modifications in single cells #4

Open Gavin-Lijy opened 1 year ago

Gavin-Lijy commented 1 year ago

Hi, fantastic work! But I have a question about the input of the model. In your paper, you say, "The input to scChromHMM is the set of interpolated histone modifications in single cells, as described above." However, in this repo, the input is only the anchor cells and the separate single-histone data. Could you help us reproduce the paper by providing the scripts for the anchor-based imputation for all 20,000 CITE-seq cells, and the code that takes them as input? Thank you so much!

Gavin-Lijy commented 1 year ago

After going through your code briefly, I think I basically understand how you do this. The interpolation part and the scChromHMM part are done separately, although the anchors are used in both. The former yields 20,000 CITE-seq cells with multiple histone marks, but the latter only produces scChromHMM results for 7201 cells, which is the union of the anchors for the respective marks.

k3yavi commented 1 year ago

Hi @Gavin-Lijy ,

Thank you for your kind words and interest in our work.

Your observation is correct, i.e., the input to the method is only the anchor cells and the individual histone data. The tool computes the interpolated counts internally, and the code for that is https://github.com/satijalab/scChromHMM/blob/main/src/fragment.rs#L36-L80. This is intentional, for the following reasons:

  1. Size of the genome: scChromHMM performs analyses on all 200-bp regions of the genome across 20k cells. The number of quantifiable regions is very large, especially after interpolation, which makes storing the data on disk a significant burden, as it can easily reach hundreds of gigabytes.
  2. Performance of the method: Even if we could store the data on disk, loading it back into memory would be a huge computational burden for the method, as disk access is much slower than in-memory operations. This optimization keeps the performance of the tool at a reasonable level.
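For intuition, a rough back-of-envelope supports the storage concern. All numbers below are assumptions for illustration (not taken from the repo or paper): a ~3.1 Gbp human genome, 200-bp windows, 20,000 cells, and one 4-byte value per (window, cell) pair per mark.

```python
# Back-of-envelope for the disk-storage argument (assumed numbers, not
# from the repo): human genome ~3.1 Gbp, 200-bp windows, 20,000 cells,
# one 4-byte value per (window, cell) pair and per histone mark.
GENOME_BP = 3_100_000_000
WINDOW_BP = 200
CELLS = 20_000
BYTES_PER_VALUE = 4

windows = GENOME_BP // WINDOW_BP              # 15,500,000 windows
dense_tb = windows * CELLS * BYTES_PER_VALUE / 1e12
print(f"{windows:,} windows; dense matrix per mark ~{dense_tb:.2f} TB")

# After interpolation the matrix is far less sparse than raw fragment
# counts; even keeping only ~1% of entries at ~8 bytes each (value +
# window index) is tens of GB per mark, so several marks together can
# reach hundreds of gigabytes.
sparse_gb = windows * CELLS * 0.01 * 8 / 1e9
print(f"sparse (1% nonzero) per mark ~{sparse_gb:.0f} GB")
```

Even a heavily sparsified on-disk representation is large, which makes the choice to interpolate on the fly plausible.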

I can certainly imagine the excitement of accessing the interpolated data, i.e., multiple histone modification marks for each cell. However, this version of the method is a proof-of-concept study, and it is not possible to write the interpolated counts to an intermediary file. Nevertheless, I am working on another version of the method here, which dumps the required files. It's relatively early days for the new method, but I'll let you know once we have a reasonable version.

Gavin-Lijy commented 1 year ago

And I think the reason you don't use all 20,000 imputed cells is that you need the fragment files for the forward-backward algorithm; however, that information is lost during interpolation (we can only get the number of fragments for each histone modification in each 200-bp genomic window). Is that right?

Gavin-Lijy commented 1 year ago

And I have only one question left. I am not familiar with Rust, but as far as I can tell, for each anchor cell in CITE-seq you only get anchors from 1 or 2 histone marks on average. That means that, for each anchor cell, the scChromHMM assignment is based only on information from 1 or 2 histone marks, right?

Gavin-Lijy commented 1 year ago

> Your observation is correct, i.e., the input to the method is only the anchor cells and individual histone data. The tool computes the interpolated counts internally, and the code base for that is https://github.com/satijalab/scChromHMM/blob/main/src/fragment.rs#L36-L80.

Ok, I didn't notice you compute the interpolated counts internally! That's perfect, thank you!

Gavin-Lijy commented 1 year ago

> Nevertheless, I am working on another version of the method here, which dumps the required files. It's relatively early days for the new method, but I'll let you know once we have a reasonable version.

I can't access it; perhaps it's private. Anyway, I really appreciate your thorough explanation and am thrilled to hear that you are still working to maintain and improve the method. Thank you so much, it will pay off!

k3yavi commented 1 year ago

I am not sure I understand your follow-up question correctly, but just to clarify: we interpolate counts for all 20k CITE-seq cells. The data is sparse, you are correct in that regard; however, we counter that by finding a generally high number of anchors, ~500 (not the FindAnchors default), between the query and reference data. There will certainly be a bunch of low-quality alignments, but our reference-mapping framework allows weighting the interpolation based on the score of each cell-to-cell alignment.

I hope it helps, please feel free to reach out if you have any other questions.

Gavin-Lijy commented 1 year ago

> I am not sure I understand your follow-up question correctly, but just to clarify: we interpolate counts for all 20k CITE-seq cells. The data is sparse, you are correct in that regard; however, we counter that by finding a generally high number of anchors, ~500 (not the FindAnchors default), between the query and reference data. There will certainly be a bunch of low-quality alignments, but our reference-mapping framework allows weighting the interpolation based on the score of each cell-to-cell alignment.
>
> I hope it helps, please feel free to reach out if you have any other questions.

I think the interpolation for the 20,000 cells and the 7201 cells used for scChromHMM should be discussed separately. As far as I understand:

  1. Are the anchors shared between processes (a) and (b)? If not, what are they, respectively?
  2. In the imputation of scChromHMM, how many histone mark types can you get? I think that for each anchor cell in your CITE-seq data, you only have anchors from 1 or 2 marks on average. Given that, how can you get information on the other histone marks for which no anchors were found for that CITE-seq cell? Or can you get the chromatin states based on only 1 or 2 marks' information?

k3yavi commented 1 year ago

Ok, here's a counter-question that might help you understand better: once you have interpolated histone mark counts across all 20k cells (i.e., part (a) in your schema above), why would you need to perform imputation again for scChromHMM (i.e., part (b))? I don't think we perform part (b) as you described above.

Broadly speaking, scChromHMM has three parts: (1) assemble, (2) learn, and (3) predict.

(1) Assemble: this is what you are describing above, i.e., given the individual marks and a mapping file between the cells profiled in each histone mark and the cells in the CITE-seq reference, you interpolate/impute counts from the cells in the marks onto the 20k cells of the CITE-seq reference.

(2) Learn: in parallel, we run bulk ChromHMM to learn the hidden chromatin states.

(3) Predict: In your explanation above, I think this is what you are calling (b). As long as we have the interpolated counts for the 20k CITE-seq cells, we don't need to interpolate again and can run the forward-backward algorithm directly on these interpolated counts.
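As a rough illustration of the Predict step, here is a minimal, generic forward-backward sketch. This is not scChromHMM's Rust implementation: the toy 2-state model and all probabilities below are invented. In the real tool, the per-window emission likelihoods would come from the learned ChromHMM model applied to a cell's interpolated mark counts.

```python
# Minimal forward-backward sketch (illustrative only, see note above).

def forward_backward(init, trans, emit):
    """Posterior state probabilities per window.

    init:  [S] initial state distribution
    trans: [S][S] state transition matrix
    emit:  [T][S] likelihood of window t's observed marks under each state
    """
    T, S = len(emit), len(init)

    # Forward pass, normalized per step to avoid numerical underflow.
    alpha = [[init[s] * emit[0][s] for s in range(S)]]
    for t in range(1, T):
        row = [emit[t][s] * sum(alpha[t - 1][p] * trans[p][s] for p in range(S))
               for s in range(S)]
        z = sum(row)
        alpha.append([a / z for a in row])

    # Backward pass, also normalized per step.
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        row = [sum(trans[s][n] * emit[t + 1][n] * beta[t + 1][n] for n in range(S))
               for s in range(S)]
        z = sum(row)
        beta[t] = [b / z for b in row]

    # Posterior per window: normalize alpha * beta.
    posteriors = []
    for t in range(T):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        posteriors.append([x / z for x in g])
    return posteriors

# Toy 2-state example: the last window's emissions strongly favor state 1.
posteriors = forward_backward(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    emit=[[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]],
)
```

The key point matching the discussion above: once the per-window emission likelihoods exist for a cell (from the interpolated counts), no further access to raw fragment files or anchors is needed to compute the posteriors.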

To answer your questions:

(1) The anchor process is upstream of all three steps I explained above, i.e., it is performed using the FindTransferAnchors function of Seurat. Anchors are only used in the Assemble phase of scChromHMM and not for predicting the posterior probabilities. Now, if you look directly at the code base, I've blurred the boundaries between the Assemble and Predict stages of the algorithm to optimize the implementation, which I think is the point of confusion for you.

(2) This question needs to be clarified for me. We don't perform imputation again in the Predict stage. If the question is about the Assemble stage, then, as I mentioned before, we generally find some anchors for most of the CITE-seq cells, as we increase the possible number of anchors and weight them based on their scores.

Gavin-Lijy commented 1 year ago

> (1) Assemble: this is what you are describing above, i.e., given the individual marks and a mapping file between the cells profiled in each histone mark and the cells in the CITE-seq reference, you interpolate/impute counts from the cells in the marks onto the 20k cells of the CITE-seq reference.

I guess I found my point of confusion. As you said, you perform imputation only once, for all 20k cells, embedded in the assembling part of this repo; but I don't see how you can do that. How can you impute the histone mark counts of all 20k cells based only on these inputs: the individual marks and the anchor files? You don't know anything about any cells besides the anchor cells. And if you can get the imputation results for all 20k cells here, why do you only get the states for the 7201 anchor cells?

k3yavi commented 1 year ago

I need help understanding where the 7201 number is coming from. Are you talking about the example folder in the GitHub repo? If so, we intentionally put a trimmed version of the data on GitHub so that others can regenerate a toy example of the workflow.

Gavin-Lijy commented 1 year ago

> I need help understanding where the 7201 number is coming from. Are you talking about the example folder in the GitHub repo? If so, we intentionally put a trimmed version of the data on GitHub so that others can regenerate a toy example of the workflow.

Yes. So this repo doesn't include the code for the 20K-cell imputation?

k3yavi commented 1 year ago

Well, code is always separate from data. This code can be run on different datasets; it is not specific to the data we generated.

The datasets in the example folder of this repo, like in any other software repo, are there to generate a toy example so that others can replicate the workflow and understand the format of the input data required by the software.

Gavin-Lijy commented 1 year ago

> Well, code is always separate from data. This code can be run on different datasets; it is not specific to the data we generated.
>
> The datasets in the example folder of this repo, like in any other software repo, are there to generate a toy example so that others can replicate the workflow and understand the format of the input data required by the software.

But I still don't understand how you do the imputation here. I think you need the protein information to compute the distance between each cell c and each anchor i, and then apply a Gaussian kernel to get a weighted combination of the anchors, as described in the "Anchor Weighting" and "Feature Imputation" sections of https://www.sciencedirect.com/science/article/pii/S0092867419305598?via%3Dihub, as you declared here:

> We used these anchors to interpolate (i.e., 'impute') the number of fragments for each histone modification, in each 200-bp genomic window, at single-cell resolution (see Stuart et al. [38](https://www.nature.com/articles/s41587-022-01250-0#ref-CR38), 'Feature Imputation').
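For reference, the anchor-weighted feature imputation the quote refers to can be sketched roughly as follows. This is an illustrative sketch in the spirit of Stuart et al. 2019, not Seurat's or scChromHMM's actual implementation: the function name, the kernel form, the bandwidth, and all numbers are invented, and the real anchor-weighting scheme differs in its details.

```python
import math

# Illustrative anchor-weighted imputation (in the spirit of Stuart et
# al. 2019, "Anchor weighting" / "Feature Imputation"): a query cell's
# value is a weighted average over its anchors, with a Gaussian kernel
# on the cell-to-anchor distance scaled by each anchor's alignment
# score. Names and numbers here are invented for illustration.

def impute_feature(distances, anchor_scores, anchor_values, sigma=1.0):
    """Impute one feature (e.g. the fragment count in one 200-bp
    window) for a single query cell from its anchors."""
    weights = [
        score * math.exp(-(d * d) / (2.0 * sigma * sigma))
        for d, score in zip(distances, anchor_scores)
    ]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, anchor_values)) / total

# A query cell very close to a high-scoring anchor carrying value 4.0
# is pulled toward 4.0, while distant / low-scoring anchors matter less.
val = impute_feature(
    distances=[0.2, 1.5, 2.0],       # cell-to-anchor distances
    anchor_scores=[0.9, 0.5, 0.4],   # anchor alignment scores
    anchor_values=[4.0, 1.0, 0.0],   # the feature's value in each anchor
)
```

Under this kind of scheme, any query cell with at least one anchor in a mark can receive an imputed value for that mark, which is consistent with the maintainer's point that most CITE-seq cells get covered once enough (score-weighted) anchors are allowed.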