Reproducing the analysis from Nature Biotech paper

apredeus commented 1 year ago

Hi,

Thank you for a really great tool and an exciting dataset! I'm trying to reproduce the analysis done in the Nature Biotech paper (https://www.nature.com/articles/s41587-022-01250-0). I've downloaded the processed data from Zenodo (https://zenodo.org/record/5504061), and also compiled schrom and successfully ran the small example provided with the code.

Now, I'm trying to re-do the analysis done in the paper, and have few questions:

from what I understand, to generate the model file used by scChromHMM, I need to run bulk ChromHMM. I've read the paper methods and see that ChromHMM was ran on "level 2 annotation" - thus, I think it was ran using same 12 states on the each cell type? It's also not clear how many cells were used for pseudobulk aggregation to generate input for the bulk ChromHMM. Also, which model file is used, since it seems like each cell type pseudobulk should generate its own model?
how exactly are anchor files generated from the integrated Seurat objects? The integrated RDS files don't seem to contain barcodes that look like E2L4_AGCGTATCACAGTCCG which I assume are scCut&Tag-pro barcodes - what am I missing? Also, what exactly is the numerical value in the 3rd column of the anchor file that's fed to scChromHMM?

Thank you in advance - any help would be much appreciated!

All the best,

-- Alex

k3yavi commented 1 year ago

Hi @apredeus ,

Thank you for your kind words and for reaching out.

To answer your first question, we don't run ChromHMM individually on each cell type. Instead, we ran ChromHMM in cellmarkfiletable mode by providing the list of cell-type specific bed files for each histone mark within a single run of ChromHMM. More information can be found here in the ChromHMM manual. As you have mentioned, we use all 'level 2 annotated' cells to pseudobulk within their cell-type annotation.

Secondly, the Anchor file is basically a tab-delimited file which is output of the function FindTransferAnchors with the following format: <Reference Barcode> <Query Barcode> <Mapping Score>.

The anchor file summarizes which cells in the query dataset map to which cells in the reference dataset and what's the mapping score (the third column). We use the mapping score to quantitatively interpolate fragment counts from every scCUT&Tag-pro experiment into the CITE-seq defined reference, and barcodes like E2L4_AGCGTATCACAGTCCG are from the reference CITE-seq experiment.

I hope this answers your questions, if not, feel free to reach out again.

apredeus commented 1 year ago

Hi Avi,

Thank you very much for your clarifications, I know what to do now to try and reproduce the analysis.

Last question: how did you decide on 12 states? The emission table seems to have 7 meaningful states and 5 near-empty ones - am I missing something here?

Thanks again for your help!

k3yavi commented 1 year ago

I am glad to hear that. Unfortunately, your last question is a bit tricky to answer as it's hard to know beforehand the number of hidden chromatin states. In my opinion, this is not specific to ChromHMM, it's an open problem, and there are different ways to heuristically solve the problem, for example, using AIC or BIC criteria, and even then it may lead to multiple near-empty ones as you mentioned.

Generally, if we quantify the posterior probability of near-empty states, they are trivially low and don't make much sense overall anyway. We use the states which are meaningfully quantifiable into different functional states. Ideally, again, in my opinion, the user should be able to specify the relevant combination of marks instead of learning unsupervised chromatin states (I am working on that for the next version) but I'd imagine when ChromHMM was designed we were not aware of relevant combinatorics of histone modifications and motivated the idea.

apredeus commented 1 year ago

Thank you for the detailed and thoughtful reply! It is indeed a frustrating thing to try and guess the number of states. I worked with (bulk) ChromHMM in the past, and it's totally a guessing game with lots of starting at the browser and some known regulatory elements, etc.

As for the relevant combinations, I think they were the original ENCODE support people, so they wanted to be as unbiased as possible, right? Few interesting and unusual states were discovered like that, but not much above the combinations that were already known.

At any rate, big thanks for entertaining my questions! Have a great holiday break.

apredeus commented 1 year ago

Hi @k3yavi - I was wondering if it would be possible to share the code used to generate BED files from pseudobulks (is there peak calling involved?) and run ChromHMM on the data used in the paper? We're not quite sure we're doing everything correctly. I know it's probably a bunch of miscellaneous scripts in few languages but anything would be helpful.

Gavin-Lijy commented 1 year ago

Yes, I have a related question here https://github.com/satijalab/scChromHMM/issues/4#issue-1611149599

k3yavi commented 1 year ago

Hi @apredeus and @Gavin-Lijy ,

Thank you for your patience, and I apologize for the late response during the interview season.

@Gavin-Lijy I'll reply to you on the https://github.com/satijalab/scChromHMM/issues/4#issue-1611149599 directly.

@apredeus Generating the BED files from pseudo bulks was quite straightforward. The general idea is to use the function FilterCells from the Signac package, something like the following.

mark <- "k4me1"
celltype <- "B"

obj <- readRDS("<path to mark object>")
keep.cells <- names(obj$celltype[obj$celltype == celltype])

frag <- GetFragmentData(object = Fragments(obj)[[1]], slot = "path")
FilterCells(
  fragments = frag,
  cells = keep.cells,
  outfile = "<output file path>",
  buffer_length = 256L,
  verbose = TRUE
)

satijalab / scChromHMM

Reproducing the analysis from Nature Biotech paper #3