Questions about training bias models

DongzeHE commented 1 year ago

Hi,

Yup it's me again ;P.

I am processing a series of 10x single-cell ATAC-seq datasets from the same tissue but with different genotypes.

Regarding to ChromBPNet bias model training, would you mind me ask a few questions that I am uncertain about?

Do you have any pretrained 10x scATAC-seq bias model available for downloading?
I tried to use a fragment BED file from one of the experiments I have, which is of size 5 .7GB, to train a bias model. It turned out that ChromBPNet first converted this file into BigWig format and made BedGraph, which took many hours to run. I am wondering if it is OK to use only a small random subsample of the fragments to train the bias model, so that the time used for converting huge BED files can decrease? If this is possible, how many fragments should I select for training a decent bias model?
If I want to train one ChromBPNet model for each (genotype, cell type) group in my single-cell experiments, so as to capture the genotype/cell type specific motifs, what should be the resolution to train the bias model: should I train a bias mode for each cell type, each single-cell sample (one run on the machine or one experiment containing cells from multiple cell types), or just one bias model for all samples (all runs from the machine or the whole series of experiments)? If one for all samples, is this OK to select a small subset of fragments from each of the experiments? Will that cause any problems as they are from different experiments(domains I guess)?

Thanks in advance!

Best, Dongze

DongzeHE commented 1 year ago

I tried to train a bias model using a subset of my fragments. However I encountered this error. I think the problem here is the peak files and the fragment file have to match? Is this possible to skip those empty regions, or at least check if the list is empty before indexing?

Traceback (most recent call last):
  File "/miniconda3/envs/chrombpnet/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/CHROMBPNET.py", line 38, in main
    pipelines.train_bias_pipeline(args)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/pipelines.py", line 306, in train_bias_pipeline
    train.main(args_copy)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/train.py", line 87, in main
    train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 89, in load_data
    train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 52, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/one_hot.py", line 19, in dna_to_one_hot
    seq_len = len(seqs[0])
IndexError: list index out of range

DongzeHE commented 1 year ago

After spending a whole day on this, I realized that the error was caused by the outlier_threshold parameter around this line : after filtering using this parameter, the nonpeaks dataframe can be empty. As there is no such a check, the empty df will be returned and be read in the next step, which causes the OOB indexing error. I suggest to have a check after the filtration.

panushri25 commented 1 year ago

Hello @DongzeHE ,

(1) I will get back to you on the availability of this bias model.

(2) I wouldn't recommend subsampling the fragments. How many fragments do you have in the original file? In your case the bound cut-off being stringent happened because of the subsampling, which wouldn't have happened otherwise.

panushri25 commented 1 year ago

Also what are the specs of the machine you are using ? Can you try this branch and tell me If this is any faster? - https://github.com/kundajelab/chrombpnet/tree/suragnair-patch-1

panushri25 commented 1 year ago

There are many issues with subsampling and using the same peaks and non-peaks, so I wouldn't recommend doing that, but lets look into why it is taking so many hours. If you can provide the information I asked above maybe I can help.

panushri25 commented 1 year ago

Hello @DongzeHE

You can use the 10x single ATAC-seq bias model here - https://storage.googleapis.com/chrombpnet_data/input_files/bias_models/ATAC/scATAC_dermal_fibroblast.h5

jasondanic commented 1 year ago

Hello @DongzeHE

You can use the 10x single ATAC-seq bias model here - https://storage.googleapis.com/chrombpnet_data/input_files/bias_models/ATAC/scATAC_dermal_fibroblast.h5

Hi， @panushri25

Wondering whether the bias model you provided above is specific to some certain species like mouse?

Or, are bias models applicable across different species (eg, hg38 and mm10) in the 10x scATAC-seq data? Because I would like to try chrombpnet pipeline on a 10x scATAC-seq data from human samples.

vitkl commented 1 year ago

Tutorials say that the bias models are correcting not just Tn5 intrinsic sequence preference bias but also other ATAC-seq technical effects such as PCR bias (GC content). This raises the question - what is the most optimal way to train bias models and is it possible to create a universal set of background sequences?

In some way, what you do here is similar to CellBender model for scRNA background abundance correction. CellBender uses data from 10x droplets that don't contain cells to derive the free-floating background RNA profile. That bias must be estimated for every single 10x reaction/inlets.

Here you use DNA sequences likely to contain no biological information to correct any differences observed in such non-peak regions. Intuitively, the strength of PCR bias [and possibly other detection issues] can vary between 10x reactions.

Should bias models be similarly trained for every 10x reactions? What do you think in general?

akundaje commented 1 year ago

Bias models often transfer across experiments but also we find many cases where they need to be retrained when there are strong experiment specific effects.

Anshul

On Fri, Jul 28, 2023, 6:52 PM Vitalii Kleshchevnikov < @.***> wrote:

Tutorials say that the bias models are correcting not just Tn5 intrinsic sequence preference bias but also other ATAC-seq technical effects such as PCR bias (GC content). This raises the question - what is the most optimal way to train bias models?

In some way, what you do here is similar to CellBender model for scRNA background abundance correction. CellBender uses data from 10x droplets that don't contain cells to derive the RNA free-floating background profile. That bias must be estimated for every single 10x reaction/inlets.

Here you use DNA sequences likely to contain no biological information to correct any differences observed in such non-peak regions. Intuitively, the strength of PCR bias [and possibly other detection issues] can vary between 10x reactions.

Should bias models be similarly trained for every 10x reactions? What do you think in general?

— Reply to this email directly, view it on GitHub https://github.com/kundajelab/chrombpnet/issues/121#issuecomment-1656415676, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDWEMY2CJ5TXINOYZWRF3XSQ7C7ANCNFSM6AAAAAA2QOKCMM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

vitkl commented 1 year ago

Is it possible to create a universal set of background sequences or should the background sequences similarly be batch/reaction-specific (peak calling per reaction)?

panushri25 commented 1 year ago

Hello @vitkl

So the requirement is that the background sequences should match the GC-contentedness of the representative peaks. As long as you are able to sample this, the bias model should transfer.

We did try to train a universal bias model across several cell types. What ended up happening was that removing all the peak regions pooled across these varied cell-types resulted in removing a large part of backgrounds that are potentially GC-rich. And the bias model ended up learning AT-rich bias and failing to transfer.

So it depends on what universal here means - if all your reactions have many overlapping peaks eliminating these regions will not eliminate a large part of the background with similar GC-content. So you will still have high chance of finding a background with a good GC-match for your peaks.

Hope this helps!

Thank you, Anu

panushri25 commented 1 year ago

@jasondanic

The link provided is for a bias model trained on pseudo-bulked single-cell human dermal fibroblasts data.

vitkl commented 1 year ago

Hi @panushri25

Thanks for explaining. I am looking at the cell atlases of embryonic development that cover all tissues/cell lineages at a given developmental stage. For example, mouse gastrulation atlas https://www.biorxiv.org/content/10.1101/2022.06.15.496239v2.abstract contains ~30 major populations and ~70 fine-grained regionally distinct subtypes - including multiple populations of the following linages: mesoderm (incl heart), endoderm (incl hematopoietic linage), ectoderm, neuronal and early developmental (epiblast, other E7.5 cells). Do you suggest defining a different background model for every broad cell lineage (eg ectoderm), every population (eg forebrain neuron progenitors) or every population * 10x batch?

panushri25 commented 1 year ago

@vitkl Sorry it took me a while to get back to this. I see that you opened a new issue, is your question answered?

panushri25 commented 1 year ago

Please feel free to open this, if you have more questions.

vitkl commented 1 year ago

Do you suggest defining a different background model for every broad cell lineage (eg ectoderm), every population (eg forebrain neuron progenitors) or every population * 10x batch?

This question is still open. How should the baseline model be defined - based on aggregated fragments file for all experiments or a different model per cell type? Should the negative regions be similarly defined as identical for all cell types (eg peak calling based on all experiments) or per cell type?

Based on your responses, I don't fully understand whats the optimal strategy here.

panushri25 commented 1 year ago

You are eventually interested in studying the TF dynamics per cell type (I,e you will train a ChromBPNet model per every cell-type) correct?

vitkl commented 1 year ago

You are eventually interested in studying the TF dynamics per cell type

Yes exactly, also in comparing cell types similarly to your reprogramming work. The question is what would be the correct bias models.

akundaje commented 1 year ago

If all the cell types are coming from the same single cell experiment or a collection of experiments done by the same person with very similar protocols for sample prep, library prep and sequencing a single bias model trained on a deep cell type will usually generalize very well.

So that is the first thing I would try. You can easily test if the bias correction is working as expected using the diagnostic strategies in the ChromBPNet tutorials and reports.

For cell types/samples where it fails, you can train a custom bias model from that set.

On Sat, Aug 12, 2023, 12:09 AM Vitalii Kleshchevnikov < @.***> wrote:

You are eventually interested in studying the TF dynamics per cell type

Yes exactly, also in comparing cell types similarly to your reprogramming work. The question is what would be the correct bias models.

— Reply to this email directly, view it on GitHub https://github.com/kundajelab/chrombpnet/issues/121#issuecomment-1675675824, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDWEKATJAL6QUSMCS72XDXU36YJANCNFSM6AAAAAA2QOKCMM . You are receiving this because you commented.Message ID: @.***>

vitkl commented 1 year ago

This is quite instructive, thanks for clarifying!

kundajelab / chrombpnet

Questions about training bias models #121