Closed DongzeHE closed 1 year ago
I tried to train a bias model using a subset of my fragments, but I encountered the error below. I think the problem is that the peak files and the fragment file have to match? Is it possible to skip those empty regions, or at least check whether the list is empty before indexing?
Traceback (most recent call last):
  File "/miniconda3/envs/chrombpnet/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/CHROMBPNET.py", line 38, in main
    pipelines.train_bias_pipeline(args)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/pipelines.py", line 306, in train_bias_pipeline
    train.main(args_copy)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/train.py", line 87, in main
    train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 89, in load_data
    train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 52, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/miniconda3/envs/chrombpnet/lib/python3.8/site-packages/chrombpnet/training/utils/one_hot.py", line 19, in dna_to_one_hot
    seq_len = len(seqs[0])
IndexError: list index out of range
After spending a whole day on this, I realized that the error was caused by the outlier_threshold parameter: after filtering with this parameter, the nonpeaks dataframe can be empty. Since there is no check for this, the empty dataframe is returned and read in the next step, which causes the out-of-bounds indexing error. I suggest adding a check after the filtering step.
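The suggested check could look roughly like the sketch below. This is not ChromBPNet's actual code: the function name `filter_nonpeaks_by_count`, its arguments, and the error message are assumptions made for illustration. Only the idea (fail with a clear message when the threshold filter leaves no regions) comes from the report above.

```python
import numpy as np
import pandas as pd


def filter_nonpeaks_by_count(nonpeaks_df, counts, upper_threshold):
    """Hypothetical helper: keep non-peak regions whose count is below
    upper_threshold, and fail early if nothing is left."""
    keep = np.asarray(counts) < upper_threshold
    filtered = nonpeaks_df[keep].reset_index(drop=True)
    if filtered.empty:
        # Raise here instead of letting an empty list reach seqs[0] later.
        raise ValueError(
            f"All {len(nonpeaks_df)} non-peak regions were removed by the "
            f"count threshold ({upper_threshold}); relax the threshold or "
            "check that the regions match the fragment file."
        )
    return filtered


# Toy data (made up): three non-peak regions and their total counts.
nonpeaks = pd.DataFrame({"chr": ["chr1", "chr1", "chr2"], "start": [100, 900, 400]})
counts = [12, 250, 30]

print(filter_nonpeaks_by_count(nonpeaks, counts, upper_threshold=100))
```

With a threshold that is too stringent for the data (for example after heavy subsampling), the same call raises a `ValueError` that names the cause, rather than an `IndexError` several calls later.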
Hello @DongzeHE ,
(1) I will get back to you on the availability of this bias model.
(2) I wouldn't recommend subsampling the fragments. How many fragments do you have in the original file? In your case, the cut-off became too stringent because of the subsampling, which wouldn't have happened otherwise.
Also, what are the specs of the machine you are using? Can you try this branch and tell me if it is any faster? - https://github.com/kundajelab/chrombpnet/tree/suragnair-patch-1
There are many issues with subsampling while using the same peaks and non-peaks, so I wouldn't recommend doing that, but let's look into why it is taking so many hours. If you can provide the information I asked for above, maybe I can help.
Hello @DongzeHE
You can use the 10x single-cell ATAC-seq bias model here - https://storage.googleapis.com/chrombpnet_data/input_files/bias_models/ATAC/scATAC_dermal_fibroblast.h5
Hi, @panushri25
I am wondering whether the bias model you provided above is specific to a certain species, such as mouse?
Or are bias models applicable across species (e.g., hg38 and mm10) for 10x scATAC-seq data? I ask because I would like to try the chrombpnet pipeline on 10x scATAC-seq data from human samples.
The tutorials say that the bias models correct not just the intrinsic Tn5 sequence preference but also other ATAC-seq technical effects such as PCR bias (GC content). This raises the question: what is the optimal way to train bias models, and is it possible to create a universal set of background sequences?
In some ways, what you do here is similar to the CellBender model for scRNA background abundance correction. CellBender uses data from 10x droplets that don't contain cells to derive the free-floating background RNA profile. That bias must be estimated for every single 10x reaction/inlet.
Here you use DNA sequences likely to contain no biological information to correct any differences observed in such non-peak regions. Intuitively, the strength of PCR bias [and possibly other detection issues] can vary between 10x reactions.
Should bias models similarly be trained for every 10x reaction? What do you think in general?
Bias models often transfer across experiments but also we find many cases where they need to be retrained when there are strong experiment specific effects.
Anshul
Is it possible to create a universal set of background sequences or should the background sequences similarly be batch/reaction-specific (peak calling per reaction)?
Hello @vitkl
So the requirement is that the background sequences should match the GC content of the representative peaks. As long as you can sample such a background, the bias model should transfer.
We did try to train a universal bias model across several cell types. What ended up happening was that removing all the peak regions pooled across these varied cell types removed a large part of the potentially GC-rich background. The bias model then ended up learning an AT-rich bias and failed to transfer.
So it depends on what universal means here: if all your reactions have many overlapping peaks, eliminating these regions will not eliminate a large part of the background with similar GC content, so you will still have a high chance of finding a background with a good GC match for your peaks.
Hope this helps!
Thank you, Anu
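The GC-matching requirement described above can be illustrated with a small sketch. It is not taken from ChromBPNet: the function names, the number of GC bins, and the toy sequences are all assumptions. It picks, for each peak, one unused background sequence from the same GC-content bin and reports how many peaks found no match.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of seq (0.0 if none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_bin(seq, n_bins):
    """Map a sequence to an integer GC bin in [0, n_bins - 1]."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def sample_gc_matched(peak_seqs, background_seqs, n_bins=50, seed=0):
    """Return (indices of chosen background sequences, number of unmatched peaks)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, seq in enumerate(background_seqs):
        bins[gc_bin(seq, n_bins)].append(idx)
    for candidates in bins.values():
        rng.shuffle(candidates)

    chosen, unmatched = [], 0
    for seq in peak_seqs:
        candidates = bins.get(gc_bin(seq, n_bins))
        if candidates:
            chosen.append(candidates.pop())  # each background region is used once
        else:
            unmatched += 1  # no background left with a similar GC content
    return chosen, unmatched


# Toy sequences (made up): two GC-rich peaks, one AT-rich peak.
peaks = ["GGCC", "ATAT", "GGGG"]
background = ["AAAA", "GCGC", "ATGC"]
chosen, unmatched = sample_gc_matched(peaks, background)
print(chosen, unmatched)
```

In this toy case the second GC-rich peak has no GC-rich background left. If pooling peaks across many cell types removes most of the GC-rich background, the unmatched count grows in the same way, which is the failure mode described for the universal bias model.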
@jasondanic
The link provided is for a bias model trained on pseudo-bulked single-cell data from human dermal fibroblasts.
Hi @panushri25
Thanks for explaining. I am looking at cell atlases of embryonic development that cover all tissues/cell lineages at a given developmental stage. For example, the mouse gastrulation atlas https://www.biorxiv.org/content/10.1101/2022.06.15.496239v2.abstract contains ~30 major populations and ~70 fine-grained, regionally distinct subtypes, including multiple populations of the following lineages: mesoderm (incl. heart), endoderm (incl. hematopoietic lineage), ectoderm, neuronal and early developmental (epiblast, other E7.5 cells). Do you suggest defining a different background model for every broad cell lineage (eg ectoderm), every population (eg forebrain neuron progenitors) or every population * 10x batch?
@vitkl Sorry it took me a while to get back to this. I see that you opened a new issue, is your question answered?
Please feel free to reopen this issue if you have more questions.
Do you suggest defining a different background model for every broad cell lineage (eg ectoderm), every population (eg forebrain neuron progenitors) or every population * 10x batch?
This question is still open. How should the baseline model be defined - based on aggregated fragments file for all experiments or a different model per cell type? Should the negative regions be similarly defined as identical for all cell types (eg peak calling based on all experiments) or per cell type?
Based on your responses, I don't fully understand what the optimal strategy is here.
You are eventually interested in studying the TF dynamics per cell type (i.e., you will train a ChromBPNet model for every cell type), correct?
You are eventually interested in studying the TF dynamics per cell type
Yes exactly, also in comparing cell types similarly to your reprogramming work. The question is what would be the correct bias models.
If all the cell types come from the same single-cell experiment, or from a collection of experiments done by the same person with very similar protocols for sample prep, library prep, and sequencing, a single bias model trained on a deeply sequenced cell type will usually generalize very well.
So that is the first thing I would try. You can easily test if the bias correction is working as expected using the diagnostic strategies in the ChromBPNet tutorials and reports.
For cell types/samples where it fails, you can train a custom bias model from that set.
This is quite instructive, thanks for clarifying!
Hi,
Yup it's me again ;P.
I am processing a series of 10x single-cell ATAC-seq datasets from the same tissue but with different genotypes.
Regarding ChromBPNet bias model training, would you mind if I ask a few questions that I am uncertain about?
Thanks in advance!
Best, Dongze