Open ys3508 opened 2 weeks ago
To add more background to my question: for footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract the reads overlapping differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity), with a minimum overlap of 10%.
Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.
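To make the peak-level filtering step concrete, here is a minimal Python sketch of the overlap rule, assuming the 10% threshold is measured as a fraction of each peak's length against a single differential region (which is how the `-f` option of `bedtools intersect` treats the A file). The function name `filter_peaks_by_overlap` and the toy coordinates are hypothetical and only illustrate the rule; they are not the commands actually run.

```python
from collections import defaultdict


def filter_peaks_by_overlap(peaks, de_regions, min_frac=0.10):
    """Keep peaks for which one DE region covers at least min_frac of the peak.

    Both inputs are lists of (chrom, start, end) tuples using 0-based,
    half-open coordinates, as in BED/narrowPeak files.
    """
    # Group DE regions by chromosome so each peak is only compared
    # against regions on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in de_regions:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in peaks:
        length = end - start
        if length <= 0:
            continue  # skip malformed intervals
        for de_start, de_end in by_chrom.get(chrom, []):
            overlap = min(end, de_end) - max(start, de_start)
            if overlap > 0 and overlap / length >= min_frac:
                kept.append((chrom, start, end))
                break  # report each peak at most once
    return kept


# Toy data (hypothetical coordinates).
peaks = [
    ("chr1", 100, 300),    # 30 bp of 200 bp overlap (15%): kept
    ("chr1", 1000, 1200),  # 50 bp of 200 bp overlap (25%): kept
    ("chr1", 2000, 2200),  # 10 bp of 200 bp overlap (5%): dropped
    ("chr2", 50, 250),     # no DE region on chr2: dropped
]
de_regions = [("chr1", 270, 500), ("chr1", 1150, 1400), ("chr1", 2190, 2500)]

de_only_peaks = filter_peaks_by_overlap(peaks, de_regions)
print(de_only_peaks)
```

Because the filter only ever removes peaks, the DE-only set can be much smaller than the original peak set, which matters for any downstream step that needs a minimum number of regions.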
Here is the workflow (screenshot): https://github.com/user-attachments/assets/a93799c0-095e-4dba-a908-47403891b63c
Based on this workflow, I can think of two possible reasons for the issues encountered:
1. The DE-only narrowPeak files and the non-peak regions required for bias model training might contain too few regions, leading to an empty data frame.
2. The DE-only narrowPeak files are not generated directly from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately to the DE-only BAM files.
Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files (screenshot): https://github.com/user-attachments/assets/31b296bb-8b3e-4942-bcee-4385fb2fb9cb
Please train the bias model as we have instructed. There is no need to train it on DE peaks only, or with DE peaks excluded. You can focus the ChromBPNet model training on whatever peaks you want, but the bias model should be trained as instructed.
Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training? Thank you for your quick response!
Yes, correct.
I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?
Thank you, and I hope you have a great Labor Day weekend!
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in __init__
self.crop_revcomp_data()
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data
self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio)
File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data
nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False)
File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
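The `ValueError` above is raised by NumPy whenever `np.random.choice` is asked, with `replace=False`, for more items than the population contains. The sketch below reproduces that behaviour and adds a pre-flight check. It assumes the requested number of non-peak samples is roughly `negative_sampling_ratio` times the number of peaks; that formula is a guess based on the arguments visible in the traceback, not something verified against the ChromBPNet source. The helper names `requested_nonpeaks` and `can_subsample` are hypothetical.

```python
import numpy as np


def requested_nonpeaks(num_peaks, negative_sampling_ratio):
    # ASSUMPTION: sample size = ratio * number of peaks.
    # The actual computation inside ChromBPNet may differ.
    return int(negative_sampling_ratio * num_peaks)


def can_subsample(num_nonpeaks, num_peaks, negative_sampling_ratio):
    # Sampling without replacement only works if the request
    # does not exceed the number of available non-peak regions.
    return requested_nonpeaks(num_peaks, negative_sampling_ratio) <= num_nonpeaks


# Reproduce the NumPy error with toy numbers: 5 non-peaks available, 8 requested.
rng = np.random.RandomState(0)
try:
    rng.choice(5, size=8, replace=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(can_subsample(num_nonpeaks=5, num_peaks=16, negative_sampling_ratio=0.5))
print(can_subsample(num_nonpeaks=8, num_peaks=16, negative_sampling_ratio=0.5))
```

Under that assumption, the error would mean the non-peak set for the split being sampled is smaller than the number of non-peak samples requested, which is plausible when the background file has only a few hundred regions.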
I also tried using DE-only BAM files, while keeping the narrowpeaks/background input (non-peak regions) derived from all peaks (peak calling on the original BAM files). However, I encountered the following error:
Traceback (most recent call last):
File "/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
sys.exit(main())
File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
pipelines.chrombpnet_train_pipeline(args)
File "/chrombpnet/chrombpnet/pipelines.py", line 37, in chrombpnet_train_pipeline
find_chrombpnet_hyperparams.main(args_copy)
File "/chrombpnet/chrombpnet/helpers/hyperparameters/find_chrombpnet_hyperparams.py", line 138, in main
assert(counts_loss_weight != 0)
AssertionError
Bias model training appears to work correctly with the original BAM files and narrowPeak/background input (without specifically extracting DE peaks), and I can use that model for bias-factorized modeling with the original BAM files and narrowPeak/background input. I only get errors when I use the DE-only narrowPeak/background files together with the DE-only BAM files.
Therefore, what would be good practice for DE-only footprinting with ChromBPNet?
Thanks,
Best, Yeque
I should mention that after extracting DE peaks from the BAM and narrowPeak files, I have an extremely small number of regions in the narrowPeak files and the non-peak region files. I believe this is the reason for the error.
DE-only BAM files | 253842 | 124847 | 1548469 | 1074761 | 536346 | 308790
-- | -- | -- | -- | -- | -- | --
DE-only narrowPeak files | 255 | 253 | 1199 | 1170 | 930 | 774
Input background files (non-peak regions) (fold 1) | 442 | 438 | 2092 | 2041 | 1619 | 1358
I am working with mouse ATAC-seq data and have written a function to run bias model training across all folds and comparison groups. While testing the function with fold 1 and the group fed_vs_fasted_0hr, I encountered the following error (details are attached at the end): IndexError: list index out of range
Debugging Steps: I checked the number of unique chromosomes in the input BAM file, narrowPeak files, and the background file (non-peak regions):
Input BAM file:
narrowPeak files:
Background file (non-peak regions):
fold_1.json:
{
  "test": ["chr1", "chr2", "chr3", "chr4"],
  "valid": ["chr5", "chr6", "chr7", "chr8"],
  "train": ["chrX", "chr10", "chr14", "chr9", "chr11", "chr13", "chr12", "chr15", "chr16", "chr17", "chrY", "chr18", "chr19"]
}
Could you please help identify the cause of this error and suggest a possible solution?
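One thing worth ruling out with an error like this is a split that ends up with no regions at all. The following is a minimal sketch of such a check, not a confirmed diagnosis: it counts how many BED/narrowPeak lines fall into each split of the fold file and how many sit on chromosomes that no split lists. The function name `count_regions_per_split` and the three example lines are hypothetical; the fold definition is the one from fold_1.json.

```python
import json
from collections import Counter


def count_regions_per_split(fold, bed_lines):
    """Count BED-like lines per split; also count lines on unlisted chromosomes."""
    # Invert the fold definition: chromosome name -> split name.
    chrom_to_split = {
        chrom: split for split, chroms in fold.items() for chrom in chroms
    }
    counts = Counter({split: 0 for split in fold})
    unassigned = 0
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom = line.split()[0]
        split = chrom_to_split.get(chrom)
        if split is None:
            unassigned += 1
        else:
            counts[split] += 1
    return dict(counts), unassigned


fold = json.loads("""
{
  "test": ["chr1", "chr2", "chr3", "chr4"],
  "valid": ["chr5", "chr6", "chr7", "chr8"],
  "train": ["chrX", "chr10", "chr14", "chr9", "chr11", "chr13", "chr12",
            "chr15", "chr16", "chr17", "chrY", "chr18", "chr19"]
}
""")

# Hypothetical example lines; in practice these would be read from
# the narrowPeak file or the non-peak background file.
bed_lines = [
    "chr1\t100\t300",
    "chr9\t500\t700",
    "chrM\t10\t200",
]

per_split, unassigned = count_regions_per_split(fold, bed_lines)
print(per_split, unassigned)
```

In this toy input the "valid" split is empty and one region lies on a chromosome that is not in the fold file; either situation in the real inputs would be a reasonable place to start debugging.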
Thank you!
Error: