kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)
https://github.com/kundajelab/chrombpnet/wiki
MIT License
112 stars 30 forks source link

Error Encountered During Bias Model Training for ATAC-seq Data #205

Open ys3508 opened 2 weeks ago

ys3508 commented 2 weeks ago

I am working on ATAC-seq mice data and have developed a function to run bias model training across all folds and comparison groups. While testing the function with fold 1 and the group fed_vs_fasted_0hr, I encountered the following error (I attach the details in the end): IndexError: list index out of range

Debugging Steps: I checked the number of unique chromosomes in the input BAM file, narrowPeak files, and the background file (non-peak regions):

Input BAM file:

samtools view 0hr_Fasted_trimmomatic_intersect_DE_sort.bam | cut -f 3 | sort | uniq | wc -l Result: 20

narrowPeak files:

cut -f1 0hr_Fasted_trimmomatic_peaks_no_blacklist.narrowPeak | sort | uniq | wc -l Result: 20

Background file (non-peak regions):

cut -f1 0hr_Fasted_trimmomatic_fold_1_negatives.bed | sort | uniq | wc -l Result: 21

fold_1.json: { "test": [ "chr1", "chr2", "chr3", "chr4" ], "valid": [ "chr5", "chr6", "chr7", "chr8" ], "train": [ "chrX", "chr10", "chr14", "chr9", "chr11", "chr13", "chr12", "chr15", "chr16", "chr17", "chrY", "chr18", "chr19" ] } Could you please help identify the cause of this error and suggest a possible solution?

Thank you!

Error:

  • 2024-08-30 11:56:30.285465: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
  • To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  • 2024-08-30 11:56:30.841725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43250 MB memory: -> device: 0, name: NVIDIA L40S, pci bus id: 0000:4a:00.0, compute capability: 8.9
  • /.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  • return _methods._mean(a, axis=axis, dtype=dtype,
  • /.local/lib/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  • ret = ret.dtype.type(ret / rcount)
  • Estimating enzyme shift in input file
  • Current estimated shift: +4/-5
  • awk -v OFS="\t" '{if ($6=="+"){print $1,$2+0,$3,$4,$5,$6} else if ($6=="-") {print $1,$2,$3+1,$4,$5,$6}}' | sort -k1,1 | bedtools genomecov -bg -5 -i stdin -g /chrombpnet/mm10.chrom.subset.sizes | LC_COLLATE="C" sort -k1,1 -k2,2n
  • Making BedGraph (Filter chromosomes not in reference fasta)
  • Making Bigwig
  • non zero bigwig entries in the given chromosome: 9774
  • evaluating hyperparameters on the following chromosomes ['chrX', 'chr10', 'chr14', 'chr9', 'chr11', 'chr13', 'chr12', 'chr15', 'chr16', 'chr17', 'chrY', 'chr18', 'chr19', 'chr5', 'chr6', 'chr7', 'chr8']
  • Number of non peaks input: 1786
  • Number of non peaks filtered because the input/output is on the edge: 17
  • Number of non peaks being used: 1769
  • Number of non peaks input: 306
  • Number of non peaks filtered because the input/output is on the edge: 0
  • Number of non peaks being used: 306
  • Number of peaks input: 893
  • Number of peaks filtered because the input/output is on the edge: 0
  • Number of peaks being used: 893
  • Number of peaks input: 306
  • Number of peaks filtered because the input/output is on the edge: 0
  • Number of peaks being used: 306
  • Upper bound counts cut-off for bias model training: 195.408
  • Number of nonpeaks after the upper-bount cut-off: 1768
  • Number of nonpeaks after applying upper-bound cut-off and removing outliers : 0
  • counts_loss_weight: nan
  • {'counts_loss_weight': 'nan', 'filters': '128', 'n_dil_layers': '4', 'inputlen': '2114', 'outputlen': '1000', 'max_jitter': '0', 'chr_fold_path': '/chrombpnet/splits/fold_1.json', 'negative_sampling_ratio': '1.0'}
  • params:
  • filters:128
  • n_dil_layers:4
  • conv1_kernel_size:21
  • profile_kernel_size:75
  • counts_loss_weight:nan
  • got the model
  • loading nonpeaks...
  • got split:train for bed regions:(0, 10)
  • Traceback (most recent call last):
  • File "/ihome/crc/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in
  • sys.exit(main())
  • File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 38, in main
  • pipelines.train_bias_pipeline(args)
  • File "/chrombpnet/chrombpnet/pipelines.py", line 311, in train_bias_pipeline
  • train.main(args_copy)
  • File "/chrombpnet/chrombpnet/training/train.py", line 87, in main
  • train_generator = initializers.initialize_generators(args, "train", parameters, return_coords=False)
  • File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
  • generator=batchgen_generator.ChromBPNetBatchGenerator(
  • File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in init
  • peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
  • train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
  • seq = get_seq(peaks_df, genome, input_width)
  • File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
  • return one_hot.dna_to_one_hot(vals)
  • File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
  • seq_len = len(seqs[0])
  • IndexError: list index out of range
ys3508 commented 2 weeks ago

To add more background to my question. For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%.

Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.

Here is the workflow:

Screen Shot 2024-08-30 at 1 21 43 PM

Here's your text with some adjustments for clarity:

Based on this workflow, I have a few possible reasons for the issues encountered:

The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame. The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files. Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files:

Screen Shot 2024-08-30 at 1 29 12 PM
akundaje commented 2 weeks ago

Please train the bias model as we have instructed. There is no need to train it based on DE peaks or their exclusion only. You can focus model training on whatever peaks u want. Bias model should be trained as we have instructed.

On Fri, Aug 30, 2024 at 10:29 AM ys3508 @.***> wrote:

To add more background to my question. For the purpose of footprinting, rather than identifying TF motifs using all peaks, I want to focus specifically on differential peaks identified through differential analysis (using DiffBind). To achieve this, I used bedtools intersect to extract differential peaks from the alignment BAM files (let's call these the DE-only BAM files for clarity). I set a minimum overlap of 10%.

Initially, I used the DE-only BAM files for peak calling with MACS2. However, when I checked the resulting narrowPeak files, I noticed they had extremely high scores, q-values, and p-values, and the bias model training failed. To address this, I used bedtools intersect again, this time to extract differential peaks from the narrowPeak files (which were generated from peak calling on the original alignment BAM files), creating what I'll refer to as the DE-only narrowPeak files.

Here is the workflow: Screen.Shot.2024-08-30.at.1.21.43.PM.png (view on web) https://github.com/user-attachments/assets/a93799c0-095e-4dba-a908-47403891b63c

Here's your text with some adjustments for clarity:

Based on this workflow, I have a few possible reasons for the issues encountered:

The DE-only narrowPeak files and the non-peak regions required for bias model training might be insufficient, leading to an empty data frame. The DE-only narrowPeak files are not directly generated from peak calling on the DE-only BAM files. Instead, I manually extracted the DE peaks from the original narrowPeak files. As a result, the DE-only narrowPeak files might not correspond accurately with the DE-only BAM files. Below is a table showing the number of peaks in the input files for both the DE-only BAM files and the DE-only narrowPeak files: Screen.Shot.2024-08-30.at.1.29.12.PM.png (view on web) https://github.com/user-attachments/assets/31b296bb-8b3e-4942-bcee-4385fb2fb9cb

— Reply to this email directly, view it on GitHub https://github.com/kundajelab/chrombpnet/issues/205#issuecomment-2322029785, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDWEIOKALMXYCQ6YQ34YTZUCTYZAVCNFSM6AAAAABNMZAZPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGAZDSNZYGU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ys3508 commented 2 weeks ago

Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training? Thank you for your quick response!

akundaje commented 2 weeks ago

Yes correct

On Fri, Aug 30, 2024 at 10:54 AM ys3508 @.***> wrote:

Are you suggesting that I train the bias model using the original BAM file and narrowPeak files, and then apply this bias model to the DE-only narrowPeak and BAM files for bias-factorized ChromBPNet model training? Thank you for your quick response!

— Reply to this email directly, view it on GitHub https://github.com/kundajelab/chrombpnet/issues/205#issuecomment-2322064947, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDWELMBXDPOPYK2CVWPJLZUCWUZAVCNFSM6AAAAABNMZAZPOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGA3DIOJUG4 . You are receiving this because you commented.Message ID: @.***>

ys3508 commented 1 week ago

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in init self.crop_revcomp_data() File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio) File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False) File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice ValueError: Cannot take a larger sample than population when 'replace=False'

ys3508 commented 1 week ago

I also tried using DE-only BAM files, while keeping the narrowpeaks/background input (non-peak regions) derived from all peaks (peak calling on the original BAM files). However, I encountered the following error:

Traceback (most recent call last):
  File "/install/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 37, in chrombpnet_train_pipeline
    find_chrombpnet_hyperparams.main(args_copy)
  File "/chrombpnet/chrombpnet/helpers/hyperparameters/find_chrombpnet_hyperparams.py", line 138, in main
    assert(counts_loss_weight != 0)
AssertionError

The bias model training seems to be correct using the original BAM files and narrowpeaks/background input (without specifically extracting for DE peaks). I can proceed with using this model for bias-free modeling for original BAM files and narrowpeaks/background input. Only when I used DE-only narrowPeak/background and DE-only BAM files, I had errors.

Therefore, what will be a good practice for DE-only footprinting using ChromBpNet?

Thanks,

Best, Yeque

ys3508 commented 1 week ago

I need to mention, that after extracting DE peaks from BAM file and NarrowPeak file, I have extreme small amount of peaks in the narrowpeak file and non-peak region files. I believe this is the reason why I had the following error.

<html xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

DE-only BAM files | 253842 | 124847 | 1548469 | 1074761 | 536346 | 308790 -- | -- | -- | -- | -- | -- | -- DE-only narrowpeak files | 255 | 253 | 1199 | 1170 | 930 | 774 Input background files (non-peaks regions) (fold 1) | 442 | 438 | 2092 | 2041 | 1619 | 1358

I implemented the advice you provided, but I encountered a different issue when training the bias-factorized ChromBPNet model. Could you offer any insights into what might be causing this problem?

Thank you, and I hope you have a great Labor Day weekend!

> File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 51, in **init** self.crop_revcomp_data() File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 68, in crop_revcomp_data self.sampled_nonpeak_seqs, self.sampled_nonpeak_cts, self.sampled_nonpeak_coords = subsample_nonpeak_data(self.nonpeak_seqs, self.nonpeak_cts, self.nonpeak_coords, len(self.peak_seqs), self.negative_sampling_ratio) File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 15, in subsample_nonpeak_data nonpeak_indices_to_keep = np.random.choice(len(nonpeak_seqs), size=num_nonpeak_samples, replace=False) File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice ValueError: Cannot take a larger sample than population when 'replace=False'

ys3508 commented 1 week ago

Hey, I am trying to decrease NEGATIVE_SAMPLING_RATIO to 0.01 and 0.001 by argument -sr to solve the error. I still want to know if my DE-only BAM/NarrowPeak files have very few peaks (since my DE peaks are few), what is the best practice and choice for me to run ChromBpNet?

Thank you!

Best, Yeque

ys3508 commented 1 day ago

Dear ChromBPNet Team,

I discovered that the error may have been caused by having too few peaks in the input. To address this, I made the following changes:

  1. Reduced the negative sample ratio from the default 0.1 to 0.00001.
  2. Used p-values instead of adjusted p-values in the DE analysis to increase the number of DE peaks. This adjustment has resulted in more peaks for each sample when running ChromBPNet_model_training (see table below).

Despite these changes, I am now encountering a new error. Could you please provide guidance on how to resolve it? Additionally, could you help me understand the possible reasons behind this error?

Thank you for your assistance.

Best regards, Yeque

<html xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

DE-only BAM files | 5870159 | 3316261 | 7996392 | 5812338 | 5270957 | 4592235 -- | -- | -- | -- | -- | -- | -- DE-only narrowpeak files | 10295 | 9990 | 17597 | 17178 | 14901 | 14372 Input background files (non-peaks regions) (fold 1) | 17943 | 17402 | 30705 | 29957 | 25972 | 25075

Here is the error I encountered now:

Traceback (most recent call last):
  File "/chrombpnet/0.1.7/bin/chrombpnet", line 8, in <module>
    sys.exit(main())
  File "/chrombpnet/chrombpnet/CHROMBPNET.py", line 23, in main
    pipelines.chrombpnet_train_pipeline(args)
  File "/chrombpnet/chrombpnet/pipelines.py", line 73, in chrombpnet_train_pipeline
    train.main(args_copy)
  File "/chrombpnet/chrombpnet/training/train.py", line 88, in main
    valid_generator = initializers.initialize_generators(args, "valid", parameters, return_coords=False)
  File "/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 87, in load_data
    train_nonpeaks_seqs, train_nonpeaks_cts, train_nonpeaks_coords = get_seq_cts_coords(nonpeak_regions,
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/chrombpnet/chrombpnet/training/utils/one_hot.py", line 18, in dna_to_one_hot
    seq_len = len(seqs[0])
IndexError: list index out of range