kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)
https://github.com/kundajelab/chrombpnet/wiki
MIT License

error in training own bias model #143

Closed: monikaheinzl closed this issue 1 year ago

monikaheinzl commented 1 year ago

Hi,

I'm trying to train my own bias model on my ATAC-seq data with the chrombpnet bias pipeline command. Beforehand, I followed the preprocessing steps described in your documentation. Training the bias model itself seemed to work (see output below), since the model was saved to disk, but the pipeline then failed at a later step (see log below). It would be great if you could help me out here.

Many thanks, Monika

log:

Traceback (most recent call last):
  File "/.local/bin/chrombpnet", line 33, in <module>
    sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/CHROMBPNET.py", line 38, in main
    pipelines.train_bias_pipeline(args)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/pipelines.py", line 328, in train_bias_pipeline
    predict.main(args_copy)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/predict.py", line 105, in main
    test_generator = initializers.initialize_generators(args, mode="test", parameters=None, return_coords=True)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 80, in initialize_generators
    generator=batchgen_generator.ChromBPNetBatchGenerator(
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/data_generators/batchgen_generator.py", line 36, in __init__
    peak_seqs, peak_cts, peak_coords, nonpeak_seqs, nonpeak_cts, nonpeak_coords, = data_utils.load_data(peak_regions, nonpeak_regions, genome_fasta, cts_bw_file, inputlen, outputlen, max_jitter)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/utils/data_utils.py", line 79, in load_data
    train_peaks_seqs, train_peaks_cts, train_peaks_coords = get_seq_cts_coords(bed_regions,
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/utils/data_utils.py", line 50, in get_seq_cts_coords
    seq = get_seq(peaks_df, genome, input_width)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/utils/data_utils.py", line 18, in get_seq
    return one_hot.dna_to_one_hot(vals)
  File "/.conda/envs/chrombpnet/chrombpnet/chrombpnet/training/utils/one_hot.py", line 19, in dna_to_one_hot
    assert np.all(np.array([len(s) for s in seqs]) == seq_len)
AssertionError
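For context, the assertion above fires when the sequences handed to the one-hot encoder do not all have the same length. Below is a minimal, simplified sketch of that check (the function body is reduced for illustration, and the suggestion that an input window truncated at a chromosome edge is what produces the short sequence is an assumption, not confirmed in this thread):

```python
import numpy as np

def dna_to_one_hot(seqs):
    """Simplified sketch of the check in one_hot.py: every input
    sequence must have the same length before one-hot encoding."""
    seq_len = len(seqs[0])
    # This is the assertion from the traceback. A 2114 bp input window
    # that runs past a chromosome end yields a truncated sequence from
    # the FASTA, so the lengths no longer match (assumed cause).
    assert np.all(np.array([len(s) for s in seqs]) == seq_len)
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seqs), seq_len, 4), dtype=np.int8)
    for i, s in enumerate(seqs):
        for j, base in enumerate(s.upper()):
            if base in mapping:  # N and other bases stay all-zero
                out[i, j, mapping[base]] = 1
    return out

# A batch of equal-length windows encodes fine...
ok = dna_to_one_hot(["ACGT" * 4])
# ...but one truncated window in the batch trips the assertion.
try:
    dna_to_one_hot(["ACGT" * 4, "ACG"])
except AssertionError:
    print("unequal sequence lengths")
```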

output:

Estimating enzyme shift in input file
Current estimated shift: +0/+0
Making BedGraph
Making Bigwig
non zero bigwig entries in the given chromosome:  8652683
evaluating hyperparameters on the following chromosomes ['chr3R', 'chr3L', 'chrX', 'chr2RHet', 'chr3LHet', 'chr3RHet', 'chr4', 'chr2LHet', 'chrYHet', 'chrXHet', 'chr2L']
Number of non peaks input:  22546
Number of non peaks filtered because the input/output is on the edge:  0
Number of non peaks being used:  22546
Number of non peaks input:  2438
Number of non peaks filtered because the input/output is on the edge:  0
Number of non peaks being used:  2438
Number of peaks input:  11275
Number of peaks filtered because the input/output is on the edge:  2
Number of peaks being used:  11273
Upper bound counts cut-off for bias model training:  347.86
Number of nonpeaks after the upper-bount cut-off:  2937
Number of nonpeaks after applying upper-bound cut-off and removing outliers :  1944
counts_loss_weight: 21.0
{'counts_loss_weight': '21.0', 'filters': '128', 'n_dil_layers': '4', 'inputlen': '2114', 'outputlen': '1000', 'max_jitter': '0', 'chr_fold_path': 'data/splits/fold_0.json', 'negative_sampling_ratio': '1.0'}
params:
filters:128
n_dil_layers:4
conv1_kernel_size:21
profile_kernel_size:75
counts_loss_weight:21.0
got the model
loading nonpeaks...
got split:train for bed regions:(1764, 10)
loading nonpeaks...
got split:valid for bed regions:(180, 10)
Epoch 1/50
 1/28 [>.............................] - ETA: 3:50 - loss: 1081.2889 - logits_profile_predictions_loss: 540.8846 - logcount_predictions_loss: 25.7335
...
Epoch 15: val_loss did not improve from 341.58417
Restoring model weights from the end of the best epoch: 10.
28/28 [==============================] - 206s 7s/step - loss: 417.5620 - logits_profile_predictions_loss: 381.1227 - logcount_predictions_loss: 1.7352 - val_loss: 352.3586 - val_logits_profile_predictions_loss: 269.3966 - val_logcount_predictions_loss: 3.9506
Epoch 15: early stopping
save model
got the model
loading peaks...
got split:test for bed regions:(2439, 10)
loading nonpeaks...
got split:test for bed regions:(2438, 10)
panushri25 commented 1 year ago

Hello @monikaheinzl, Looking into this...will get back to you if I need anything else from you.

monikaheinzl commented 1 year ago

Hi @panushri25, have you been able to find out what the problem might be here?

panushri25 commented 1 year ago

Hello @monikaheinzl, yes I have a patch for this, will release today.

I have a quick question - why do you have so few peaks? Which model organism are you training the model on?

monikaheinzl commented 1 year ago

Hi, awesome, many thanks! I have fewer peaks because I'm training on Drosophila data. Do you think the low number of peaks will be a problem for ChromBPNet?

panushri25 commented 1 year ago

No it shouldn't be. But I just wanted to make sure you are calling relaxed peaks as detailed in the tutorial here - https://github.com/kundajelab/chrombpnet/wiki/Preprocessing#peak-calling
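For readers landing here: the linked wiki page describes relaxed peak calling with MACS2. A hedged sketch of an ENCODE-style relaxed call is below; the specific flag values and the Drosophila genome size (`-g dm`) are assumptions for illustration, so follow the linked tutorial for the exact command:

```shell
# Relaxed MACS2 peak call (illustrative sketch, not the canonical
# chrombpnet command -- consult the wiki page linked above).
# -p 0.01                 : relaxed p-value cutoff instead of a strict q-value
# --shift -75 --extsize 150 : center 150 bp windows on Tn5 cut sites
# -g dm                   : effective genome size for Drosophila (assumed here)
macs2 callpeak \
    -t sample.tagAlign.gz -f BED -g dm \
    -p 0.01 --shift -75 --extsize 150 --nomodel \
    --keep-dup all --call-summits -B --SPMR \
    -n sample_relaxed
```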

panushri25 commented 1 year ago

Hello,

I updated the code to fix this, can you test this out and let me know?

You can test it out in two ways -

(1) Install from test PyPI. I recommend a fresh conda environment, deleted after testing, so it does not conflict with the original PyPI release.

python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ chrombpnet==0.2.23

(2) Install from GitHub and test the code. The changes are updated in the GitHub repo. Let me know how this works for you.

Thank you, Anu

panushri25 commented 1 year ago

Closing this due to inactivity. Please reopen if anything comes up.