kundajelab / basepairmodels


[suggestion] Allow the user to give a counts-loss-weight value in input_data.json #15

Open mmtrebuchet opened 3 years ago

mmtrebuchet commented 3 years ago

When training on two data sets with wildly different coverages, I'd like to be able to give each a separate counts loss weight. If I don't, the data sets end up with different relative weightings between the profile and counts losses.

As I see it, there are two ways to do this, and both involve removing the counts-loss-weight flag from train and adding that information to input_data.json.

  1. Include a counts_loss_weight term on each task in input_data.json, and weight the counts-vs-profile losses for each track accordingly.
  2. Include a counts_loss_alpha term on each task in input_data.json, and run counts_loss_weight on the specified track with the given --alpha value. This way, tracks with different coverages can still be trained at the same alpha (or at different alphas, if the user wants that for some reason).

Of course, you could implement both of these; in that case, you'd have to supply exactly one of the two for each track. A rough sketch of what that could look like is below.
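To make the proposal concrete, here's a sketch of how the training code might resolve a per-task weight. Nothing here is the existing basepairmodels schema or API: the counts_loss_weight / counts_loss_alpha keys are the proposed additions, the other input_data.json fields are elided, and compute_weight_from_alpha stands in for whatever the counts_loss_weight tool does internally.

```python
# Hypothetical example only -- the real input_data.json schema is defined by
# basepairmodels; "counts_loss_weight" / "counts_loss_alpha" are the proposed
# new per-task keys, not existing fields.
example_input_data = {
    "task_0": {
        # ... existing per-task fields (signal/peaks paths, etc.) ...
        "counts_loss_weight": 42.0,     # option 1: explicit weight
    },
    "task_1": {
        # ... existing per-task fields ...
        "counts_loss_alpha": 1.0,       # option 2: alpha; weight derived from coverage
    },
}


def resolve_counts_loss_weight(task_config, compute_weight_from_alpha):
    """Return the counts loss weight for one task.

    Exactly one of 'counts_loss_weight' or 'counts_loss_alpha' must be present
    (the "supply exactly one of the two for each track" rule above).
    `compute_weight_from_alpha` is a placeholder for the coverage-aware
    calculation that the counts_loss_weight tool performs.
    """
    has_weight = "counts_loss_weight" in task_config
    has_alpha = "counts_loss_alpha" in task_config
    if has_weight == has_alpha:
        raise ValueError(
            "Supply exactly one of counts_loss_weight / counts_loss_alpha per task"
        )
    if has_weight:
        return task_config["counts_loss_weight"]
    return compute_weight_from_alpha(task_config["counts_loss_alpha"])
```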

This could be accomplished by weighting the input data before training, but for ease of use, it'd be nifty to just specify the alpha for each task and have the code deal with it automatically.

Second idea: allow the user to weight the different tracks relative to each other. For example, a track with high coverage will have larger loss values, and this can dominate the training loss of a lower-coverage track. So include a parameter (and a corresponding feature in the counts_loss_weight program) that weights the total loss of each track either

  1. in absolute terms, so total_loss = loss[task_1] * 0.1 + loss[task_2] * 0.9, or
  2. with an alpha-like parameter that accounts for the expected difference in profile loss values based on the coverage, so total_loss = loss[task_1] * get_loss_weight(task_1_coverage, alpha_task_1) + loss[task_2] * get_loss_weight(task_2_coverage, alpha_task_2).

Again, the user could scale the input data before training, but this would simplify things for biologists who don't want to manipulate their files by multiplying by magic numbers.
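A minimal sketch of the two weighting schemes, again with placeholder names: get_loss_weight is the hypothetical coverage-and-alpha-aware helper mentioned above, and the per-task loss values would really come from the model's profile/counts losses.

```python
# Sketch only -- none of these names are basepairmodels API.

def combine_losses_absolute(per_task_loss, weights):
    """Scheme 1: absolute weights, e.g. {'task_1': 0.1, 'task_2': 0.9}."""
    return sum(weights[task] * loss for task, loss in per_task_loss.items())


def combine_losses_alpha(per_task_loss, coverage, alpha, get_loss_weight):
    """Scheme 2: weights derived from each track's coverage and its alpha."""
    return sum(
        get_loss_weight(coverage[task], alpha[task]) * loss
        for task, loss in per_task_loss.items()
    )


# Toy usage with made-up numbers, just to show the shape of the calls.
if __name__ == "__main__":
    per_task_loss = {"task_1": 3.2, "task_2": 0.4}
    print(combine_losses_absolute(per_task_loss, {"task_1": 0.1, "task_2": 0.9}))

    # Placeholder weight function: down-weight high-coverage tracks so they
    # don't dominate the total loss.
    def get_loss_weight(track_coverage, track_alpha):
        return track_alpha / max(track_coverage, 1.0)

    coverage = {"task_1": 50.0, "task_2": 5.0}
    alpha = {"task_1": 1.0, "task_2": 1.0}
    print(combine_losses_alpha(per_task_loss, coverage, alpha, get_loss_weight))
```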

Your thoughts?