DNA interpretation: input shape, and multiple inputs to model where only one is sequence #4

ofiryaish commented 3 years ago

Hello,

I am trying to run the tool with my own model and data and found it hard to follow all the tool options.

I have a model that receives a list of two inputs: the first is the DNA sequence, of shape (n, l, 4), and the second is of shape (n, f), where n, l, and f are the number of samples, the sequence length, and the number of additional features, respectively. I'm trying to find the negative and positive attribution scores of the DNA sequence samples. From my last question, I understand that I need to use nll_mode='minimization' or 'maximization'. When trying to apply this, I have some questions about the shape of the scrambler's input, and about passing multiple inputs of different shapes where only one of them is the sequence.

1) Regarding the input shape: I saw in the example code that the input has shape (n, 1, l, 4). Must it be in this shape? As I explained, my model only accepts inputs of shape (n, l, 4). I can modify my model to accept (n, 1, l, 4), but the question is whether it is possible to modify the scrambler arguments so that it accepts shapes of (n, l, 4). If the scrambler does accept shapes of (n, l, 4), can you please explain how I should modify the arguments here:

#Initialize scrambler
scrambler = Scrambler(
    scrambler_mode='inclusion',
    input_size_x=1,
    input_size_y=l,
    n_out_channels=4,
    input_templates=[onehot_template],
    input_backgrounds=[x_mean],
    batch_size=32,
    n_samples=32,
    sample_mode='st',
    zeropad_input=False,
    mask_dropout=False,
    network_config=network_config
)

2) Now, for the more complicated issue: as I explained, my model has multiple inputs, of which only one is a DNA sequence. I'm interested only in the importance scores of the sequence (I don't mind also getting importance scores for the other input, as we do with Integrated Gradients and other tools, but they are not important). I saw that you have a protein example that supports multiple inputs; however, the inputs there have the same shape, so I could not adapt the example to my case. In addition, I see that you have an argument called multi_input_mode, which might solve the issue when set to something other than 'siamese', but I don't understand what other options it has (I also looked at the code, and I think this argument does not expect any value other than 'siamese'). So, my question is whether it is possible to pass multiple inputs to the scrambler when the inputs don't have the same shape (and only one input's importance scores matter). If it is feasible, I would like to know how.

Thank you for your time, Ofir

johli commented 3 years ago

Dear Ofir, thank you for your questions. I'll try and answer them below:

1) The scrambler expects the DNA sequence input in the format (n, 1, l, 4), because the scrambler model uses 2D convolutions internally. I plan on making a code change to switch to 1D convolutions when receiving an input in the format (n, l, 4). This is not implemented yet, but I plan on coding it up sometime this week or the next. So for now, if you want to use the method, you need to change your predictor to accept sequences of the form (n, 1, l, 4).
2) The current version of the code does not accept additional "non-scrambled" inputs to the predictor the way you describe. It does sound like an important and common thing to allow, though, and I don't think it's a particularly complicated code change, so I will try to add support for it this week.
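
For the shape requirement in point 1, the data-side conversion is a single axis insertion. A minimal NumPy sketch, assuming x stands in for a one-hot array of shape (n, l, 4); the array names and sizes are made up for illustration, and the predictor itself would still need a matching reshape or squeeze at its input:

```python
import numpy as np

# Hypothetical sizes: n sequences of length l, one-hot over 4 nucleotides.
n, l = 8, 50
rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=(n, l))
x = np.eye(4, dtype=np.float32)[idx]      # shape (n, l, 4)

# Insert a singleton "height" axis so that 2D convolutions can be applied.
x_4d = np.expand_dims(x, axis=1)          # shape (n, 1, l, 4)

# The inverse step, e.g. at the input of a predictor built on 1D convolutions.
x_3d = np.squeeze(x_4d, axis=1)           # shape (n, l, 4)
```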

I will let you know in this thread when these changes are implemented and available on github.

Best,

Johannes

ofiryaish commented 3 years ago

Indeed, it is a very important feature, as we have many "side features" with different shapes, and for some of them the interpretation is less important than for others.

About the input shape, it is not really a problem; the 3D shape is just more convenient. Usually, when dealing with DNA sequences, I apply 1D convolutions.

Thank you again, and I'm looking forward to the updates.

johli commented 3 years ago

Dear Ofir,

I have just pushed a set of updates to the repo, including the following changes that you requested:

Support for additional predictor inputs. These are passed with the 'extra_input_train' and 'extra_input_test' arguments in the scrambler.train function.
Support for 1d sequence tensors, e.g. shape = (n, l, 4)
Support for custom loss functions, where you for example can index a specific regressor output.

I made a new example notebook to illustrate how to use these three features: https://github.com/johli/scrambler/blob/master/examples/dna/scrambler_apa_example_custom_loss.ipynb

In this example I maximize RNA pA 3' Cleavage at different positions in a sequence, using a predictor model that has 4 different outputs. In your case, since you want to minimize your regressors, you should remove the negative sign from the "maximize_cleavage_logodds" custom loss function. The custom loss functions expect two arguments: (1) the non-scrambled (original) predictor output and (2) the scrambled predictor outputs. These two arguments will be passed to the loss function from within the scrambler.train code.
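
The sign convention described above can be illustrated in plain NumPy. This is only a sketch of the arithmetic, not the notebook's maximize_cleavage_logodds: the name make_logodds_loss, the output index, and the assumption that the outputs are probabilities stored on the last axis are all made up for illustration, and a loss used in actual training would need to be written with differentiable Keras/TensorFlow operations instead.

```python
import numpy as np

def make_logodds_loss(output_index, maximize=True, eps=1e-7):
    """Build a loss with the (original output, scrambled output) signature."""
    sign = -1.0 if maximize else 1.0

    def loss(pred_original, pred_scrambled):
        # Only the scrambled prediction enters this particular loss;
        # pred_original is accepted to match the two-argument signature.
        p = np.clip(pred_scrambled[..., output_index], eps, 1.0 - eps)
        logodds = np.log(p / (1.0 - p))
        # Minimizing -logodds maximizes the output; without the sign it is minimized.
        return sign * np.mean(logodds)

    return loss

# Toy predictor outputs: one sample, four outputs (hypothetical values).
pred_original = np.array([[0.1, 0.6, 0.2, 0.1]])
pred_scrambled = np.array([[0.1, 0.8, 0.05, 0.05]])

loss_max = make_logodds_loss(output_index=1, maximize=True)
loss_min = make_logodds_loss(output_index=1, maximize=False)
print(loss_max(pred_original, pred_scrambled), loss_min(pred_original, pred_scrambled))
```

The two losses differ only in sign, which is the change suggested above for switching from maximizing to minimizing a regressor output.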

Note: The extra predictor inputs will only be passed to the predictor, not the scrambler itself. This can potentially make the scrambler produce lower-quality interpretations if there is a lot of variation in these extra inputs (if they explain a lot of the prediction). If these features vary a lot in your data, I would suggest trying to set the flag "'label_input' : True" in the network_config dictionary. This will send the y_train values as additional input to the Scrambler, which will help it to discount variation outside the sequence inputs.
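
A sketch of the bookkeeping this implies, assuming NumPy arrays with made-up sizes. Only the 'label_input' key and the extra_input_train / extra_input_test argument names come from this thread; the rest of network_config and the remaining arguments of scrambler.train are left out here and should be taken from the example notebook.

```python
import numpy as np

# Hypothetical sizes: sample counts, sequence length l, f side features.
n_train, n_test, l, f = 64, 16, 50, 3
rng = np.random.default_rng(0)

def random_onehot(n):
    """Random one-hot sequences of shape (n, l, 4), used as placeholder data."""
    return np.eye(4, dtype=np.float32)[rng.integers(0, 4, size=(n, l))]

x_train, x_test = random_onehot(n_train), random_onehot(n_test)
extra_train = rng.normal(size=(n_train, f)).astype(np.float32)
extra_test = rng.normal(size=(n_test, f)).astype(np.float32)

# The extra inputs must stay row-aligned with the sequences they belong to.
assert extra_train.shape[0] == x_train.shape[0]
assert extra_test.shape[0] == x_test.shape[0]

network_config = {
    # ... keep the architecture settings from your existing config here ...
    'label_input': True,  # flag named in this thread
}

# These arrays would then be handed to scrambler.train through its
# extra_input_train and extra_input_test arguments.
```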

Let me know if you come across any issues!

Best,

Johannes

ofiryaish commented 3 years ago

Hey Johannes,

Sorry for the late response.

First, thank you for adding the new features. I will try to check them and update you if I have issues.
