Regarding the output of the example for Basenji

jeffmylife commented 6 years ago

So I've successfully ran the Basenji on the example_files/ but still have some comments/questions before I do this for my data.

First, the output file Basenji.example_pred.tsv has some complications. It cannot be "headed" easily or looped through easily. screen shot 2018-06-01 at 4 22 41 pm After transposing the data, you get:
screen shot 2018-06-01 at 4 26 01 pm which is much easier to work with. Additionally, the values for every interval supplied from interveral.bed are equal. It seems redundant to have these chromosomal regions as examples if they predict the same thing. Is there any reason for this?

Second, notice that the data in the above photo is ambiguous. "pred/n/m " doesn't tell me much. I now know that 'n' ranges from [0,960] and 'm' ranges from [0,4229]. m * n = 4059840, or the number of output values of the model. I see that m corresponds to sequencing reads from the various experiments that the model predicts. However, the mapping of the number (0 through 4229) in the file to the identifier in Supplementary Table 1 is not clear since the there is no numeral ID; I've presumed 0 corresponds to the first data point, though.

Most importantly, what I do not understand is the meaning of 'n'. The corresponding article has no reference to the number 960. Given that the model's goal is to predict experimental data for regions of the genome, I presume that 'n' corresponds to binned locations of the input sequence. However, the bin size used is supposed to be 128bp (2^7) and the input sequence length is 131,072bp (2^17). 131,072bp / 128bp = 1,024 bins per input sequence This implies that 'n' (the numbers 0 to 960) is not the bins because it would mean that 131,072bp / 960 bins = ~136.5 bp per bin
which I don't think is allowed. So what is the meaning of the output in Basenji.example_pred.tsv ?

Any guidance would be helpful. I am aware that this question might better be directed at the Basenji github, but I thought I'd try here first. Thanks

Avsecz commented 6 years ago

It cannot be "headed" easily or looped through easily.

I suggest you to use hdf5 files to store the predictions (just specify -o myfile.h5 instead of -o myfile.tsv. You can use kipoi.readers.HDF5Reader.load("file.h5") to load it into python. There you get the nice arrays. The .tsv file output is meant more for simpler models predicting a single value.

the values for every interval supplied from interveral.bed are equal. It seems redundant to have these chromosomal regions as examples if they predict the same thing.

We just keep copying the same small bed file for the models for simplicity. It doesn't have any particular meaning. It's there to make sure things run. You have to create your own bed file to score particular regions. For basenji I guess it's best to create intervals that slightly overlap.

So what is the meaning of the output in Basenji.example_pred.tsv?

The output array for basenji is: (960, 4229) (as written in the schema filed of model.yaml)

m: 4229 is the number of different output tracks (I think it goes by row numbers in the Supplementary table 1 as you noted)
n: 960 is the number of bins along the sequence

The caveat here is the basenji uses dilated convolutions with valid padding (I think), hence the input sequence has to be be wider than the output. So basically you have

4096bp starting flank sequence
122880bp core sequence (960 * 128), predicted by the model in 128 bins
4096bp end flank sequence

For getting the nice tissue types and assay type, you should open an issue on the basenji repo.

I would strongly appreciate if you could submit a PR with an update to the schema doc field in the model.yaml file for Basenji: https://github.com/kipoi/models/blob/master/Basenji/model.yaml#L51 describing the details of the 4229 features after you figure it out. You can put in the link to the Supplementary table 1. You can use the multi-line syntax of yaml:

schema:
  inputs:
    name: seq
    special_type: DNASeq
    shape: (131072, 4)
    doc: > 
      one-hot encoded DNA sequence. 
      - 4096bp starting flank sequence
      - 122880bp core sequence (960 * 128), predicted by the model in 128 bins
       - 4096bp end flank sequence
        ....
    associated_metadata: ranges
  targets:
    name: genomic_features
    shape: (960, 4229)
    doc: > 
     Specific Basenji output features.
     link
     4229 is the number of different output tracks (I think it goes by row numbers in the Supplementary table 1 as you noted)
    960 is the number of bins along the sequence

     ....

jeffmylife commented 6 years ago

All of that makes sense, thank you. As for confirmation from the Basenji repo, I opened an issue and will update the mentioned model.yaml as soon as I get a reply.

Also, I've reached this

NotImplementedError: Don't know how to deal with multi-dimensional model target 
ArraySchema(shape=(960, 4229), doc='Specific Basenji output features.', 
name='genomic_features', special_type=None, associated_metadata=[], column_labels=None)
in /anaconda3/lib/python3.6/sitepackages/kipoi/postprocessing/variant_effects/utils/dna_reshapers.py in get_column_names(arrayschema_obj)

and was thinking that it might be easier to deal with the 960 bins individually so that you don't have to recode the functions within dna_reshapers.py. Perhaps you could let the problem trickle up to whatever calls dna_reshapers, but I'm not sure on the details. Any suggestions on how to do the variant effect predictions with basenji for now?

Thanks again

Avsecz commented 6 years ago

@krrome can you help with the var_effect module?

krrome commented 6 years ago

Hi, thanks for bringing that forward. The variant effect prediction pipeline can at the moment only handle 1D model output arrays in order to produce sensible VCF file annotation. The 1D model output array is then implicitly treated as a multi-task model.

Most of the machinery of variant effect prediction will still work fine with a model like Basenji, but as you pointed out, if one wants to get a 1D annotation of a variant this transformation nD->1D has to be defined somewhere and cannot be set fixed and generically for all models. There are two solutions to that problem:

1) Design a new Basenji/variantEffects model that performs said conversion of its outputs down to the 4229 output tasks (If I understand Basenji correctly, but also I have never worked with it) 2) Add the possibility to define a transformation function nD -> 1D for the model output that will always be called/executed on model outputs. We can then define a set of "useful" nD -> 1D output conversions that can be used with these kinds of models.

If you are in a rush to get the predictions I would recommend performing "model surgery" by adding the desired transformation to 1D as a final layer, e.g. max pooling or average in one of the two dimensions, or what you think make sense for your variant effects. I will create a development issue for solution 2). I hope this helps and thank you for your input and ideas.

jeffmylife commented 6 years ago

Good ideas. I'll doctor something up in the meantime using your second suggestion.

Thank you for helping

Avsecz commented 6 years ago

@krrome do we allow any summarization function? What do you think about the idea to allow the user to specify their own numpy function (e.g. "np.sum(x, axis=-1)")?

krrome commented 6 years ago

@Avsecz exactly, that's what I meant - a function that reduces model output dimensionality to 1D. I move this discussion to: https://github.com/kipoi/kipoi/issues/282

kipoi / models

Regarding the output of the example for Basenji #87