jmschrei / tfmodisco-lite

A lite implementation of tfmodisco, a motif discovery algorithm for genomics experiments.
MIT License

Does tfmodisco-lite work with sequences that contain N? #10

Closed by coffeebond 1 year ago

coffeebond commented 1 year ago

Hi,

I've been trying to run tfmodisco-lite, but I get this error: IndexError: cannot do a non-empty take from an empty axes. The error seems to originate in _laplacian_null, at np.percentile(a=pos_values, q=percentiles_to_use)-mu. Neither of my two input NumPy arrays contains any NaN, and the axes seem correct: (batch, 4, sequence length).

I realize that my sequences are not all the same length, so they have been padded with Ns on the 5' ends. The N nucleotide is encoded as (0,0,0,0). Do you know if this could be the reason for the error?

Thank you!
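For anyone checking their own input for the same situation, here is a minimal sketch that counts positions where all four channels are zero. It assumes the (batch, 4, sequence length) layout described above; the function name count_all_zero_positions and the toy array are hypothetical and not part of tfmodisco-lite.

```python
import numpy as np

def count_all_zero_positions(one_hot):
    """Count, per sequence, the positions where all four channels are zero.

    Assumes one_hot has shape (batch, 4, length), as described in the post.
    """
    # Summing over the channel axis gives 0 exactly where N was encoded as (0,0,0,0).
    return (one_hot.sum(axis=1) == 0).sum(axis=1)

# Toy input: two sequences of length 5; the first is padded with two Ns on the 5' end.
one_hot = np.zeros((2, 4, 5), dtype=np.float32)
one_hot[0, [0, 1, 2], [2, 3, 4]] = 1            # N N A C G
one_hot[1, [3, 3, 0, 1, 2], np.arange(5)] = 1   # T T A C G

print(count_all_zero_positions(one_hot))  # [2 0]
```

A non-zero count for any sequence means the one-hot array contains N-style padding, even though it contains no NaN.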

jmschrei commented 1 year ago

Yes, this might be an issue but I'm not positive. If the corresponding attribution score is near zero (because you're just padding) then it doesn't really matter what nucleotide you put in. Would you mind just assigning a random nucleotide to those padding positions and seeing if it runs?
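One way to try this suggestion is sketched below. It assumes a one-hot array of shape (batch, 4, sequence length) in which padding is all-zero across the four channels; the helper name fill_padding_with_random_bases is hypothetical and not part of tfmodisco-lite.

```python
import numpy as np

def fill_padding_with_random_bases(one_hot, seed=0):
    """Return a copy of one_hot with every all-zero position set to a random base.

    Assumes shape (batch, 4, length); positions that already have a base are untouched.
    """
    rng = np.random.default_rng(seed)
    filled = one_hot.copy()
    # Locate (sequence, position) pairs where no channel is set.
    batch_idx, pos_idx = np.nonzero(filled.sum(axis=1) == 0)
    # Draw one random channel (0..3) for each padded position.
    random_channels = rng.integers(0, 4, size=batch_idx.shape[0])
    filled[batch_idx, random_channels, pos_idx] = 1
    return filled

# Toy input: one sequence of length 4 with two padded positions on the 5' end.
padded = np.zeros((1, 4, 4), dtype=np.float32)
padded[0, [0, 2], [2, 3]] = 1   # N N A G

filled = fill_padding_with_random_bases(padded)
print(filled.sum(axis=1))  # every position now sums to 1
```

The attribution array is left as-is here, on the reasoning above that scores at padded positions should be near zero anyway.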

coffeebond commented 1 year ago

Thanks for the suggestion. The program does indeed require at least one channel to be non-zero at every position. I think the issue stems from this warning: modiscolite/util.py:76: RuntimeWarning: invalid value encountered in divide, raised at ppm = ppm/np.sum(ppm, axis=1)[:,None]. I took your advice and randomly assigned nucleotides to the N positions, and the program ran.
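The warning itself is easy to reproduce outside the library. The sketch below applies the same normalization expression to a toy matrix; the (positions, 4) shape is an assumption inferred from the axis=1 in the quoted line, and the values are made up. It does not show how the NaN later leads to the IndexError.

```python
import numpy as np

# Toy count matrix, assumed shape (positions, 4).
# The first row is all zeros, as it would be for a position covered only by Ns.
ppm = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 1.0, 1.0, 0.0],
])

# Same expression as the line quoted from modiscolite/util.py.
# 0/0 is what triggers "invalid value encountered in divide";
# errstate only silences the warning here, the result is still NaN.
with np.errstate(invalid="ignore"):
    normalized = ppm / np.sum(ppm, axis=1)[:, None]

print(normalized)
```

The all-zero row comes out as NaN while the other row normalizes to 0.5, 0.25, 0.25, 0, which is consistent with the requirement that every position have at least one non-zero channel.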