philippdre / omniCLIP

omniCLIP is a CLIP-Seq peak caller
GNU General Public License v3.0

numpy error on test data #3

Closed fgypas closed 4 years ago

fgypas commented 5 years ago

Hi

I am trying to run omniclip but I get an error regarding numpy when I run the test example (https://github.com/philippdre/omniCLIP#examples). Below is the log:

/usr/local/lib/python2.7/dist-packages/h5py/_hl/dataset.py:313: H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
  "Use dataset[()] instead.", H5pyDeprecationWarning)
/opt/omniCLIP/data_parsing/tools.py:858: RuntimeWarning: invalid value encountered in double_scalars
  med = ((temp_med_floor * tot_floor) + (temp_med_ceil * tot_ceil)) / (tot_floor + tot_ceil)
/usr/local/lib/python2.7/dist-packages/scipy/sparse/linalg/dsolve/linsolve.py:253: SparseEfficiencyWarning: splu requires CSC matrix format
  warn('splu requires CSC matrix format', SparseEfficiencyWarning)
/usr/local/lib/python2.7/dist-packages/scipy/optimize/_minimize.py:600: RuntimeWarning: Method 'bounded' does not support relative tolerance in x; defaulting to absolute tolerance.
  "defaulting to absolute tolerance.", RuntimeWarning)
Namespace(bg_collapsed=False, bg_libs=['example_data/RZ_rep1_chr1.bam', 'example_data/RZ_rep2_chr1.bam'], bg_type='Coverage_bck', diag_bg=False, diag_event_mod='DirchMultK', emp_var=False, fg_collapsed=True, fg_libs=['example_data/PUM2_rep1_chr1.bam', 'example_data/PUM2_rep2_chr1.bam'], fg_pen=0.0, filter_snps=False, gene_anno_file='example_data/gencode.v19.annotation.chr1.gtf.db', gene_sample=100000, genome_dir='example_data/hg37/', glm_weight=-1.0, ign_GLM=False, ign_diag=False, ign_out_rds=False, mask_flank_variants=3, mask_miRNA=False, mask_ovrlp=True, max_it=20, max_it_glm=10, max_mm=2, nb_proc=1, norm_class=False, nr_mix_comp=1, only_coverage=False, only_pred=False, out_dir='example_data', overwrite_bg=True, overwrite_fg=True, pred_sites=False, pseudo_count=None, pv_cutoff=0.05, restart_from_file=False, rev_strand=None, rnd_seed=None, safe_tmp=False, skip_diag_event_mdl=False, snps_min_cov=10, snps_thresh=0.2, subs=True, thresh=None, tmp_dir=None, tol_lg_lik=10000.0, tr_type='binary', use_precomp_diagmod=None, verbosity=0)
Loading gene annotation
Memory usage: 93564 (kb)
Loading reads
Parsing the gene annotation
Processing chr1
Saving results
Parsing the gene annotation Processing chr1 Saving results Masking overlapping positions Removing genes without CLIP coverage Done: Elapsed time: 1127.059973 Memory usage: 1154820 (kb) Initialising the parameters Iteration: 0 Computing most likely path

Done: Elapsed time: 186.006131887 Memory usage: 3340920 (kb) Fitting emission parameters Memory usage: 3340920 (kb) Fitting emission parameters Estimating expression parameters Memory usage: 3340920 (kb) Start estimation of expression parameters Constructing GLM matrix Estimating expression parameters: before GLMMatrix Memory usage: 3340920 (kb) Estimating expression parameters: after GLMMatrix Memory usage: 3340920 (kb) Done: Elapsed time: 16.7295198441 Estimating expression parameters: before GLMMatrix Memory usage: 3340920 (kb) Fitting GLM Estimating expression parameters: before fitting Memory usage: 3340920 (kb) [[-5.86262797] [-1.39794241] [-6.40704993]] Dispersion 4.73991326495 1323538.14147 [[-4.99527274] [-1.55260304] [-5.91156312]] Dispersion 4.75639033771 2897.6338165 [[-4.99320863] [-1.55298395] [-5.91074454]] Dispersion 4.75646887385 13.784992224 Estimating expression parameters: afer fitting Memory usage: 3340920 (kb) Estimating expression parameters: afer cleanup Memory usage: 3340920 (kb) Done: Elapsed time: 35.1242051125 Finishes expression parameter estimation Memory usage: 3340920 (kb) computing sufficient statitics for fitting md Memory usage: 3340920 (kb) Getting suffcient statistic Done: Elapsed time: 151.128564119 Memory usage: 3479096 (kb) fitting md distribution Memory usage: 3479096 (kb) Estimating state 0 Estimating state 1 Estimating state 2 Estimating state 3 Memory usage: 3479096 (kb) Done: Elapsed time: 209.871788979 Memory usage: 3479096 (kb) Fitting transistion parameters Memory usage: 3479096 (kb) Fitting transistion parameters Memory usage: 3479096 (kb) Learning transistion model Iterating over genes Fitting transistion parameters: I Memory usage: 3479096 (kb) .Fitting transistion parameters: II Memory usage: 3479096 (kb) Fitting transistion parameters: III Memory usage: 4153612 (kb) Fitting transistion parameters: IV Memory usage: 4733776 (kb) Done: Elapsed time: 189.882477045 Fitting transistion parameters: V Memory usage: 
4733776 (kb) Memory usage: 4733776 (kb) Memory usage: 4733776 (kb) Computing most likely path Memory usage: 4733776 (kb) Computing most likely path

Done: Elapsed time: 340.688632011 Memory usage: 4733776 (kb) LogLik: -276419958.813 Log-likelihood: -276419958.813 [-276419958.81327856] Iteration: 1 Fitting emission parameters Memory usage: 4733776 (kb) Fitting emission parameters Estimating expression parameters Memory usage: 4733776 (kb) Start estimation of expression parameters Constructing GLM matrix Estimating expression parameters: before GLMMatrix Memory usage: 4733776 (kb) Estimating expression parameters: after GLMMatrix Memory usage: 4733776 (kb) Done: Elapsed time: 18.5351040363 Estimating expression parameters: before GLMMatrix Memory usage: 4733776 (kb) Fitting GLM Estimating expression parameters: before fitting Memory usage: 4733776 (kb) [[ -5.60605638] [ -1.35371957] [-10.97159246]] Dispersion 1.53997308875 3284102.57242 [[ -5.60416194] [ -1.50858191] [-10.96700079]] Dispersion 1.43349552661 249546.407159 [[ -5.60728083] [ -1.52066257] [-10.96685927]] Dispersion 1.42854404723 12200.5147202 Estimating expression parameters: afer fitting Memory usage: 4733776 (kb) Estimating expression parameters: afer cleanup Memory usage: 4733776 (kb) Done: Elapsed time: 52.1908521652 Finishes expression parameter estimation Memory usage: 4733776 (kb) computing sufficient statitics for fitting md Memory usage: 4733776 (kb) Getting suffcient statistic Done: Elapsed time: 147.203412056 Memory usage: 4733776 (kb) fitting md distribution Memory usage: 4733776 (kb) Estimating state 0 Estimating state 1 Estimating state 2 Estimating state 3 Memory usage: 4733776 (kb) Done: Elapsed time: 223.586141825 Memory usage: 4733776 (kb) Fitting transistion parameters Memory usage: 4733776 (kb) Fitting transistion parameters Memory usage: 4733776 (kb) Learning transistion model Iterating over genes Fitting transistion parameters: I Memory usage: 4733776 (kb) .Fitting transistion parameters: II Memory usage: 4733776 (kb) Fitting transistion parameters: III Memory usage: 4733776 (kb) Fitting transistion parameters: IV Memory usage: 
5037056 (kb) Done: Elapsed time: 193.760185003 Fitting transistion parameters: V Memory usage: 5037056 (kb) Memory usage: 5037056 (kb) Memory usage: 5037056 (kb) Computing most likely path Memory usage: 5037056 (kb) Computing most likely path

Done: Elapsed time: 322.033174038 Memory usage: 5037056 (kb) LogLik: -197388911.42 Log-likelihood: -197388911.42 [-276419958.81327856, -197388911.41977647] Iteration: 2 Fitting emission parameters Memory usage: 5037056 (kb) Fitting emission parameters Estimating expression parameters Memory usage: 5037056 (kb) Start estimation of expression parameters Constructing GLM matrix Estimating expression parameters: before GLMMatrix Memory usage: 5037056 (kb) Estimating expression parameters: after GLMMatrix Memory usage: 5037056 (kb) Done: Elapsed time: 18.5401170254 Estimating expression parameters: before GLMMatrix Memory usage: 5037056 (kb) Fitting GLM Estimating expression parameters: before fitting Memory usage: 5037056 (kb) [[ -5.82820056] [ -1.59726161] [-10.73771781]] Dispersion 1.18274382013 580832.293091 [[ -5.89526679] [ -1.62644377] [-10.736914 ]] Dispersion 1.17150958289 30115.415234 [[ -5.89871413] [ -1.62797896] [-10.73687714]] Dispersion 1.17097272347 1448.1699915 Estimating expression parameters: afer fitting Memory usage: 5037056 (kb) Estimating expression parameters: afer cleanup Memory usage: 5037056 (kb) Done: Elapsed time: 41.142747879 Finishes expression parameter estimation Memory usage: 5037056 (kb) computing sufficient statitics for fitting md Memory usage: 5037056 (kb) Getting suffcient statistic Done: Elapsed time: 114.98391819 Memory usage: 5037056 (kb) fitting md distribution Memory usage: 5037056 (kb) Estimating state 0 Estimating state 1 Estimating state 2 Estimating state 3 Memory usage: 5037056 (kb) Done: Elapsed time: 180.077622175 Memory usage: 5037056 (kb) Fitting transistion parameters Memory usage: 5037056 (kb) Fitting transistion parameters Memory usage: 5037056 (kb) Learning transistion model Iterating over genes Fitting transistion parameters: I Memory usage: 5037056 (kb) .Fitting transistion parameters: II Memory usage: 5037056 (kb) Fitting transistion parameters: III Memory usage: 5037056 (kb) Fitting transistion parameters: IV 
Memory usage: 5707136 (kb) Done: Elapsed time: 174.015991926 Fitting transistion parameters: V Memory usage: 5707136 (kb) Memory usage: 5707136 (kb) Memory usage: 5707136 (kb) Computing most likely path Memory usage: 5707136 (kb) Computing most likely path

Done: Elapsed time: 318.18231988 Memory usage: 5707136 (kb) LogLik: -196902982.375 Log-likelihood: -196902982.375 [-276419958.81327856, -197388911.41977647, -196902982.37539276] Iteration: 3 Fitting emission parameters Memory usage: 5707136 (kb) Fitting emission parameters Estimating expression parameters Memory usage: 5707136 (kb) Start estimation of expression parameters Constructing GLM matrix Estimating expression parameters: before GLMMatrix Memory usage: 5707136 (kb) Estimating expression parameters: after GLMMatrix Memory usage: 5707136 (kb) Done: Elapsed time: 18.618844986 Estimating expression parameters: before GLMMatrix Memory usage: 5707136 (kb) Fitting GLM Estimating expression parameters: before fitting Memory usage: 5707136 (kb) [[ -5.90353974] [ -1.77144729] [-10.59990291]] Dispersion 1.15870280174 33722.280353 [[ -5.9076541 ] [ -1.77311538] [-10.59985689]] Dispersion 1.15824039001 1279.57551063 [[ -5.90781012] [ -1.77317875] [-10.59985515]] Dispersion 1.15822292273 48.3475619276 Estimating expression parameters: afer fitting Memory usage: 5707136 (kb) Estimating expression parameters: afer cleanup Memory usage: 5707136 (kb) Done: Elapsed time: 29.4881711006 Finishes expression parameter estimation Memory usage: 5707136 (kb) computing sufficient statitics for fitting md Memory usage: 5707136 (kb) Getting suffcient statistic Done: Elapsed time: 114.442127228 Memory usage: 5707136 (kb) fitting md distribution Memory usage: 5707136 (kb) Estimating state 0 Estimating state 1 Estimating state 2 Estimating state 3 Memory usage: 5707136 (kb) Done: Elapsed time: 168.16003108 Memory usage: 5707136 (kb) Fitting transistion parameters Memory usage: 5707136 (kb) Fitting transistion parameters Memory usage: 5707136 (kb) Learning transistion model Iterating over genes Fitting transistion parameters: I Memory usage: 5707136 (kb) .Fitting transistion parameters: II Memory usage: 5707136 (kb) Fitting transistion parameters: III Memory usage: 5707136 (kb) Fitting 
transistion parameters: IV Memory usage: 5707136 (kb) Done: Elapsed time: 156.465799809 Fitting transistion parameters: V Memory usage: 5707136 (kb) Memory usage: 5707136 (kb) Memory usage: 5707136 (kb) Computing most likely path Memory usage: 5707136 (kb) Computing most likely path

Done: Elapsed time: 315.280578136 Memory usage: 5707136 (kb) LogLik: -196100507.253 Log-likelihood: -196100507.253 [-276419958.81327856, -197388911.41977647, -196902982.37539276, -196100507.25284377] Iteration: 4 Fitting emission parameters Memory usage: 5707136 (kb) Fitting emission parameters Estimating expression parameters Memory usage: 5707136 (kb) Start estimation of expression parameters Constructing GLM matrix Estimating expression parameters: before GLMMatrix Memory usage: 5707136 (kb) Estimating expression parameters: after GLMMatrix Memory usage: 5707136 (kb) Done: Elapsed time: 18.4652509689 Estimating expression parameters: before GLMMatrix Memory usage: 5707136 (kb) Fitting GLM Estimating expression parameters: before fitting Memory usage: 5707136 (kb) [[ -6.19819175] [ -2.05475657] [-10.53295815]] Dispersion 1.11540622662 125695.069891 [[ -6.21999346] [ -2.06097515] [-10.53278674]] Dispersion 1.11412036155 3865.32564373 [[ -6.22065613] [ -2.06116729] [-10.53278158]] Dispersion 1.11408137009 117.293383519 Estimating expression parameters: afer fitting Memory usage: 5707136 (kb) Estimating expression parameters: afer cleanup Memory usage: 5707136 (kb) Done: Elapsed time: 30.2455918789 Finishes expression parameter estimation Memory usage: 5707136 (kb) computing sufficient statitics for fitting md Memory usage: 5707136 (kb) Getting suffcient statistic Done: Elapsed time: 116.862193823 Memory usage: 5707136 (kb) fitting md distribution Memory usage: 5707136 (kb) Estimating state 0 Estimating state 1 Estimating state 2 Estimating state 3

Traceback (most recent call last):
  File "/opt/omniCLIP/omniCLIP.py", line 930, in <module>
    run_omniCLIP(args)
  File "/opt/omniCLIP/omniCLIP.py", line 324, in run_omniCLIP
    CurrLogLikelihood, IterParameters, First, Paths = PerformIteration(Sequences, Background, IterParameters, NrOfStates, First, Paths)
  File "/opt/omniCLIP/omniCLIP.py", line 608, in PerformIteration
    NewEmissionParameters = FitEmissionParameters(Sequences, Background, NewPaths, EmissionParameters, First)
  File "/opt/omniCLIP/omniCLIP.py", line 733, in FitEmissionParameters
    NewEmissionParameters = mixture_tools.em(Counts, NrOfCounts, NewEmissionParameters, x_0=OldAlpha, First=First)
  File "/opt/omniCLIP/stat/mixture_tools.py", line 66, in em
    alpha, mixtures = Parallel_estimate_mixture_params(OldEmissionParameters, curr_counts, curr_nr_of_counts, curr_state, rand_sample_size, max_nr_iter, nr_of_iter=20, stop_crit=1.0, nr_of_init=10)
  File "/opt/omniCLIP/stat/mixture_tools.py", line 272, in Parallel_estimate_mixture_params
    scored_counts = score_counts(curr_counts, curr_state, EmissionParameters)
  File "/opt/omniCLIP/stat/mixture_tools.py", line 428, in score_counts
    scored_counts[mix_comp, :] = diag_event_model.pred_log_lik(counts, state, EmissionParameters, single_mix=mix_comp)
  File "/opt/omniCLIP/stat/diag_event_model.py", line 76, in pred_log_lik
    Prob = FitBinoDirchEmmisionProbabilities.ComputeStateProbForGeneMD_unif_rep(counts, alpha[:, single_mix], state, EmissionParameters)
  File "/opt/omniCLIP/stat/FitBinoDirchEmmisionProbabilities.py", line 162, in ComputeStateProbForGeneMD_unif_rep
    Prob[IxZeros] = np.tile(RatioLikelihood[0, 0] , (1, np.sum(IxZeros)))
TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions

Do you have any idea what is wrong? I think it might be related to the dependencies. Could you please tell me the exact versions of the Python dependencies I should use? The README.md lists the following packages, but the constraints are not very specific. Does each version really have to be strictly greater (>), or is >= enough?

biopython (> v.1.68)
brewer2mpl (> v.1.4)
cython (> v.0.24.1)
gffutils (> v.0.8.7.1)
h5py (> v.2.6.0)
intervaltree (> v.2.1.0)
matplotlib (> v.1.5.3)
numpy (> v.1.11.3)
pandas (> v0.19.0)
prettyplotlib (> v.0.1.7)
pysam (> v.0.9.1.4)
scikit-learn (> v.0.18.1)
scipy (> v.0.19.0)
statsmodels (> v.0.6.1)

My pip freeze looks like the following:

argcomplete==1.10.0
argh==0.26.2
asn1crypto==0.24.0
biopython==1.68
brewer2mpl==1.4
cryptography==2.1.4
cycler==0.10.0
Cython==0.24.1
enum34==1.1.6
gffutils==0.8.7.1
h5py==2.9.0
idna==2.6
intervaltree==2.1.0
ipaddress==1.0.17
keyring==10.6.0
keyrings.alt==3.0
matplotlib==1.5.3
numpy==1.13.3
pandas==0.19.2
patsy==0.5.1
prettyplotlib==0.1.7
pycrypto==2.6.1
pyfaidx==0.5.5.2
pygobject==3.26.1
pyparsing==2.4.1
pysam==0.9.1.4
python-dateutil==2.8.0
pytz==2019.1
pyxdg==0.25
scikit-learn==0.18.1
scipy==0.19.0
SecretStorage==2.3.1
simplejson==3.16.0
six==1.11.0
sortedcontainers==2.1.0
statsmodels==0.6.1
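
To make the > versus >= question concrete, here is a minimal sketch that compares a subset of the installed versions against the README minimums, once strictly and once inclusively. The helper names `parse_version` and `failing_packages` are illustrative and not part of omniCLIP; the version strings are copied from the two lists above.

```python
def parse_version(text):
    """Turn '0.8.7.1' into (0, 8, 7, 1) so versions compare numerically.

    Only handles purely numeric, dot-separated versions of equal length,
    which is all this example needs.
    """
    return tuple(int(part) for part in text.split("."))


# Minimum versions from the README list (subset).
readme_minimums = {
    "biopython": "1.68",
    "h5py": "2.6.0",
    "numpy": "1.11.3",
    "pandas": "0.19.0",
    "scipy": "0.19.0",
}

# Installed versions from the pip freeze output (same subset).
installed = {
    "biopython": "1.68",
    "h5py": "2.9.0",
    "numpy": "1.13.3",
    "pandas": "0.19.2",
    "scipy": "0.19.0",
}


def failing_packages(installed, minimums, strict):
    """Return the sorted names of packages that violate the constraint."""
    failures = []
    for name in sorted(minimums):
        have = parse_version(installed[name])
        need = parse_version(minimums[name])
        satisfied = have > need if strict else have >= need
        if not satisfied:
            failures.append(name)
    return failures


strict_failures = failing_packages(installed, readme_minimums, strict=True)
inclusive_failures = failing_packages(installed, readme_minimums, strict=False)
print("fails '>':", strict_failures)
print("fails '>=':", inclusive_failures)
```

In this subset, the packages installed at exactly the listed minimum are the only ones that fail the strict reading, and every package passes the inclusive one.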

Do you maybe have a Dockerfile that I could use? This would be perfect.

Thank you in advance for your help

Kind regards Foivos

philippdre commented 5 years ago

Dear Foivos,

could you please tell me if the error is reproducible?

Best regards, Philipp

fgypas commented 5 years ago

Hi @philippdre

Thank you for your reply. What do you mean by reproducible? Do you mean that it sometimes works and sometimes does not?

Thank you in advance for your response

Best Foivos

philippdre commented 5 years ago

Model fitting in omniCLIP is stochastic. In some cases omniCLIP crashes if there exists a state that is not the most likely state for at least one nucleotide in the genome. In this case, rerunning omniCLIP helps.
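
Because the failure is stochastic, the rerun can be automated. The following is a minimal sketch and not part of omniCLIP: `run_with_retries` is an illustrative helper, and `flaky_fit` is a stand-in that simulates a run crashing on its first two attempts. In practice `run_once` would wrap whatever launches a full omniCLIP run.

```python
def run_with_retries(run_once, max_attempts=3, retry_on=(TypeError,)):
    """Call run_once(attempt) until it succeeds or max_attempts is reached.

    Only exceptions listed in retry_on trigger a rerun; anything else
    propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_once(attempt)
        except retry_on as exc:
            last_error = exc
            print("attempt %d failed: %s" % (attempt, exc))
    # Every attempt failed: surface the last error to the caller.
    raise last_error


def flaky_fit(attempt):
    """Hypothetical stand-in for one stochastic model-fitting run."""
    if attempt < 3:
        raise TypeError("simulated crash during fitting")
    return "converged on attempt %d" % attempt


result = run_with_retries(flaky_fit, max_attempts=5)
print(result)
```

Restricting `retry_on` to the exception type seen in the traceback keeps unrelated failures, such as a missing input file, from being retried silently.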

fgypas commented 5 years ago

Hi @philippdre

Thank you for the quick response. So, most probably this is the case, but I will do some further tests to validate this. Maybe it would be good to mention it in the README.md.

Best Foivos

philippdre commented 5 years ago

I will try to catch this error in the next release.

Best regards, Philipp

philippdre commented 4 years ago

This should be fixed in the current release.
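
For readers who hit the same message elsewhere: the failing line in the traceback assigns the result of `np.tile(..., (1, n))`, which has shape `(1, n)` and is therefore 2-dimensional, through a boolean mask. The sketch below reproduces that pattern in isolation and shows one way to avoid it by assigning the scalar directly. It is an illustration under assumed toy data, not the fix that was applied in omniCLIP; `fill_masked` and the array contents are hypothetical.

```python
import numpy as np


def fill_masked(prob, mask, value):
    """Return a copy of prob with a single scalar written at masked positions.

    A scalar broadcasts over the boolean selection, so np.tile is not needed.
    """
    out = prob.copy()
    out[mask] = value
    return out


# Toy stand-ins for Prob, IxZeros and RatioLikelihood from the traceback.
prob = np.zeros(5)
ix_zeros = np.array([True, False, True, False, True])
ratio_likelihood = np.array([[0.25]])

# Same pattern as the failing line: the tiled value is 2-D with shape (1, n).
tiled = np.tile(ratio_likelihood[0, 0], (1, np.sum(ix_zeros)))
print("tiled shape:", tiled.shape)

try:
    prob[ix_zeros] = tiled
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print("error:", error_message)

# Assigning the scalar (or a flattened 1-D array) avoids the problem.
fixed = fill_masked(prob, ix_zeros, ratio_likelihood[0, 0])
print(fixed)
```

Calling `tiled.ravel()` before the assignment would work as well, since boolean-mask assignment accepts 0- or 1-dimensional values.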