gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0

Issue while running vis_corex #15

Open prasanna224 opened 5 years ago

prasanna224 commented 5 years ago

When I run vis_corex.py on a file with the following arguments, I get an error after 24 hours of run time.

Command:

python3 vis_corex.py /home/ppandey/dx_desc.csv --delimiter="|" --layers=32,16,8,1 --dim_hidden=3 --missing=-1e6 -c -b -v -o dxm --ram=72 --cpu=36

Sample File:

DX101|DX110|DX115|DX118|DX142|DX143|DX155|DX160|DX166|DX169|DX175|DX184|DX196|DX212|DX215|DX218|DX222|DX223|DX234|DX235|DX239|DX253|DX254|DX267|DX271|DX275|DX277|DX278|DX279|DX295|DX298|DX310|DX315|DX332|DX335|DX342|DX343|DX344|DX356|DX385|DX386|DX399|DX404
8|0|1|6|0|0|0|0|0|0|0|0|5|0|3|0|0|6|0|453|0|0|0|2|0|0|6|0|0|0|9|4|6|0|0|1|1|0|9|0|0|41|81
0|4|0|0|0|4|1|0|53|0|0|2|0|0|1|0|0|0|0|0|0|4|0|0|0|0|3|0|0|0|0|0|11|0|4|0|0|0|0|0|7|0|0
0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
0|0|0|0|1|0|0|0|0|0|0|0|0|0|9|0|0|3|0|0|0|0|0|0|0|0|0|2|0|0|2|0|25|0|0|0|0|0|0|0|2|0|0

Output:

`[-0. -0. -0. 0. 0. -0. 0. -0. -0. 0. 0. -0. 0. -0. nan -0.]
[ 0. 0. -0. 0. -0. -0. 0. -0. 0. 0. -0. -0. 0. 0. nan -0.]
[ 0. 0. 0. 0. 0. -0. -0. 0. 0. -0. -0. 0. -0. 0. nan -0.]

Overall tc: nan

Traceback (most recent call last):
  File "vis_corex.py", line 777, in <module>
    n_cpu=options.cpu, ram=options.ram).fit(X_prev))
  File "/home/usr/bio_corex/corex.py", line 171, in fit
    self.fit_transform(X)
  File "/home/usr/bio_corex/corex.py", line 220, in fit_transform
    self.dict = best_dict
UnboundLocalError: local variable 'best_dict' referenced before assignment`

gregversteeg commented 5 years ago

Oh, that's disappointing. That error is caused by the "nan" in the output for TC (it's trying to find the best TC value, but "nan" is not comparable). If you put --verbose=2 you can see the TCs as you are running - then you might be able to see a nan arise earlier and stop the run. The question is, what causes the "nan"? Here are a few ideas to check for:

Not an issue, but you should add the option --no_row_names, since your first column is not an index.

Another possibility for your dataset is to "bin" the data and treat it as discrete. So, for instance, you might map 0 to 0, 1 to 1, and any number greater than 1 to 2. Then run without the -c option (-c treats the data as continuous).
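The binning suggested above could be sketched as a small preprocessing step, applied before the file is handed to vis_corex.py. This is only an illustration, not part of bio_corex: the helper names `bin_count` and `bin_row` are made up here, and the sketch assumes every field is a non-negative integer count with no missing-value sentinel present.

```python
def bin_count(value):
    """Map a non-negative count to 0, 1, or 2, where 2 means 'greater than 1'."""
    return min(int(value), 2)


def bin_row(line, delimiter="|"):
    """Bin every field of one delimited data row (hypothetical helper)."""
    return [bin_count(field) for field in line.split(delimiter)]


# First 13 fields of the first data row in the sample file above.
row = "8|0|1|6|0|0|0|0|0|0|0|0|5"
print(bin_row(row))  # [2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2]
```

The header line would need to be copied through unchanged, and the binned rows written back out with the same delimiter.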

gregversteeg commented 5 years ago

One other suggestion.

This looks like count data. I've always meant to include a specific handling of count data, but haven't yet. One thing that works well for count data is to transform each value to log_2(1+x). The 0's and 1's stay the same, but the long tail of high counts is compressed inward. This makes the numerical modeling easier by reducing outliers.
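A minimal sketch of that transform, assuming the counts are non-negative and that missing entries are marked with the same -1e6 sentinel passed to --missing in the command above. The function name `log_count` is hypothetical and not part of bio_corex. The sentinel is passed through untouched because the logarithm of a negative number is undefined.

```python
import math


def log_count(value, missing=-1e6):
    """Return log2(1 + value) for a count, leaving the missing sentinel as-is."""
    if value == missing:
        return missing  # do not take the log of the sentinel
    return math.log2(1 + value)


# 0 and 1 are unchanged; larger counts are pulled inward.
for count in (0, 1, 3, 453):
    print(count, log_count(count))
```

With this mapping a count of 453 drops to roughly 8.8, so a single large value no longer dominates the scale of its column.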

prasanna224 commented 5 years ago

Thanks for your quick response. We will try the suggestions you have outlined here.