Issue while running vis_corex

prasanna224 commented 5 years ago

When I run vis_corex.py on a file with the following arguments, I get an error after 24 hours of run time.

Command:

python3 vis_corex.py /home/ppandey/dx_desc.csv --delimiter="|" --layers=32,16,8,1 --dim_hidden=3 --missing=-1e6 -c -b -v -o dxm --ram=72 --cpu=36

Sample File:

DX101|DX110|DX115|DX118|DX142|DX143|DX155|DX160|DX166|DX169|DX175|DX184|DX196|DX212|DX215|DX218|DX222|DX223|DX234|DX235|DX239|DX253|DX254|DX267|DX271|DX275|DX277|DX278|DX279|DX295|DX298|DX310|DX315|DX332|DX335|DX342|DX343|DX344|DX356|DX385|DX386|DX399|DX404
8|0|1|6|0|0|0|0|0|0|0|0|5|0|3|0|0|6|0|453|0|0|0|2|0|0|6|0|0|0|9|4|6|0|0|1|1|0|9|0|0|41|81
0|4|0|0|0|4|1|0|53|0|0|2|0|0|1|0|0|0|0|0|0|4|0|0|0|0|3|0|0|0|0|0|11|0|4|0|0|0|0|0|7|0|0
0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
0|0|0|0|1|0|0|0|0|0|0|0|0|0|9|0|0|3|0|0|0|0|0|0|0|0|0|2|0|0|2|0|25|0|0|0|0|0|0|0|2|0|0

Output: `[-0. -0. -0. 0. 0. -0. 0. -0. -0. 0. 0. -0. 0. -0. nan -0.]
[ 0. 0. -0. 0. -0. -0. 0. -0. 0. 0. -0. -0. 0. 0. nan -0.]
[ 0. 0. 0. 0. 0. -0. -0. 0. 0. -0. -0. 0. -0. 0. nan -0.]

Overall tc: nan

Traceback (most recent call last):
  File "vis_corex.py", line 777, in <module>
    n_cpu=options.cpu, ram=options.ram).fit(X_prev))
  File "/home/usr/bio_corex/corex.py", line 171, in fit
    self.fit_transform(X)
  File "/home/usr/bio_corex/corex.py", line 220, in fit_transform
    self.dict = best_dict
UnboundLocalError: local variable 'best_dict' referenced before assignment`

gregversteeg commented 5 years ago

Oh, that's disappointing. That error is caused by the "nan" in the output for TC (it's trying to find the best TC value, but "nan" is not comparable). If you pass --verbose=2 you can see the TCs while it is running, so you might be able to spot a nan earlier and stop the run. The question is, what causes the "nan"? Here are a few things to check:

Are there any missing or non-numeric values in your data file? You can fill in missing or non-numeric entries with some sentinel value (I used -1e6) and then set --missing=-1e6. I would suggest first trying a small, simple model (--layers=1 or --layers=2) while checking for issues with nans.
Extreme outliers could cause numerical overflow and nans.
If you really only have ~40 variables, you should use smaller models. --layers=10,3,1 for instance. Then look at the TCs and try a larger model --layers=12,4,1. Do the TCs for each layer go up or down? Usually, you see that TCs go up until you get to some optimal size then decrease again.
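The first two checks can be done before launching a long run. Below is a minimal sketch using only the Python standard library; the helper name `clean_rows` and the tiny inline sample are made up for illustration and are not part of bio_corex. It replaces empty, non-numeric, and non-finite cells with the same sentinel passed to --missing, and reports the largest remaining value so extreme outliers are easy to spot.

```python
import csv
import io
import math

MISSING = -1e6  # same sentinel as --missing=-1e6


def clean_rows(lines, delimiter="|", missing=MISSING):
    """Parse delimited text with a header row.

    Cells that are empty, non-numeric, nan, or infinite are replaced by
    `missing`. Returns (header, rows, n_replaced).
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    rows, n_replaced = [], 0
    for record in reader:
        row = []
        for cell in record:
            try:
                value = float(cell)
            except ValueError:
                value = math.nan  # empty or non-numeric cell
            if not math.isfinite(value):
                value = missing
                n_replaced += 1
            row.append(value)
        rows.append(row)
    return header, rows, n_replaced


# Hypothetical three-column sample with an empty cell, "NA", and "inf".
sample = io.StringIO("DX101|DX110|DX115\n8|0|1\n0||NA\n3|inf|2\n")
header, rows, n_replaced = clean_rows(sample)

# Largest real value, ignoring the sentinel, as a quick outlier check.
largest = max(v for row in rows for v in row if v != MISSING)
print(f"replaced {n_replaced} cells, largest value {largest}")
```

For a real file, passing an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object should work the same way.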

Not an issue, but you should add the option --no_row_names, since your first column is not an index.
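As an aside on the traceback itself: any comparison with nan evaluates to False, so a "keep the best TC so far" loop never records a best result when every TC is nan. The snippet below is a simplified, hypothetical reconstruction of that pattern, not the actual code in corex.py.

```python
import math


def pick_best(tcs):
    """Return the index of the run with the highest TC."""
    best_tc = -math.inf
    for i, tc in enumerate(tcs):
        if tc > best_tc:  # always False when tc is nan
            best_tc = tc
            best = i
    # If no comparison succeeded, `best` was never assigned.
    return best


print(pick_best([1.0, 3.0, 2.0]))  # index of the largest TC

try:
    pick_best([float("nan")])
except UnboundLocalError as err:
    print("nan run failed:", type(err).__name__)
```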

Another possibility for your dataset is to "bin" the data and treat it as discrete. For instance, you might map 0 to 0, 1 to 1, and any number greater than 1 to 2. Then run without the -c option (-c treats the data as continuous).
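A minimal sketch of that binning in plain Python, assuming the counts are non-negative integers; the function names are illustrative only, and the two rows are the first four columns of the first two data rows in the sample file:

```python
def bin_count(x):
    """Map a count to 0, 1, or 2, where 2 means 'greater than 1'."""
    if x < 0:
        raise ValueError(f"expected a non-negative count, got {x}")
    return min(int(x), 2)


def bin_rows(rows):
    """Apply bin_count to every cell of a list of rows."""
    return [[bin_count(x) for x in row] for row in rows]


rows = [[8, 0, 1, 6], [0, 4, 0, 0]]
binned = bin_rows(rows)
print(binned)
```

The binned rows would then be written back out with the same delimiter and header before running without -c.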

gregversteeg commented 5 years ago

One other suggestion.

This looks like count data. I've always meant to include a specific handling of count data, but haven't yet. One thing that works well for count data is to transform each value to log_2(1+x). The 0's and 1's stay the same, but the long tail of high counts is compressed inward. This makes the numerical modeling easier by reducing outliers.
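A small sketch of that transform in plain Python. Leaving the --missing sentinel untouched is an assumption on my part about how you would want to combine the two, and the function name is illustrative:

```python
import math

MISSING = -1e6  # sentinel passed as --missing=-1e6


def log_count(x, missing=MISSING):
    """Return log_2(1 + x) for a count x, leaving the sentinel as-is."""
    if x == missing:
        return x
    return math.log2(1 + x)


# 0 and 1 are unchanged, while a large count such as 453
# (the largest value in the sample rows) is pulled in sharply.
for count in (0, 1, 3, 453):
    print(count, round(log_count(count), 2))
```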

prasanna224 commented 5 years ago

Thanks for your quick response. We will try the suggestions you have outlined here.

gregversteeg / bio_corex

Issue while running vis_corex #15