kstaats / karoo_gp

A Genetic Programming platform for Python with TensorFlow for wicked-fast CPU and GPU support.

Tensorflow meta-issue #73

Open granawkins opened 2 years ago

granawkins commented 2 years ago

As part of the engine-api PR, there is an option to choose NumpyEngine or TensorflowEngine. This has led to some discussion about what to do with respect to tensorflow. There were several ongoing conversations, so I'm combining them here.

Outstanding issues:

  • We're using tensorflow 1, which is outdated and partially deprecated, so we should update to tensorflow 2.
  • tf is imported lazily by the TensorflowEngine because (at least for me) it takes about 2s extra to load. So if you use the NumpyEngine instead, you can save those 2 seconds by avoiding importing tf at all. The way I've done it is by copying a LazyLoader class from tensorflow themselves. This seems inelegant, and also may be sensitive to licensing. Should look for a better solution, like maybe importing tf in TensorflowEngine.__init__().
  • @asksak has asked a few questions about tensorflow:
    • #39 Utilizing AMD GPU. Sounds like you got it to work, anything we should update?
    • #66 Pointed out the limitations of tensorflow v1; noted.
    • #72 Found a problem with tf.map_fn(); this is removed in engine-api and shouldn't be an issue going forward.

Finally, it's not obvious to me that tensorflow will ever be faster than numpy for what we're doing. It seems that tensorflow is fast when:

  • Working with matrices (2d), while we work with arrays (1d)
  • Doing multiplication or dot-products specifically, while we do many different operations

Anyway, we should continue to support it, but monitor the performance and make sure users are getting the optimal performance.
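The LazyLoader question could also be sidestepped by deferring the import to the engine constructor, as suggested. A minimal sketch of that idea (hypothetical: the class body is invented, and the stdlib `json` module stands in for tensorflow so the snippet runs anywhere):

```python
import importlib

class TensorflowEngine:
    """Sketch: defer the heavy backend import until an engine is actually
    constructed, so NumpyEngine users never pay the ~2s import cost."""
    _BACKEND = 'json'  # stand-in; Karoo itself would use 'tensorflow'

    def __init__(self):
        # The import happens here, at construction time, not at module load.
        self.tf = importlib.import_module(self._BACKEND)

engine = TensorflowEngine()
print(engine.tf.__name__)  # json
```

Since Python caches imported modules in sys.modules, repeat constructions after the first are cheap.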

asksak commented 2 years ago

Thank you for your technical contribution:

I have the following observations:

Calculating fitness (evaluating trees) in Karoo GP is the responsibility of tensorflow. That is not the performance problem I am facing, as it completes quite quickly. My problem is crossover:

1.1

(Observation from original version of Karoo GP)

An arithmetic expression like

    a + (b**c) - sin(d) + ...

however large, runs hundreds of times faster through the crossover section than:

    iflargerthan(ifsmallerthanequal(a, b), iflargerthan(e, f)) ...

however large.

I got logic to work easily in Karoo GP as described in the equation above; however, crossover takes very long to complete even with a population of just 100.

Since Tensorflow evaluates the trees, I did not find a difference in speed between CPU and GPU for the two types of equations.

1.2

A huge, notable improvement in crossover time was observed in the new, under-development version of Karoo GP, which is very promising; however, population generation is much slower. Regardless, the total runtime of the new version is much smaller.

1.3

In the original version, I was easily able to modify the code to run under tensorflow 2. It was only a little faster.

1.4

For the logic part I had to use iflargerthan to prevent sympy from evaluating the equations, or parts of them, even though evaluate was set to False. This is because in sympy:

    a and b and c == c

which is a known problem reported by sympy users.

To make sympy perform its job properly, I had to make sure it was fed operator names it could not identify.
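The collapse described above is really Python's `and` semantics rather than a sympy bug: `and` returns its last operand when all operands are truthy, so a chain of symbols reduces to the final one before sympy ever sees a logical expression. A minimal demonstration with plain truthy objects standing in for sympy Symbols:

```python
# Python's `and` returns the last operand when all are truthy, so an
# expression chain like `a and b and c` collapses to `c` before any
# symbolic library can intercept it.
a, b, c = 'a', 'b', 'c'   # stand-ins for sympy Symbols (all truthy)
result = a and b and c
print(result)   # c
```

This is why custom function names like iflargerthan, which sympy cannot recognize or simplify, keep the logical structure intact.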

1.5

I will share the parts of the code that made the logic work, as well as the ones that made sympy behave. However, I believe that if the new version could generate a population as fast as the original version, Karoo would be ready to incorporate logic, which, in my experience, is the most important part of GP.

Will get back to you in a bit with the code.

Aymen

asksak commented 2 years ago

In the original Karoo version, the following mods make logical operators work properly:

This avoids SYMPY evaluation:

    'ifgreaterthan': tf.math.greater,
    'ifgreaterthanequal': tf.math.greater_equal,
    'iflessthan': tf.math.less,
    'iflessthanequal': tf.math.less_equal,

Labels used for logical operators:

    if tree[6, node_id] in ('ifgreaterthan', 'ifgreaterthanequal', 'iflessthan', 'iflessthanequal'):
        return (tree[6, node_id] + '(' + self.fx_eval_label(tree, tree[9, node_id]) + ','
                + self.fx_eval_label(tree, tree[10, node_id]) + ')')

Tensorflow 2 compatibility:

def fx_fitness_eval(self, expr, data, get_pred_labels=False):

    # 1 - Load data into TF vectors
    tensors = {}
    for i in range(len(self.terminals)):
        var = self.terminals[i]
        # converts data into vectors
        tensors[var] = tf.constant(data[:, i], dtype=tf.float32)

    # 2- Transform string expression into TF operation graph #marker
    result = tf.cast(self.fx_fitness_expr_parse(expr, tensors), dtype=tf.float32)
    pred_labels = tf.no_op() # a placeholder, applies only to CLASSIFY kernel
    solution = tensors['s'] # solution value is assumed to be stored in 's' terminal

    @tf.function
    def sessrun(result, pred_labels, solution): #sessrun is a custom function where the session used to be

        # 3- Add fitness computation into TF graph
        if self.kernel == 'c': # CLASSIFY kernel

            if get_pred_labels:
                pred_labels = tf.map_fn(self.fx_fitness_labels_map, result,
                                        fn_output_signature=(tf.int32, tf.string),
                                        swap_memory=True)

            skew = (self.class_labels / 2) - 1

            rule11 = tf.equal(solution, 0)
            rule12 = tf.less_equal(result, 0 - skew)
            rule13 = tf.logical_and(rule11, rule12)

            rule21 = tf.equal(solution, self.class_labels - 1)
            rule22 = tf.greater(result, solution - 1 - skew)
            rule23 = tf.logical_and(rule21, rule22)

            rule31 = tf.less(solution - 1 - skew, result)
            rule32 = tf.less_equal(result, solution - skew)
            rule33 = tf.logical_and(rule31, rule32)

            pairwise_fitness = tf.cast(tf.logical_or(tf.logical_or(rule13, rule23), rule33), tf.int32)

        elif self.kernel == 'r': # REGRESSION kernel

            pairwise_fitness = tf.squared_difference(solution, result)

        elif self.kernel == 'm': # MATCH kernel

            RTOL, ATOL = 1e-05, 1e-08
            pairwise_fitness = tf.cast(tf.less_equal(tf.abs(solution - result), ATOL + RTOL * tf.abs(result)), tf.int32)

        else:
            raise Exception('Kernel type is wrong or missing. You entered {}'.format(self.kernel))

        fitness = tf.reduce_sum(pairwise_fitness)

        return result, pred_labels, solution, fitness, pairwise_fitness

    # Process TF graph and collect the results: sessrun()
    result, pred_labels, solution, fitness, pairwise_fitness = sessrun(result, pred_labels, solution)

    return {'result': result, 'pred_labels': pred_labels, 'solution': solution, 'fitness': fitness, 'pairwise_fitness': pairwise_fitness}

Logical operator results are converted into float; other operators are unaffected:

def fx_fitness_node_parse(self, node, tensors):

    ...

    elif isinstance(node, ast.Call):  # e.g. ifgreaterthan(a,b) -> 0 or 1 as float (same result if bool or float)
        return tf.cast(operators[node.func.id](*[self.fx_fitness_node_parse(arg, tensors) for arg in node.args]),
                       tf.float32)
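To see the dispatch mechanism in isolation, here is a self-contained scalar sketch of the same idea, with plain-Python stand-ins for the tf.math comparisons (the stand-ins and the example expression are mine, so the snippet runs without TensorFlow):

```python
import ast
import operator

# Stand-ins for the tf.math comparisons (assumed two-argument shape);
# each returns 1.0/0.0, mirroring the tf.cast(..., tf.float32) above.
operators = {
    'ifgreaterthan': lambda a, b: float(operator.gt(a, b)),
    'ifgreaterthanequal': lambda a, b: float(operator.ge(a, b)),
    'iflessthan': lambda a, b: float(operator.lt(a, b)),
    'iflessthanequal': lambda a, b: float(operator.le(a, b)),
}

def node_parse(node, terminals):
    """Recursively evaluate a parsed expression tree (scalar version)."""
    if isinstance(node, ast.Name):       # terminal, e.g. 'a'
        return terminals[node.id]
    if isinstance(node, ast.Call):       # e.g. ifgreaterthan(a, b) -> 0.0 or 1.0
        args = [node_parse(arg, terminals) for arg in node.args]
        return operators[node.func.id](*args)
    raise ValueError('unsupported node')

expr = ast.parse('iflessthanequal(ifgreaterthan(a, b), c)', mode='eval').body
print(node_parse(expr, {'a': 5.0, 'b': 2.0, 'c': 1.0}))   # 1.0
```

Because every comparison yields a float, logical operators nest freely inside each other and inside arithmetic operators, which is exactly what makes the deeply nested expressions below evaluable.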
asksak commented 2 years ago

Example run result using logical operators:

iflessthanequal(iflessthanequal(iflessthanequal(iflessthanequal(iflessthanequal(A18x, ifgreaterthan(iflessthanequal(ifgreaterthan(iflessthanequal(xmin, A14x), iflessthanequal(A7, A9x)), A30x), iflessthanequal(iflessthanequal(A22x, A26x), ifgreaterthan(iflessthanequal(A30x, A11x), A3x)))), ifgreaterthan(iflessthanequal(A10x, A26x), iflessthanequal(A10x, A17x))), iflessthanequal(iflessthanequal(ifgreaterthan(A5x, A31x), iflessthanequal(iflessthanequal(ifgreaterthan(ifgreaterthan(A11x, xmin), iflessthanequal(A2x, A9x)), ifgreaterthan(ifgreaterthan(A12x, A16x), iflessthanequal(A20x, A18x))), ifgreaterthan(ifgreaterthan(iflessthanequal(A31x, A17x), iflessthanequal(A20x, A9x)), iflessthanequal(iflessthanequal(A4x, A25x), A30x)))), ifgreaterthan(iflessthanequal(ifgreaterthan(iflessthanequal(ifgreaterthan(A11x, A29x), iflessthanequal(A30x, A6x)), A25x), iflessthanequal(A19x, A30x)), iflessthanequal(A19x, ifgreaterthan(A25x, A19x))))), iflessthanequal(ifgreaterthan(A10x, xrange), iflessthanequal(iflessthanequal(iflessthanequal(xmin, A6x), iflessthanequal(iflessthanequal(ifgreaterthan(iflessthanequal(A27x, A9x), iflessthanequal(A9x, A24x)), ifgreaterthan(A29x, ifgreaterthan(xmax, A14x))), iflessthanequal(iflessthanequal(A19x, iflessthanequal(xrange, A15x)), A11x))), A31x))), ifgreaterthan(iflessthanequal(iflessthanequal(iflessthanequal(iflessthanequal(iflessthanequal(iflessthanequal(A14x, iflessthanequal(A12x, A22x)), iflessthanequal(ifgreaterthan(A14x, A9x), A11x)), iflessthanequal(iflessthanequal(ifgreaterthan(A12x, A16x), ifgreaterthan(A31x, A8x)), iflessthanequal(ifgreaterthan(A17x, A27x), ifgreaterthan(A14x, A6x)))), iflessthanequal(iflessthanequal(ifgreaterthan(iflessthanequal(A11x, A22x), iflessthanequal(A15x, A6x)), A10x), A19x)), ifgreaterthan(ifgreaterthan(iflessthanequal(iflessthanequal(A30x, iflessthanequal(A22x, A26x)), iflessthanequal(ifgreaterthan(A13x, A17x), ifgreaterthan(A17x, A27x))), iflessthanequal(iflessthanequal(iflessthanequal(A8x, A6x), 
ifgreaterthan(A24x, A11x)), ifgreaterthan(iflessthanequal(A4x, A6x), ifgreaterthan(A22x, xmax)))), iflessthanequal(ifgreaterthan(A31x, iflessthanequal(A27x, A20x)), ifgreaterthan(iflessthanequal(A29x, iflessthanequal(A22x, A15x)), iflessthanequal(ifgreaterthan(A18x, A16x), A1x))))), iflessthanequal(iflessthanequal(iflessthanequal(xmin, A6x), ifgreaterthan(ifgreaterthan(ifgreaterthan(iflessthanequal(A9x, xmin), xmin), iflessthanequal(iflessthanequal(A2x, A12x), A31x)), A31x)), ifgreaterthan(ifgreaterthan(ifgreaterthan(A14x, iflessthanequal(xmin, A22x)), iflessthanequal(A24x, A19x)), iflessthanequal(ifgreaterthan(A11x, iflessthanequal(iflessthanequal(A18x, A10x), A17x)), ifgreaterthan(ifgreaterthan(iflessthanequal(A8x, A6x), xmax), iflessthanequal(A23x, iflessthanequal(A8x, A25x))))))), ifgreaterthan(iflessthanequal(ifgreaterthan(A30x, iflessthanequal(iflessthanequal(ifgreaterthan(A20x, xrange), iflessthanequal(A10x, A7)), ifgreaterthan(A12x, A15x))), ifgreaterthan(iflessthanequal(A11x, A25x), iflessthanequal(ifgreaterthan(iflessthanequal(ifgreaterthan(A26x, A15x), iflessthanequal(A3x, A5x)), iflessthanequal(A7, A9x)), ifgreaterthan(A20x, iflessthanequal(ifgreaterthan(A16x, xrange), iflessthanequal(xrange, xmin)))))), A30x)))

The final result is confirmed to be 1 or 0.

asksak commented 2 years ago

Pardon me for not using GitHub the way it's supposed to be used.

asksak commented 2 years ago

Note that I tested my logical operators on the Iris classification dataset and got 100% accuracy by the 6th generation.

asksak commented 2 years ago

Summary:

Original Karoo:

  1. Fast generation creation
  2. Very slow crossover

New under-development Karoo:

  1. Slow generation creation
  2. Fast crossover
kstaats commented 2 years ago

Thank you for your painstaking testing of the revisions to Karoo. We are pleased to have your support. Stay tuned, as there are many changes yet to come; your observations are noted and archived.

asksak commented 2 years ago


Greetings,

Would you please supply me with the version numbers of Python, Numpy, and Tensorflow you are using in development? I need to figure out a speed problem.

Thank you

kstaats commented 2 years ago

Thank you Aymen. Grant or Ezio will respond soon with the version numbers, as requested. As for Numpy vs TF, that will be proven when we return to testing against much larger datasets.


granawkins commented 2 years ago

Hi @asksak - thanks for all the feedback! Apologies for the slow response, I've been deep in Karoo and needed to come up for air :)

As you know, currently we're still using tensorflow v1, just for continuity, and plan to update to v2 soon.

We're using the latest version of Python3 (though any Python3 should work) and the latest numpy (1.23).

I hope to implement logical operators in the next few days, and your comments above are really helpful for that, so stay tuned.

granawkins commented 2 years ago

I did some exploration of how/when tensorflow beats numpy. I expected nothing, but I found something! Here's the complete notebook, and a summary is below. All of this was run on an Nvidia Tesla P100 GPU on Google Cloud.

First off: Eager Execution

To use Tensorflow 1, you have to open a session, compile a graph of functions, and then execute data on that graph. In Tensorflow 2, you just put your data in tensors (no session needed) and call functions on those tensors as you would on numpy arrays. This was the headline feature of TF2.

From a big-picture Karoo perspective, this means we don't need to collapse Trees into strings and re-build them in a tensorflow graph; we can execute in-place in the Node class. This is a huge reduction in complexity.

So for the demo I wrote a stripped-down version of our Node component with that in-place execution, and the experiments below show that it takes proper advantage of the GPU. I'll implement this properly in Karoo in a future PR.
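As an illustration of that in-place execution, here is a stripped-down, hypothetical Node sketch (the names are mine, not Karoo's actual class), shown with NumPy; under TF2's eager mode the same structure works with tf.constant data and tf.math ops:

```python
import numpy as np

# Minimal sketch of in-place tree evaluation: with eager execution, each
# node applies its operator directly to its children's results -- no string
# round-trip, no session graph rebuild.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = children

    def evaluate(self, data):
        if not self.children:                # terminal: look up a data column
            return data[self.label]
        ops = {'+': np.add, '-': np.subtract, '*': np.multiply, '/': np.divide}
        left, right = (c.evaluate(data) for c in self.children)
        return ops[self.label](left, right)

data = {'a': np.array([2.0, 4.0])}
tree = Node('/', (Node('+', (Node('a'), Node('a'))),
                  Node('*', (Node('a'), Node('a')))))
print(tree.evaluate(data))   # (a+a)/(a*a) == 2/a -> [1.  0.5]
```

Because the tree evaluates itself recursively, crossover and mutation can swap subtree objects directly instead of re-parsing expression strings.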

Baseline Comparison

I evaluate a simple arithmetic expression, (a+a)/(a*a), repeated N times, on two different-sized datasets: (100, 100) and (10,000, 100). Numpy is faster for the smaller set; Tensorflow is faster for the larger set. [chart: baseline_comparison]
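The NumPy half of that baseline can be approximated in a few lines (a hypothetical re-creation; the notebook's TF/GPU half is omitted, and the repeat counts here are my own):

```python
import timeit
import numpy as np

# Evaluate (a+a)/(a*a) -- algebraically 2/a -- n times over an array,
# mimicking repeated tree evaluation across a population.
def run(a, n=100):
    out = None
    for _ in range(n):
        out = (a + a) / (a * a)
    return out

small = np.random.rand(100, 100)
large = np.random.rand(10_000, 100)
for name, data in [('small', small), ('large', large)]:
    secs = timeit.timeit(lambda d=data: run(d), number=3)
    print(f'{name}: {secs:.3f}s')
```

The TF2 comparison point is the same loop over tf.constant tensors; the crossover in favor of the GPU appears as the array grows.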

Karoo Comparison

I generate a population of 10 trees with depth=3 and execute the sample data gens=2 times. I test a range of sample sizes, in binary and decimal steps. It looks like Tensorflow is generally faster above roughly 30,000 samples (at these settings). [chart: karoo_comparison]

Conclusion

  • Tensorflow and Numpy shine in different settings, so let's include both.
  • The Node in-place execution works for Numpy and Tensorflow2, and is a huge reduction in complexity to boot. So I'll implement that in an upcoming PR. That will incidentally be our update from TF1 to TF2.

kstaats commented 2 years ago

Very well done, Grant. Excellent! I had heard about the new functions of TF being even easier to implement, but didn't realize how much they had integrated high-level functions. Incredible.

As for the dataset size, this makes perfect sense. While TF may be easier to call, the reality is that at the hardware level, GPUs must (by the very nature of the hardware) have every register filled with each logical execution, including null values if not used. This process of taking any given mathematical expression and breaking it down into register-by-register allocation was originally (2010s) done by hand in C. Very taxing. Caffe, Torch, and TF came along and enabled non-uber-geeks to work one level higher. Keras turned TF into a proper Python library. And now, it seems, it is even simpler.

The dataset sizes you describe make sense. This is what Marco and I found a few years ago with my first paper on Karoo: GPUs suffer a hit in spin-up (what I described above) and prep, while Numpy just gets to work. It will be fun to compare numbers, to see where the crossover point used to be versus now, with this new version.

kstaats commented 2 years ago

Yes, TF will be faster than Numpy on very large datasets. The original research and paper demonstrated this:

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=nYNEzYgAAAAJ&cstart=20&pagesize=80&sortby=pubdate&citation_for_view=nYNEzYgAAAAJ:8k81kl-MbHgC

With the revised code, which replaces arrays with objects, it is likely that TF will not be faster until we reach larger datasets than before. A revised research project would need to be conducted to discover the tipping point.

To remove TF without this research would be premature.
