update the computation of scores and errors

Hi @xksteven, following up on our discussion yesterday in #11, I just had the chance to run the changes on actual solutions, and there are some problems.

Since we added >0 to the average and strict accuracy computations, results[index] needs to be an array for these computations too, to avoid this error:
```
per_prob_res.append(np.mean(results[index] > 0))
TypeError: '>' not supported between instances of 'list' and 'int'
```
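One way to avoid this error is to convert each problem's results separately, so every comparison runs on a regular 1-D array. This is a minimal sketch, assuming results maps a problem index to a flat list of per-test outcomes; the sample data and the helper name accuracy_metrics are hypothetical.

```python
import numpy as np

# Hypothetical sample: problem index -> per-test outcomes
# (positive = test passed, negative = failure or error code)
results = {0: [1, 1], 1: [1, -1], 2: [-2]}

def accuracy_metrics(results):
    per_prob_res = []   # fraction of tests passed, per problem
    all_correct = []    # whether every test passed, per problem
    for index in results:
        # Converting one problem at a time keeps each array 1-D,
        # even when problems have different numbers of tests.
        problem_results = np.asarray(results[index])
        per_prob_res.append(np.mean(problem_results > 0))
        all_correct.append(np.all(problem_results > 0))
    return float(np.mean(per_prob_res)), float(np.mean(all_correct))

avg_accuracy, strict_accuracy = accuracy_metrics(results)
```

With the sample above, the per-problem pass rates are 1.0, 0.5 and 0.0, and only the first problem counts as fully correct.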
Setting tmp_results to np.array(res) to do the elementwise comparison directly fails on some examples. tmp_results holds the outputs for problems that may have different numbers of tests, so the second dimension of the "array" is not constant and the elementwise comparison fails. Adding dtype=object doesn't help either.
```
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
tmp_results = np.array(res)
/home/loubna_huggingface_co/datasets/pr.py:146: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
compile_errors = len(tmp_results[tmp_results==-2])
```
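To illustrate why dtype=object does not help: with ragged input, NumPy stores each inner list as a single opaque element, so comparing against -2 compares whole lists to an integer and never matches. A small sketch with made-up data (the values in res are hypothetical):

```python
import numpy as np

# Hypothetical ragged results: two tests for the first problem,
# a single compile-error marker for the second.
res = [[1, 1], [-2]]

# With dtype=object the array has shape (2,), and each element is a Python list.
tmp_results = np.array(res, dtype=object)

# The comparison is effectively `[1, 1] == -2` and `[-2] == -2`,
# both False, so the compile error in the second problem is missed.
compile_errors = len(tmp_results[tmp_results == -2])
```

Here compile_errors ends up as 0 even though one problem contains -2.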
I changed this to use lists. A compilation error appears once per problem ([[-2]]). The results will be slightly different for runtime errors: we assume that if a runtime error happens for one test it also happens for the other tests (looking at some examples, this seems to be the case), so we count it as one runtime error per problem.
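The list-based counting can be sketched roughly as follows. It assumes -2 marks a compilation error (as in the warning above) and that -1 marks a runtime error, which is an assumption not stated in this description; the helper name count_errors and the sample data are hypothetical.

```python
def count_errors(res):
    """Count each error type at most once per problem.

    `res` is a list with one entry per problem, each entry being
    the list of per-test outcomes for that problem.
    """
    # Assumed codes: -2 = compilation error, -1 = runtime error.
    compile_errors = sum(1 for problem in res if -2 in problem)
    runtime_errors = sum(1 for problem in res if -1 in problem)
    return compile_errors, runtime_errors

# Hypothetical sample: the third problem hits a runtime error on every
# test but is still counted as a single runtime error.
res = [[1, 1], [-2], [-1, -1, -1], [1, 1, 1, 1]]
compile_errors, runtime_errors = count_errors(res)
```

Membership tests on plain lists work regardless of how many tests each problem has, so no array conversion is needed.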

Feel free to run more tests, or let me know if you have another suggestion for fixing this.
