vertica / VerticaPy

VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
https://www.vertica.com/python/
Apache License 2.0

problem SEEDED_RANDOM function #451

Closed lepelletieralexandre closed 1 year ago

lepelletieralexandre commented 1 year ago

Hello,

The methods vdataframe.sample() and vdataframe.train_test_split() (and perhaps others) use the Vertica function "SEEDED_RANDOM", but there is a problem with the distribution of the values this function produces.

Take the "titanic" dataset as an example:

from verticapy import vDataFrame
from verticapy.datasets import load_titanic

# schema, table_read and cursor must be defined beforehand
load_titanic(schema = schema, name = table_read, cursor = cursor)
base = vDataFrame(schema + ".titanic", cursor)

If we randomly draw samples without replacement from this dataset, the number of rows with "survived" = 1 in a sample follows a hypergeometric distribution. If len(sample) = 1/2 len(dataset), the distribution for the "survived" feature is:

from scipy.stats import hypergeom
import matplotlib.pyplot as plt
import numpy as np

x_min = 180
x_max = 270

n_defaut_tot = len(base.search("survived = 1"))
N = len(base)
p = n_defaut_tot/N
n = N//2

x = []
for i in range(n_defaut_tot):
    x.append(i)

rv = hypergeom(N, n_defaut_tot, n)
x = np.arange(0, n+1)
pmf_dogs = rv.pmf(x)

plt.plot(x, pmf_dogs, 'bo', ms=8, label='hypergeom pmf')
plt.grid()
plt.xlim(x_min,x_max)
plt.title('hypergeom distribution',fontsize=10)
plt.xlabel('x')
plt.ylabel('hypergeom Distribution')
plt.show()
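To compare the histograms further down against this reference with numbers rather than by eye, the mean and standard deviation of the hypergeometric distribution can be computed directly. The following is a minimal sketch using only the standard library; the helper names and the values of N and K are illustrative assumptions, not the real titanic counts.

```python
from math import comb, sqrt

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n rows without replacement from N rows, K of which are positive."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def hypergeom_mean_std(N, K, n):
    """Closed-form mean and standard deviation of the hypergeometric distribution."""
    p = K / N
    mean = n * p
    var = n * p * (1 - p) * (N - n) / (N - 1)
    return mean, sqrt(var)

# Illustrative values only (assumption): 1000 rows, 400 of them positive
N, K = 1000, 400
n = N // 2
mean, std = hypergeom_mean_std(N, K, n)
print(f"expected count: {mean:.1f}, standard deviation: {std:.2f}")
```

An empirical histogram whose spread is far from this standard deviation points to a sampling problem.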

image

But with the "sample" method we get this distribution:

import pandas as pd

# nb_echantillon (the number of samples to draw) must be defined beforehand
tableau_croise = pd.DataFrame([[0 for i in range(2)] for j in range(nb_echantillon)], columns = ["nb_sains", "nb_defauts"])
if len(base) != 0:
    for j in range(nb_echantillon):
        base_sample = base.sample(x=0.5)
        tableau_croise.loc[j] = [len(base_sample.search("survived = 0")), len(base_sample.search("survived = 1"))]
y, x = np.histogram(tableau_croise["nb_defauts"], bins=30, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0
plt.figure(figsize=(12,8))
plt.hist(tableau_croise["nb_defauts"], bins=30, density=True)
plt.title("Nb defauts")
plt.show()

image

If I decompose the "sample" method, I get this code:

import random
import verticapy

tableau_croise = pd.DataFrame([[0 for i in range(2)] for j in range(nb_echantillon)], columns = ["nb_sains", "nb_defauts"])
for j in range(nb_echantillon):
    vdf = base.copy()
    name = "test_random"
    x = 0.5
    # random.randint needs integer bounds (10e6 is a float)
    random_func = "SEEDED_RANDOM({})".format(random.randint(-10**7, 10**7))
    vdf.eval(name, random_func)
    # silence the filter message, then restore the option
    print_info_init = verticapy.options["print_info"]
    verticapy.options["print_info"] = False
    vdf.filter("{} <= {}".format(name, x))
    verticapy.options["print_info"] = print_info_init
    tableau_croise.loc[j] = [len(vdf.search("survived = 0")), len(vdf.search("survived = 1"))]
y, x = np.histogram(tableau_croise["nb_defauts"], bins=30, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0
plt.figure(figsize=(12,8))
plt.hist(tableau_croise["nb_defauts"], bins=30, density=True)
plt.title("Nb defauts")
plt.show()

image

This gives the same distribution as the "sample" method, but it is not the right distribution. If I use the same code but replace the "SEEDED_RANDOM" function with the "RANDOM" function, I get this distribution:

tableau_croise = pd.DataFrame([[0 for i in range(2)] for j in range(nb_echantillon)], columns = ["nb_sains", "nb_defauts"])
for j in range(nb_echantillon):
    vdf = base.copy()
    name = "test_random"
    x = 0.5
    random_func = "RANDOM()"
    vdf.eval(name, random_func)
    print_info_init = verticapy.options["print_info"]
    verticapy.options["print_info"] = False
    vdf.filter("{} <= {}".format(name, x))
    verticapy.options["print_info"] = print_info_init
    tableau_croise.loc[j] = [len(vdf.search("survived = 0")), len(vdf.search("survived = 1"))]
y, x = np.histogram(tableau_croise["nb_defauts"], bins=30, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0
plt.figure(figsize=(12,8))
plt.hist(tableau_croise["nb_defauts"], bins=30, density=True)
plt.title("Nb defauts")
plt.show()

image

It's the right distribution! I ran other tests with Python and R functions, and I obtained the right result.

So there is a problem with the SEEDED_RANDOM function, but it is a Vertica function and I don't have access to its code. Can you investigate?
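A check of this kind can be reproduced without a database. The sketch below is only a stand-in: Python's random module plays the role of RANDOM(), the labels are synthetic (400 positives out of 1000 rows, an assumption), and the function name is hypothetical. Note that a row-wise filter of the form "value <= 0.5" keeps each row independently, so the sample size itself varies, and the number of positives kept follows a Binomial(K, 0.5) distribution with mean K/2 and standard deviation sqrt(K)/2, which is close to the fixed-size hypergeometric case but slightly wider.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def positives_kept(labels, x, rng):
    """Keep each row independently with probability x; return how many kept rows are positive."""
    return sum(label for label in labels if rng.random() <= x)

rng = random.Random(0)            # fixed seed so the run is repeatable
labels = [1] * 400 + [0] * 600    # synthetic data (assumption), K = 400 positives
counts = [positives_kept(labels, 0.5, rng) for _ in range(2000)]

K = sum(labels)
print("empirical mean/std:", mean(counts), pstdev(counts))
print("binomial  mean/std:", K / 2, sqrt(K) / 2)
```

If the generator behaves well, the empirical values land near the binomial ones; a much larger or smaller spread is the kind of anomaly reported here for SEEDED_RANDOM.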

Regards,

Alexandre

oualib commented 1 year ago

Hi Alexandre,

We are on it, but it may take some time. For your information, the SEEDED_RANDOM function cannot be used alone. It needs an id column that is used to sort the data. Without it, it can give unexpected results.

For example, the train_test_split doc says (order_by param): Without this parameter, the seeded random number used to split the data into train and test cannot guarantee that no collision occurs. Use this parameter to avoid collisions.

This function is used to get the same sample without having to create extra views or tables, and so without polluting the user's database. We'll investigate and get back to you.
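The need for a sort column can be illustrated with a small analogy in plain Python. This is not how Vertica implements SEEDED_RANDOM; it only shows the general point that a seeded generator yields a fixed sequence of numbers, so which row receives which number depends on the order in which rows are read. The function name and the row ids are hypothetical.

```python
import random

def seeded_split(row_ids, seed, x=0.5):
    """Give each row one pseudo-random number, in the order given, and keep rows with value <= x."""
    rng = random.Random(seed)
    return {row_id for row_id in row_ids if rng.random() <= x}

rows = list(range(20))
shuffled = rows[:]
random.Random(1).shuffle(shuffled)   # simulate rows arriving in a different order

# Same seed, same order: the split is reproducible
same_order = seeded_split(rows, seed=42) == seeded_split(rows, seed=42)
# Same seed, rows sorted by id first: reproducible again
after_sort = seeded_split(sorted(shuffled), seed=42) == seeded_split(rows, seed=42)
# Same seed, different order: the selected rows usually differ
other_order = seeded_split(shuffled, seed=42) == seeded_split(rows, seed=42)

print(same_order, after_sort, other_order)
```

Sorting by a stable id before assigning the numbers is what makes the same seed select the same rows every time.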

oualib commented 1 year ago

@glarik or @afard, do we have someone who can look at the C++ code of the SEEDED_RANDOM function to see whether it is correct and, if not, how we can improve it?

oualib commented 1 year ago

From the VerticaPy side, I can only add an option to use the ordinary 'random' function, but in that case the user will need to materialise the generated views. This would remove the 'user-friendly' aspect of the function.

Personally, I don't know how much it will negatively affect the output, as seeded_random is used to get one possible split.

@lepelletieralexandre if you really need this, I think you should open a Vertica ticket so the function can be investigated in more depth.

gaetan-dion commented 1 year ago

Hi, after discussions with our managers, ticket 02622675 has been created.

afard commented 1 year ago

Because of legacy concerns, we have not modified the in-DB SEEDED_RANDOM function. Instead, we introduced a new function, named DISTRIBUTED_SEEDED_RANDOM, in Vertica 24.1.0. The new function does not have the distribution problem with its pseudo-random numbers. Like the SEEDED_RANDOM function, the new function will not be publicly documented or supported, because it still does not meet the general expectations of a seeded random function.

The VerticaPy code has been modified so that it uses the new DISTRIBUTED_SEEDED_RANDOM function instead of SEEDED_RANDOM when it is connected to a Vertica server with a version later than 23.x.
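The version-based switch described above could look roughly like the sketch below. It is not the actual VerticaPy code from the pull request: the helper name, the (major, minor, patch) tuple, and the assumption that the new function takes a seed argument in the same way as SEEDED_RANDOM are all illustrative.

```python
def pick_seeded_random(server_version, seed):
    """Return the SQL expression to use for a seeded random column.

    server_version is a (major, minor, patch) tuple (hypothetical representation).
    Assumes DISTRIBUTED_SEEDED_RANDOM accepts a seed like SEEDED_RANDOM does.
    """
    if server_version >= (24, 1, 0):
        name = "DISTRIBUTED_SEEDED_RANDOM"
    else:
        name = "SEEDED_RANDOM"
    return f"{name}({seed})"

print(pick_seeded_random((23, 4, 0), 7))
print(pick_seeded_random((24, 1, 0), 7))
```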

afard commented 1 year ago

This issue is resolved by https://github.com/vertica/VerticaPy/pull/695.