siradam / DataMining_Project

0 stars 1 forks source link

Use relational database for clustering #29

Closed lorenzznerol closed 2 years ago

lorenzznerol commented 3 years ago

implement the clustering issue results by putting an index on the cluster as well to zoom quickly into that cluster

lorenzznerol commented 3 years ago

Up to now, I have not found out how to use Python on the database itself - inside PostgreSQL. I could install the language and activate it. But the server does not run the kmeans query, and this is likely a question of the right Python sub-version, that is, which 3.7.x is installed. PostgreSQL 13 surely needs 3.7, though, that is for sure. That is why I think of implementing the kmeans only with SQL, instead, with the help of https://www.sisense.com/blog/multi-dimensional-clustering-using-k-means-postgres/. But this is just a workaround, since it was thought to implement more complicated code as well in plpython3u on PostgreSQL. If this all does not work, another step could be to skip PostgreSQL and go to a big data solution directly. This would be of value for the project chair anyway. But for now, I go on with PostgreSQL.

lorenzznerol commented 3 years ago

See "Think big (scalability)" comment. https://github.com/siradam/DataMining_Project/issues/13#issuecomment-852114288

That is why this approach is now dropped and closed, going on with #17.

lorenzznerol commented 3 years ago

This gets reopened since #13 and #38 show that time-scale (postgresql for time-series) can well be an optimized database for the problem at hand, offering automatic optimization and the chance to use a webserver for free. #17 is not excluded from the solutions, but postponed now that plpython3 can be used.

lorenzznerol commented 3 years ago

Starting point, see #13:


I changed the kmeans test function so that it returns not a pickle dump, but a table (a merger of the df and the new column for the kmeans cluster).

CREATE OR replace FUNCTION kmeans3(input_table text, columns text[], clus_num int) RETURNS table(lon float, lat float, k float) AS

$$

from pandas import DataFrame
from sklearn.cluster import KMeans
#from pickle import dumps

all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"

rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

frame = []

for i in rv:
    frame.append(i)
#df = DataFrame(frame).convert_objects(convert_numeric =True)
#df = pandas.to_numeric(DataFrame(frame))
df = DataFrame(frame).astype(float)
print(df.shape)
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
df['kmeans'] = kmeans.labels_ #.astype(float)
return df.values

$$ LANGUAGE plpython3u;

You can ask for the results with:

SELECT * FROM kmeans3('stokes', ARRAY['lon', 'lat'],3);
        lon        |        lat         | k
-------------------+--------------------+---
 5.171320915222168 | 43.288516998291016 | 0
 4.982897758483887 |  43.29656219482422 | 2
 4.962841033935547 |  43.29465103149414 | 2
 5.100956439971924 |  43.28042221069336 | 0
 5.134312629699707 |  43.29485321044922 | 0
 5.279866695404053 | 43.274662017822266 | 1
 5.095328330993652 |  43.30269241333008 | 0
 5.301522731781006 |  43.29117202758789 | 1
 5.205143928527832 |  43.30012512207031 | 1
(9 rows)

Strangely, it seems necessary to have k column as float in the return value, although there are clearly just integers in it. Typecast to int was not accepted. But it must be possible to export other data types to the same table. Small TODO.


To save the table result to a postgres table directly, either create the table in advance and insert:

create table tab_kmeans1(lon float, lat float, k float);
insert * into tab_kmeans1 SELECT * FROM kmeans3('stokes', ARRAY['lon', 'lat'],3);

or create a new table from the output table:

select * into tab_kmeans1 FROM kmeans3('stokes', ARRAY['lon', 'lat'],3);