siradam / DataMining_Project


Cluster the main dataset #10

Closed lorenzznerol closed 2 years ago

lorenzznerol commented 3 years ago

From a development guide:

Think big (scalability)

Keep in mind that, for now, your tool might process only small amounts of data (e.g. 1 GB), but in the long run the amount of data might grow much larger (e.g. 10 TB). Hence, when reading data, consider what to do if the data does not fit into memory (e.g. read files not all at once but in chunks of 10 MB).
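A minimal sketch of that chunked-reading advice with pandas (column names `lon`/`lat` are placeholders; here an in-memory buffer stands in for the real file):

```python
import io
import numpy as np
import pandas as pd

# An in-memory CSV standing in for a (potentially huge) file on disk.
csv_buffer = io.StringIO(
    pd.DataFrame({"lon": np.linspace(0, 1, 1000),
                  "lat": np.linspace(50, 51, 1000)}).to_csv(index=False)
)

# Read in chunks instead of all at once; each chunk is a small DataFrame.
total_rows = 0
lon_sum = 0.0
for chunk in pd.read_csv(csv_buffer, chunksize=100):
    total_rows += len(chunk)       # aggregate per chunk ...
    lon_sum += chunk["lon"].sum()  # ... instead of holding everything in memory

print(total_rows)  # 1000
```

Only per-chunk aggregates are kept, so peak memory stays bounded by the chunk size regardless of file size.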

This recommended coding practice shows that we will probably have to move beyond a Python-only analysis and use much of the SQL database's power, see #13 (or big data tools later in #17). Complex clustering algorithms like DTW, which take a lot of time and which we can only run in Python, will need an aggregated level, that is, aggregated days/weeks/months, k-means clusters, perhaps real-valued columns rounded to 1 or 2 decimals, mean values, quantiles and so on.
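The idea of pushing the aggregation into SQL before any Python-side DTW can be sketched with the stdlib `sqlite3` module (table and column names `obs`, `traj_id`, `temp` are made up for illustration; the real schema lives in #13):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (day TEXT, traj_id INTEGER, temp REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?, ?)", [
    ("2021-03-01", 1, 10.0), ("2021-03-01", 1, 12.0),
    ("2021-03-02", 1, 11.0), ("2021-03-01", 2, 20.0),
])

# The database does the heavy lifting: one mean temperature per trajectory
# and day, so Python only ever sees the (much smaller) aggregated series.
rows = con.execute(
    "SELECT traj_id, day, AVG(temp) FROM obs"
    " GROUP BY traj_id, day ORDER BY traj_id, day"
).fetchall()
print(rows)  # [(1, '2021-03-01', 11.0), (1, '2021-03-02', 11.0), (2, '2021-03-01', 20.0)]
```

The same `GROUP BY` pattern extends to weekly/monthly buckets, rounding, and quantiles on the real database.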

The prototypes can still use more exact data, which is why the core of this clustering task does not change.

siradam commented 3 years ago

Clustering: http://ijarcet.org/wp-content/uploads/IJARCET-VOL-4-ISSUE-3-642-648.pdf

We are considering several clustering methods for the data; each has its own advantages and disadvantages. k-means might fit for finding the most important locations and attributes. We could also cluster on location alone (lon / lat / mpa / perhaps z), then do an isolated clustering on temperature (temp / land), then an isolated clustering on speed (distance / time). When that is done, we would have three cluster labels assigned to each row, and those cluster values could then be the basis for DTW or DBSCAN.
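The three isolated clusterings described above could look like this with scikit-learn (random data stands in for the real columns; the feature grouping is the assumption from the comment, not settled design):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 300
# Hypothetical feature groups, per the comment above.
groups = {
    "loc": rng.normal(size=(n, 2)),    # lon / lat
    "temp": rng.normal(size=(n, 1)),   # temp / land
    "speed": rng.normal(size=(n, 1)),  # distance / time
}

# One independent k-means run per feature group.
labels = {name: KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
          for name, feats in groups.items()}

# Each row now carries three cluster ids, which a later DTW/DBSCAN
# pass could consume instead of the raw values.
combined = np.column_stack([labels["loc"], labels["temp"], labels["speed"]])
print(combined.shape)  # (300, 3)
```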

siradam commented 3 years ago

Clusters: trying to cluster the data with k-means (March only), excluding the time and obs columns, with the number of clusters set to 5. Next, I am going to try to cluster the data with DTW/DBSCAN.
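A minimal sketch of that step, assuming columns named `time` and `obs` plus lon/lat features (synthetic data here):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in for the March slice of the dataset.
march = pd.DataFrame({
    "time": np.arange(200),  # placeholder timestamp column
    "obs": np.arange(200),
    "lon": rng.normal(size=200),
    "lat": rng.normal(size=200),
})

# Drop the non-feature columns before clustering, as in the comment above.
features = march.drop(columns=["time", "obs"])
march["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
print(march["cluster"].nunique())  # 5
```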

lorenzznerol commented 3 years ago

Just found by accident: perhaps this is interesting for clustering with many dimensions. Normally this is used to show the clusters in the multi-dimensional output of ML models.

https://umap-learn.readthedocs.io/en/latest/clustering.html could provide an interesting approach for clustering a multi-dimensional dataset (so that we would use more than just lon/lat). The page discusses the problem that k-means draws overly sharp cluster boundaries, chooses HDBSCAN, and names t-SNE as the rival of UMAP (though UMAP seems to be a good choice here as well); UMAP may thus give a fast way of applying DBSCAN. We must still aggregate the data before we do any clustering, as said in #8.
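The sharp-boundary problem of k-means that the page describes can be seen with scikit-learn alone (UMAP and HDBSCAN themselves need the separate `umap-learn` and `hdbscan` packages, so this sketch uses plain DBSCAN on a toy two-moons dataset instead):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means cuts the two interleaved moons with a straight boundary ...
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ... while a density-based method recovers the two curved clusters.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print(n_db_clusters)  # 2
```

On the real high-dimensional data, the page's recipe would be: embed with UMAP first, then run (H)DBSCAN on the low-dimensional embedding.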

lorenzznerol commented 3 years ago

"traclus" (clustering trajectories), paper of Lee at http://hanj.cs.illinois.edu/pdf/sigmod07_jglee.pdf

It was mentioned by Carola in the meeting; we should have a look at it.

siradam commented 3 years ago

TraClus can handle 2D trajectory data containing latitude and longitude; see the existing implementation at https://github.com/apolcyn/traclus_impl
It seems the algorithm treats each data point as an individual trajectory, which might not be correct in our case, since we also need to deal with the time series. However, Junaid and I are working on that. Besides that, DTW is also being tested, and it has an issue as well: it cannot handle the coordinates of the data, so we cannot see the movement path of the trajectories. DTW is based on the distance computation of the time series.

In conclusion, I am still struggling to find the best clustering method for our data, and with the question: what comes next after clustering? One idea is to cluster the data by month and show a picture of the clusters for each month; then we could sum this up with the movement of the trajectories.

siradam commented 3 years ago

DBSCAN_starting_points: clusters at the starting point (first day, ~first 24 hours). However, due to limited computing power, I couldn't run DBSCAN on the whole dataset or even on March alone (~780k data points); my machine can handle ~50-70k data points.
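One workaround under that memory limit is to run DBSCAN on a random subsample that fits the machine; a sketch with synthetic blobs standing in for the March points (the 20k sample size is an assumption, within the ~50-70k budget mentioned above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in for the large March slice: two dense blobs of points.
big = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(50_000, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50_000, 2)),
])

# Draw a subsample small enough for the available memory.
idx = rng.choice(len(big), size=20_000, replace=False)
labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(big[idx])

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2
```

The remaining points could then be assigned to the nearest discovered cluster, at the cost of possibly missing small, rare clusters that the subsample skips.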

KMeans_April: clustering with KMeans in April.
KMeans_May: clustering with KMeans in May.

With KMeans, it's totally fine to run the clustering on the whole dataset.
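If even full k-means ever becomes too slow on the growing dataset, scikit-learn's `MiniBatchKMeans` can consume the data chunk by chunk via `partial_fit`, which would pair naturally with the chunked reading recommended earlier (synthetic chunks here; chunk size and k are placeholders):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=5, random_state=0, n_init=3)

# Feed the data in chunks, as if streaming it from disk or from SQL.
for _ in range(20):
    chunk = rng.normal(size=(1_000, 2))  # stand-in for one chunk of lon/lat
    mbk.partial_fit(chunk)

# The fitted model can then label new (or all) points chunk by chunk too.
labels = mbk.predict(rng.normal(size=(100, 2)))
print(mbk.cluster_centers_.shape)  # (5, 2)
```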