Better clustering - Githubissues

platers commented 2 years ago

Some limitations to the current algorithm

No control over min cluster size
Hacky addition to merge small clusters
Opaque, no source code
Written in 2012

BradKML commented 2 years ago

If you desire better and variable clustering try reimplementing algorithms from https://github.com/GiulioRossetti/cdlib

platers commented 2 years ago

Thanks for the suggestion, I'm not familiar with the field. I might do some more experiments later. For anyone into clustering, I'm happy to help you set up the project so you can experiment.

BradKML commented 2 years ago

@platers could you describe the data for me? Is it "it is recommended that people who watches A also watches B" type of deal? If so can you send in a CSV table with the first column being A, second column being B, and recommendation strength being the third column? I can go from there then.

platers commented 2 years ago

Yup thats exactly what the data is. I preprocess it so the edges are undirected and normalized to mostly between 0-1.

the current clustering algo inputs a txt file that looks like https://pastebin.com/VQBGBKqV

each line is an edge: source target weight

the ids are MAL ids.

The output is a hierarchical clustering. I use one of the higher levels to color the nodes.

FYI I just implemented merging seasons and added Anilist data so the data will change a lot soon.

BradKML commented 2 years ago

Currently we can test with

from pandas import read_csv
from networkx import from_pandas_edgelist
df = read_csv("https://pastebin.com/raw/VQBGBKqV",sep=" ",names=["node_a","node_b","weight"])
G = from_pandas_edgelist(df, "node_a", "node_b", ["weight"])

And then from here we can cluster with the algos https://cdlib.readthedocs.io/en/latest/reference/cd_algorithms/node_clustering.html

With finding the result data with communities in NodeClustering https://cdlib.readthedocs.io/en/latest/reference/classes/node_clustering.html

platers commented 2 years ago

https://pastebin.com/uJhTei7H

Updated edges with anilist data and merging. The new clustering is live now, I think its way better than before!

BradKML commented 2 years ago

Is it possible to make a table with multiple edge weights (based on which sites vs aggregate)? CDLib and others can allow you to select which "layer" (website) ratings.

platers commented 2 years ago

Take a look at https://github.com/platers/MAL-Map/blob/master/data-collection/shows.ts

I also updated the contribution instructions in the readme, so hopefully its easier to modify and run the pipeline now.

BradKML commented 2 years ago

@platers if I am using Python programs, how can I easily transfer the data to TS? CSV on file or JSON API?

A small note is that every clustering algorithm always return as a list of lists, each having differing amounts of Anime IDs.

platers commented 2 years ago

Right now I'm using a java .jar file to cluster. See how I call that and get the output from the txt it generates in TS. https://github.com/platers/MAL-Map/blob/master/data-collection/cluster.ts

You will have to rewrite the code for parsing the clusters, since the jar outputs a different format. https://apiacoa.org/research/software/graph/clustering.txt

BradKML commented 2 years ago

Try running this, with that I also have the result

from pandas import read_csv, DataFrame
from networkx import from_pandas_edgelist
df = read_csv("https://pastebin.com/raw/uJhTei7H",
              sep=" ",names=["node_a", "node_b", "weight"])
G = from_pandas_edgelist(df, "node_a", "node_b", ["weight"])

!pip install cdlib
from cdlib import algorithms

results = [algorithms.cpm(G), algorithms.chinesewhispers(G), 
           algorithms.eigenvector(G), algorithms.greedy_modularity(G), 
           algorithms.infomap(G), algorithms.label_propagation(G), 
           algorithms.leiden(G), algorithms.markov_clustering(G), 
           algorithms.mcode(G), algorithms.paris(G), algorithms.rber_pots(G), 
           algorithms.rb_pots(G), algorithms.significance_communities(G), 
           algorithms.spinglass(G), algorithms.surprise_communities(G), 
           algorithms.walktrap(G)]

def clean(x):
  result = {}
  for index, value in enumerate(x):
    for items in value:
      result[items] = index
  return result

DataFrame([clean(i.communities) for i in results]).to_csv("data.csv",sep="\t")

data_set.csv

The amount of clusters are, in order [1699, 330, 8, 9, 4, 100, 13, 29, 13, 255, 14, 1698, 41]

BradKML commented 2 years ago

A small report on selection: algorithms.markov_clustering(G), algorithms.mcode(G), algorithms.paris(G) has internal coherence issues. algorithms.head_tail(G), algorithms.gdmp2(G), algorithms.kcut(G) is about a minute long algorithms.belief(G), algorithms.ga(G) is explosively slow

# async_fluid, em requires default K
# agdl, der, r_spectral_clustering, scan, spectral requires more parameters
# edmot, lswl, mod_* documentation unclear

# %time algorithms.gemsec(G) buggy
# %time algorithms.lswl_plus(G) buggy
# %time algorithms.pycombo(G) buggy
# %time algorithms.ricci_community(G) buggy
# %time algorithms.sbm_dl(G) buggy
# %time algorithms.sbm_dl_nested(G) buggy
# %time algorithms.scd(G) buggy
# %time algorithms.threshold_clustering(G) buggy

platers commented 2 years ago

Is each column a different algorithm? And each clustering is a single level? I can make it work, but the coloring and layout algorithms wont work well.

BradKML commented 2 years ago

@platers each column is a different algorithmic result, yes. Should I relabel it such that there is a header for each algorithm?

platers commented 2 years ago

Ok got it, is there a way to generate multilevel clusters? The frontend depends on it. Maybe you can recursively run the clustering algorithm with the clusters as nodes?

BradKML commented 2 years ago

@platers some of the algorithms are recursive be default, or has a K factor that gives you options. Will need to recheck.

BradKML commented 2 years ago

there are these three that does direct hierarchical clustering, but will double check through installing extra Python libraries.

BradKML commented 2 years ago

Currently Girvan-Newman, Infomap, hSBM, and Paris are the only algorithms currently available for direct hierarchies, but CDLib does not support it.

Alternatively, Proximity Preserving Node Embedding in KarateClub, and its combination with clustering, can be used. https://karateclub.readthedocs.io/en/latest/modules/root.html#neighbourhood-based-node-embedding

@platers The major problem is how can one define resolution from hierarchical clustering, assuming one can slide up and down the resolution? https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering This will evolve into something requiring multiple parameters:

Embedding algorithms
Dimensions needed for the embedding
Hierarchical clustering algorithm
Depth of clustering

platers commented 2 years ago

This is getting complicated enough that it sounds like it should be its own package. Would you be interested in working on a npm clustering package? I can clean up my code to make the API more explicit.

BradKML commented 2 years ago

The API should be way more explicit, but at the same IDK about how input the NPM would provide for the Python code.

platers commented 2 years ago

Currently Girvan-Newman, Infomap, hSBM, and Paris are the only algorithms currently available for direct hierarchies, but CDLib does not support it.

Alternatively, Proximity Preserving Node Embedding in KarateClub, and its combination with clustering, can be used. https://karateclub.readthedocs.io/en/latest/modules/root.html#neighbourhood-based-node-embedding

@platers The major problem is how can one define resolution from hierarchical clustering, assuming one can slide up and down the resolution? https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering This will evolve into something requiring multiple parameters:

Embedding algorithms

Dimensions needed for the embedding

Hierarchical clustering algorithm

Depth of clustering

@BrandonKMLee What do you mean those algorithms aren't supported by CDLib? Why are they in the documentation?

platers commented 2 years ago

@BrandonKMLee I replaced the java clustering with cdlib https://github.com/platers/MAL-Map/blob/cdlib/data-collection/cluster.py

Its called in clusters.ts

I just picked a random algorithm, it should be easy to experiment now.

BradKML commented 2 years ago

@platers thanks for the work, this trims down a lot of work, will patch up some functions so that they can be swapped around (some algos are more unweildy so yeah)

platers commented 2 years ago

Using infomap in cdlb

platers / MAL-Map

Better clustering #3