online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License
4.91k stars 538 forks source link

Returning 0 clusters regardless of data (even using examples from documentation) #1466

Open shiraeisenberg opened 7 months ago

shiraeisenberg commented 7 months ago

Versions

river version: installed from github Python version: 3.11 Operating system: Mac OS

Describe the bug

With openai embeddings, Denstream is returning 0 clusters regardless of the set parameters.

Steps/code to reproduce

from schema import db, Post, TiktokVideo, Topic, Subtopic, ResponseCampaign, TiktokVideoHashtag, TiktokQuery
import time
from river import cluster, stream
import requests
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

initial_sample = 1000

denstream = cluster.DenStream(decaying_factor=0.01,
                              beta=0.5,
                              mu=2.5,
                              epsilon=0.5,
                              n_samples_init=10)

# dbstream = cluster.DBSTREAM(
#     clustering_threshold=5,
#     fading_factor=0.05,
#     cleanup_interval=40,
#     intersection_factor=0.5,
#     minimum_weight=5
# )

# clustream = cluster.CluStream(
#     n_macro_clusters=30,
#     max_micro_clusters=300,
#     time_gap= 100,
#     time_window = 10000,
#     seed = 42
# )

def get_initial_tiktok_videos():
    try:
        videos = TiktokVideo.select().limit(initial_sample)
        ids = [video.id for video in videos]
        # print("IDs: ", ids)
        tiktok_ids = [video.tiktok_id for video in videos]
        # print("Tiktok IDs: ", tiktok_ids)
        return ids, tiktok_ids
    except TiktokVideo.DoesNotExist:
        print("No Tiktok videos found.")
        return None, None

def get_stream_videos(limit=100, page=10):
    try:
        videos = TiktokVideo.select().limit(limit).offset(page*limit)
        ids = [video.id for video in videos]
        # print("IDs: ", ids)
        tiktok_ids = [video.tiktok_id for video in videos]
        # print("Tiktok IDs: ", tiktok_ids)
        return ids, tiktok_ids
    except TiktokVideo.DoesNotExist:
        print("No Tiktok videos found.")
        return None, None

def fetch_embeddings_from_qdrant(ids):
    try:
        embeddings = []
        for id in ids:
            url = f"http://localhost:6333/collections/raqia_test_collection/points/{id}"
            response = requests.get(url)
            if response.status_code == 200:
                embedding = response.json()['result']['vector']
                # print(embedding)
                embeddings.append(embedding)
        return embeddings
    except Exception as e:
        print(f"Error: {e}")
        return None

if __name__ == "__main__":
    # get_initial_tiktok_videos()
    # get_stream_videos()
    # fetch_embeddings_from_qdrant([0,1])

    initial_set = get_initial_tiktok_videos()[0]
    initial_embeddings = fetch_embeddings_from_qdrant(initial_set)
    for x, _ in stream.iter_array(initial_embeddings):
        # print(x)
        denstream = denstream.learn_one(x)
        print(denstream.n_clusters)
        print(denstream.clusters)
        # dbstream = dbstream.learn_one(x)
        # clustream = clustream.learn_one(x)
        # print(dbstream.n_clusters)
        # print(clustream.centers)
    print(denstream.n_clusters)
    print(denstream.clusters)
    # print(dbstream.n_clusters)
    # print(dbstream.clusters)
    # print(dbstream.centers)
    # print(clustream.centers)
    time.sleep(60)

    while True:
        stream_set = get_stream_videos()[0]
        embeddings = fetch_embeddings_from_qdrant(stream_set)
        for embedding, _ in stream.iter_array(embeddings):
            denstream = denstream.learn_one(embedding)
            # dbstream = dbstream.learn_one(embedding)
            # clustream = clustream.learn_one(embedding)
        # print(denstream.n_clusters)
        # print(denstream.clusters)
        time.sleep(60*5)
hoanganhngo610 commented 7 months ago

Thank you so much for raising the issue @shiraeisenberg. I will have a look into this as soon as possible.

hoanganhngo610 commented 7 months ago

@shiraeisenberg Would you mind giving me the value of your denstream._n_samples_seen? Usually, from my personal experience, DenStream would not return any clusters for approximately 100 first observations. As such, we would usually use 100-1000 first samples as a burn-in, i.e let the model learn without actually requiring any predictions.

nipunagarwala commented 4 months ago

@hoanganhngo610 I see this too. I took the example in the documentation, and re-ran that. If we do not add something like denstream.predict_one({0: -1, 1: -2}) , then n_clusters is always 0.

hoanganhngo610 commented 4 months ago

Dear @nipunagarwala. Thank you very much for your response.

If you look closely into the source code of DenStream and the example within the documentation, you can notice that the learn_one only creates p-micro-clusters and o-micro-clusters, but only when predict_one is called for the first time, the final clusters will be generated and the number of clusters for the solution will be known.

As such, in the example within the documentation, if you retrieve the number of o_micro_clusters and p_micro_clusters by the following commands, the results will be 0 and 2 respectively:

>>> len(denstream.o_micro_clusters
0
>>> len(denstream.p_micro_clusters
2

This philosophy is adopted since we believe that the cluster solutions, or final cluster centers, should only be generated if the command is called since it's extremely computationally expensive; else,only o-micro-clusters and p-micro-clusters will be updated.

Hope that this answer is clear and helpful to you.