lilacai / lilac

Curate better data for LLMs
http://lilacml.com
Apache License 2.0

Error while clustering #1196

Open Xirider opened 7 months ago

Xirider commented 7 months ago

After starting the clustering I get this error:

[local/evol1][1 shards] map "extract_text" to "('prompt__cluster',)": 100%|████████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 12033.30it/s]
Wrote map output to prompt__cluster-00000-of-00001.parquet
[local/evol1][1 shards] map "cluster_documents" to "('prompt__cluster',)":   0%|                                                                                        | 0/319 [00:00<?, ?it/s]jinaai/jina-embeddings-v2-small-en using device: mps:0
Computing embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319/319 [00:22<00:00, 14.27it/s]
Computing embeddings took 27.761s.
/Users/peter/miniconda3/envs/vis/lib/python3.11/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
UMAP: Reducing dim from 512 to 5 of 319 vectors took 2.297s.
HDBSCAN: Clustering took 0.005s.
99 noise points (31.0%) will be assigned to nearest cluster.
HDBSCAN: Computing membership for the noise points took 0.004s.
[local/evol1][1 shards] map "cluster_documents" to "('prompt__cluster',)": 100%|██████████████████████████████████████████████████████████████████████████████| 319/319 [00:32<00:00,  9.89it/s]
Wrote map output to prompt__cluster-00000-of-00001.parquet
[local/evol1][1 shards] map "title_clusters" to "('prompt__cluster',)":   0%|                                                                                           | 0/319 [00:00<?, ?it/s]
Error code: 400 - {'error': {'message': 'you must provide a model parameter', 'type': 'invalid_request_error', 'param': None, 'code': None}}

vanduc103 commented 6 months ago

Hi. For anyone who runs into this problem (as I did): you need to set the environment variable "API_MODEL" to an OpenAI model (GPT-3.5 or GPT-4). The error comes from the OpenAI API call made in the "title_clusters" step, which is sent without a model name when that variable is unset. By the way, does anyone know how to set up the complete .env file for the project? Thanks!
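As a rough illustration of that workaround, the variable can be set from Python before clustering is started. This is only a sketch: the helper name `ensure_api_model` is hypothetical and not part of Lilac, and the default value "gpt-3.5-turbo" is an assumed model identifier, so substitute whichever model your account can use. Exporting the variable in the shell or putting it in the .env file should have the same effect.

```python
import os


def ensure_api_model(default: str = "gpt-3.5-turbo") -> str:
    """Return the value of API_MODEL, setting it to `default` if unset or empty.

    Hypothetical helper for illustration; the default model name is an assumption.
    """
    model = os.environ.get("API_MODEL", "").strip()
    if not model:
        # Nothing configured yet: fall back to the assumed default.
        os.environ["API_MODEL"] = default
        model = default
    return model


# Call this before starting the clustering job in the same process.
model = ensure_api_model()
print(f"API_MODEL={model}")
```

A value that is already set is left untouched, so an existing shell or .env configuration still takes precedence.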

FrederikHandberg commented 4 months ago

I get the same issue when just using gte small to cluster a dataset.