Type "help", "copyright", "credits" or "license" for more information.
>>> from InstructorEmbedding import INSTRUCTOR
>>> model = INSTRUCTOR('hkunlp/instructor-xl')
Downloading (…)7f436/.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 2.78MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:00<00:00, 3.28MB/s]
Downloading (…)/2_Dense/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 689kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3.15M/3.15M [00:00<00:00, 12.6MB/s]
Downloading (…)0daf57f436/README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 66.3k/66.3k [00:00<00:00, 12.7MB/s]
Downloading (…)af57f436/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 5.80MB/s]
Downloading (…)ce_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 140kB/s]
Downloading pytorch_model.bin: 11%|███████████▎ | 545M/4.96G [00:14<01:56, 38.1MB/s]
>>> sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
>>> instruction = "Represent the Science title:"
>>> embeddings = model.encode([[instruction,sentence]])
>>> embeddings
array([[ 1.07386056e-02, 2.03883853e-02, -3.30800918e-04,
-2.47166920e-02, -4.76301350e-02, -4.68175821e-02,
>>> type(embeddings)
<class 'numpy.ndarray'>
>>> type(embeddings[0])
<class 'numpy.ndarray'>
>>> len(embeddings[0])
768
What to do with that instruction? Can these things be compared if they were embedded with different instructions? Does this mean a collection of embeddings in LLM should store its instruction too?
It looks like those embeddings can be compared even if they were created with different instructions. And you can use them for clustering too.
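One way to sanity-check that (a hypothetical experiment, reusing the model and sentence from above): embed the same sentence under two different instructions and measure the cosine similarity directly:

import numpy as np

a = model.encode([["Represent the Science title:", sentence]])[0]
b = model.encode([["Represent the blog entry:", sentence]])[0]
# INSTRUCTOR's pipeline ends with a Normalize() layer, so the vectors are
# unit length and the plain dot product is the cosine similarity
print(np.dot(a, b))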
>>> import sklearn.cluster
>>> sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
... ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
... ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
... ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
... ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium']]
>>> embeddings = model.encode(sentences)
>>> clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
>>> clustering_model.fit(embeddings)
/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:1930: FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
super()._check_params_vs_input(X, default_n_init=3)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
MiniBatchKMeans(n_clusters=2)
>>> cluster_assignment = clustering_model.labels_
>>> print(cluster_assignment)
[0 0 1 0 0]
>>>
>>> len(cluster_assignment)
5
>>> type(cluster_assignment)
<class 'numpy.ndarray'>
>>> cluster_assignment[0]
0
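Both of those warnings are easy to quiet, if the messages are to be believed: set the tokenizers environment variable before forking, and pass n_init explicitly:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # per the tokenizers warning

# Passing n_init explicitly silences the scikit-learn FutureWarning
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2, n_init=3)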
Try against my TILs.
>>> import sqlite_utils
>>> db = sqlite_utils.Database("/tmp/tils.db")
>>> tils = list(db.query('select path, title || " " || body as text from til'))
>>> instruction = 'Represent the blog entry:'
>>> embeddings = model.encode([[instruction, til["text"]] for til in tils])
This has taken over a minute so far. Not sure how long it will take.
Using about 400% of CPU:
>>> model
INSTRUCTOR(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
(2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(3): Normalize()
)
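That repr explains the 768 seen earlier: the Dense layer projects the T5 encoder's 1024-dimensional pooled output down to 768, and the final Normalize() means every vector should come out with unit length, which is easy to check:

import numpy as np
print(np.linalg.norm(embeddings[0]))  # should be ~1.0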
I should have timed embedding one blog entry first.
Once this finishes I'm going to try clustering them.
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=10)
clustering_model.fit(embeddings)
Then try to map that back to the original TILs in the list and dump out the paths and titles for each cluster.
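A sketch of that mapping step (the labels_ array comes back in the same order as the input, so it can be zipped with the original rows):

from collections import defaultdict

by_cluster = defaultdict(list)
for label, til in zip(clustering_model.labels_, tils):
    by_cluster[int(label)].append(til["path"])
for label, paths in sorted(by_cluster.items()):
    print(label, paths)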
Tried a micro-benchmark and it took 3s to do one entry:
>>> tils = list(db.query('select path, title || " " || body as text from til'))
>>> instruction = 'Represent the blog entry:'
>>>
>>> tils = tils[:1]
>>> len(tils)
1
>>> import time
>>> start = time.time(); embeddings = model.encode([[instruction, til["text"]] for til in tils]); end = time.time();
>>> end - start
3.0543160438537598
So 450 entries should take 1350s = 22.5 minutes.
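If I re-run this, sentence-transformers' encode() takes batch_size and show_progress_bar arguments (INSTRUCTOR subclasses SentenceTransformer, so they should apply here too):

embeddings = model.encode(
    [[instruction, til["text"]] for til in tils],
    batch_size=8,            # trades memory for throughput
    show_progress_bar=True,  # shows how far along the run is
)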
>>> cluster_assignment = clustering_model.labels_
>>> cluster_assignment
array([8, 3, 6, 5, 6, 9, 1, 1, 6, 2, 1, 1, 1, 6, 6, 5, 6, 6, 6, 5, 6, 8,
6, 5, 6, 5, 6, 6, 5, 5, 8, 6, 2, 4, 7, 1, 1, 7, 1, 1, 7, 4, 6, 5,
4, 0, 0, 1, 4, 9, 8, 8, 3, 3, 3, 2, 9, 5, 4, 2, 2, 2, 2, 2, 2, 2,
2, 8, 4, 1, 3, 8, 1, 2, 6, 5, 1, 5, 7, 1, 6, 3, 8, 6, 0, 6, 1, 8,
5, 8, 5, 6, 5, 5, 1, 1, 1, 6, 3, 3, 6, 7, 4, 0, 1, 5, 3, 6, 5, 7,
9, 9, 9, 1, 1, 9, 5, 1, 6, 5, 1, 5, 5, 1, 5, 7, 8, 7, 7, 7, 6, 6,
6, 7, 4, 4, 5, 3, 7, 7, 7, 4, 5, 8, 7, 7, 5, 7, 6, 8, 4, 8, 7, 7,
7, 4, 7, 6, 8, 6, 1, 7, 9, 1, 8, 4, 6, 8, 4, 1, 4, 1, 9, 0, 1, 5,
9, 7, 9, 6, 1, 5, 1, 7, 8, 7, 4, 3, 3, 2, 4, 6, 7, 3, 4, 3, 6, 1,
2, 3, 6, 4, 6, 4, 8, 6, 1, 6, 4, 4, 3, 2, 1, 5, 2, 6, 1, 5, 1, 5,
5, 4, 2, 4, 3, 4, 6, 1, 6, 8, 5, 7, 4, 9, 5, 3, 4, 6, 0, 6, 7, 3,
6, 1, 8, 4, 0, 0, 5, 1, 6, 9, 1, 8, 6, 7, 7, 3, 2, 9, 3, 4, 6, 3,
5, 6, 6, 4, 1, 9, 9, 4, 5, 1, 1, 4, 2, 1, 2, 3, 6, 4, 4, 6, 4, 5,
1, 6, 1, 1, 4, 1, 2, 2, 3, 3, 6, 6, 6, 6, 1, 2, 6, 7, 4, 1, 4, 8,
9, 6, 9, 7, 6, 7, 1, 6, 9, 6, 2, 9, 1, 1, 7, 1, 6, 9, 2, 4, 8, 0,
9, 4, 9, 6, 8, 4, 6, 4, 6, 1, 1, 2, 6, 2, 4, 4, 4, 5, 5, 8, 2, 2,
8, 6, 8, 6, 6, 6, 0, 8, 6, 1, 7, 7, 6, 0, 9, 7, 1, 8, 6, 2, 7, 6,
7, 7, 9, 6, 6, 4, 7, 4, 7, 4, 9, 4, 6, 3, 9, 9, 8, 0, 4, 4, 9, 2,
3, 8, 4, 9, 1, 3, 1, 7, 9, 6, 9, 6, 6, 3, 9, 6, 3, 3, 1, 6, 9, 4,
7, 4, 1, 9, 5, 2, 5, 6, 2, 4, 4, 5, 5, 6, 4, 2, 6, 3, 1, 6, 5, 9,
3, 5, 8, 6, 3, 2, 1, 5, 8, 6, 5, 1, 3, 2, 1, 1, 1, 9], dtype=int32)
>>> paths_and_titles = list(db.query('select path, title from til'))
>>> len(paths_and_titles), paths_and_titles[0]
(458, {'path': 'svg_dynamic-line-chart.md', 'title': 'Creating a dynamic line chart with SVG'})
>>> for cluster, p in zip(cluster_assignment, paths_and_titles):
...     p["cluster"] = int(cluster)
...
>>> paths_and_titles[0]
{'path': 'svg_dynamic-line-chart.md', 'title': 'Creating a dynamic line chart with SVG', 'cluster': 8}
>>> sqlite_utils.Database("/tmp/clusters.db")["clusters"].insert_all(paths_and_titles, pk="path")
<Table clusters (path, title, cluster)>
Without the int(cluster) you get binary data in the database, because it's a numpy int32 and not a Python integer.
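A minimal illustration of the difference (hypothetical values):

import numpy as np
label = np.int32(8)
print(type(label))       # <class 'numpy.int32'> - ends up as binary in the database
print(type(int(label)))  # <class 'int'> - stored as a regular SQLite INTEGER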
Got more interesting results with 20 clusters instead of 10.
I think I'll turn part of this into an llm-cluster plugin.
Tried naming them like this:
# cluster_db is the clusters database from above; chatgpt is an LLM client
# wrapper, e.g. llm.get_model("gpt-3.5-turbo") from the llm Python library
cluster_db = sqlite_utils.Database("/tmp/clusters.db")
clusters = list(cluster_db.query("select group_concat(title, ', ') as titles from clusters3 group by cluster"))
for cluster in clusters:
    print(chatgpt.prompt(cluster['titles'], system="A short name for this cluster of articles").text())
Got this:
Python on macOS Catalina
Geospatial Data and Analysis with SQLite
Cloud Operations
SQLite Tips and Tricks
macOS Development Tools and Techniques
Docker and DevOps Toolbox
GitHub Actions and Git Operations
Python Packaging and Development
Bash Troubleshooting and Useful Commands
Python Article Cluster
JSON Data Manipulation
Tech Topics
"PostgreSQL in Django Admin: Full-Text Search, Read-Only Access, Timezone Display, Bulk Deletions, and More"
JavaScript Tooling and Techniques
GitHub GraphQL API with Python
Testing and Mocking Techniques in Python
"Developer Tool and Integration Tutorials"
Web Development Tips
Tech Tips and Recipes
SQLite Functions and Queries
So variable quality! "Tech Topics" is pretty bad.
One challenge here is that each prompt is separate, so it might come up with the same vague title more than once.
Maybe try cramming the whole lot into a single prompt?
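A sketch of that single-prompt version, reusing the chatgpt wrapper from above (hypothetical prompt wording):

# Number every cluster's titles and ask for one distinct name per cluster
prompt = "\n".join(
    f"Cluster {i}: {cluster['titles']}" for i, cluster in enumerate(clusters)
)
print(chatgpt.prompt(
    prompt,
    system="Suggest a short, distinct name for each numbered cluster of articles, one per line"
).text())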
The "Geospatial Data and Analysis with SQLite" one was good though - it consisted of these articles (I put an X next to the ones that didn't seem as good a fit):
https://huggingface.co/hkunlp/instructor-xl
Also interesting: https://alex.macrocosm.so/download