simonw / public-notes

Public notes as issue threads

Try Instructor-XL #15

Open simonw opened 1 year ago

simonw commented 1 year ago

https://huggingface.co/hkunlp/instructor-xl

Also interesting: https://alex.macrocosm.so/download

simonw commented 1 year ago
Type "help", "copyright", "credits" or "license" for more information.
>>> from InstructorEmbedding import INSTRUCTOR
>>> model = INSTRUCTOR('hkunlp/instructor-xl')
Downloading (…)7f436/.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 2.78MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:00<00:00, 3.28MB/s]
Downloading (…)/2_Dense/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 689kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3.15M/3.15M [00:00<00:00, 12.6MB/s]
Downloading (…)0daf57f436/README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 66.3k/66.3k [00:00<00:00, 12.7MB/s]
Downloading (…)af57f436/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 5.80MB/s]
Downloading (…)ce_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 140kB/s]
Downloading pytorch_model.bin:  11%|███████████▎                                                                                           | 545M/4.96G [00:14<01:56, 38.1MB/s]
simonw commented 1 year ago
>>> sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
>>> instruction = "Represent the Science title:"
>>> embeddings = model.encode([[instruction,sentence]])
>>> embeddings
array([[ 1.07386056e-02,  2.03883853e-02, -3.30800918e-04,
        -2.47166920e-02, -4.76301350e-02, -4.68175821e-02,
        ...

>>> type(embeddings)
<class 'numpy.ndarray'>
>>> type(embeddings[0])
<class 'numpy.ndarray'>
>>> len(embeddings[0])
768
simonw commented 1 year ago

What to do about that instruction? Can two embeddings be compared if they were created with different instructions?

Does this mean a collection of embeddings in LLM should store its instruction too?
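
If so, a collection record could carry its instruction alongside the embeddings. A sketch of a hypothetical schema using sqlite-utils (the table layout here is invented for illustration):

import sqlite_utils

db = sqlite_utils.Database("embeddings.db")
# Hypothetical: store the instruction once per collection, so future
# queries can embed with the same prefix before comparing vectors
db["collections"].insert(
    {"name": "tils", "instruction": "Represent the blog entry:"},
    pk="name",
    replace=True,
)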

simonw commented 1 year ago

No, it looks like the embeddings can be compared even if they were created with different instructions. And you can use them for clustering too - there's a quick similarity check after the clustering example below.

>>> import sklearn.cluster
>>> sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
...              ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
...              ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
...              ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
...              ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
>>> embeddings = model.encode(sentences)
>>> clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
>>> clustering_model.fit(embeddings)
/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:1930: FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=3)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
MiniBatchKMeans(n_clusters=2)
>>> cluster_assignment = clustering_model.labels_
>>> print(cluster_assignment)
[0 0 1 0 0]
>>> 
>>> len(cluster_assignment)
5
>>> type(cluster_assignment)
<class 'numpy.ndarray'>
>>> cluster_assignment[0]
0
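
As that quick similarity check, this sketch (the instruction wording is mine) embeds the earlier sentence under two different instructions and compares the results directly. The model's pipeline ends in a Normalize() step, so the dot product of two output vectors is their cosine similarity:

import numpy as np

# Re-embed the same sentence under two different instructions
e1 = model.encode([["Represent the Science title:", sentence]])[0]
e2 = model.encode([["Represent the Medicine sentence for clustering: ", sentence]])[0]
# Output vectors are unit length, so a dot product is cosine similarity
print(np.dot(e1, e2))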
simonw commented 1 year ago

Try against my TILs.

>>> import sqlite_utils
>>> db = sqlite_utils.Database("/tmp/tils.db")
>>> tils = list(db.query('select path, title || " " || body as text from til'))
>>> instruction = 'Represent the blog entry:'
>>> embeddings = model.encode([[instruction, til["text"]] for til in tils])

This has taken over a minute so far. Not sure how long it will take.

simonw commented 1 year ago

Using about 400% of CPU:

[Screenshot: CleanShot 2023-08-28 at 10 21 10@2x]
simonw commented 1 year ago
>>> model
INSTRUCTOR(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)
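
That trailing Normalize() module means every embedding should come out unit length - a quick check (a sketch, reusing the embeddings from above):

import numpy as np

# The pipeline ends in Normalize(), so each vector's L2 norm should be ~1.0
print(np.linalg.norm(embeddings[0]))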
simonw commented 1 year ago

I should have timed embedding one blog entry first.

simonw commented 1 year ago

Once this finishes I'm going to try clustering them.

clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=10)
clustering_model.fit(embeddings)

Then try to map that back to the original TILs in the list and dump out the paths and titles for each cluster.
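
A sketch of that mapping step (the tils rows from earlier only have path and text keys, so this prints paths):

from collections import defaultdict

# Group TIL paths by their assigned cluster label
by_cluster = defaultdict(list)
for til, label in zip(tils, clustering_model.labels_):
    by_cluster[int(label)].append(til["path"])
for label, paths in sorted(by_cluster.items()):
    print(label, paths[:5])  # first few paths in each cluster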

simonw commented 1 year ago

Tried a micro-benchmark and it took 3s to do one entry:

>>> tils = list(db.query('select path, title || " " || body as text from til'))
>>> instruction = 'Represent the blog entry:'
>>> 
>>> tils = tils[:1]
>>> len(tils)
1
>>> import time
>>> start = time.time(); embeddings = model.encode([[instruction, til["text"]] for til in tils]); end = time.time();
>>> end - start
3.0543160438537598

So 450 entries should take 1350s = 22.5 minutes.
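
INSTRUCTOR subclasses SentenceTransformer, so encode() should accept the usual batch_size and show_progress_bar arguments (an assumption based on the sentence-transformers API) - a progress bar would at least make the wait visible:

# Assumes the standard sentence-transformers encode() signature applies
embeddings = model.encode(
    [[instruction, til["text"]] for til in tils],
    batch_size=16,
    show_progress_bar=True,
)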

simonw commented 1 year ago
>>> cluster_assignment = clustering_model.labels_
>>> cluster_assignment
array([8, 3, 6, 5, 6, 9, 1, 1, 6, 2, 1, 1, 1, 6, 6, 5, 6, 6, 6, 5, 6, 8,
       6, 5, 6, 5, 6, 6, 5, 5, 8, 6, 2, 4, 7, 1, 1, 7, 1, 1, 7, 4, 6, 5,
       4, 0, 0, 1, 4, 9, 8, 8, 3, 3, 3, 2, 9, 5, 4, 2, 2, 2, 2, 2, 2, 2,
       2, 8, 4, 1, 3, 8, 1, 2, 6, 5, 1, 5, 7, 1, 6, 3, 8, 6, 0, 6, 1, 8,
       5, 8, 5, 6, 5, 5, 1, 1, 1, 6, 3, 3, 6, 7, 4, 0, 1, 5, 3, 6, 5, 7,
       9, 9, 9, 1, 1, 9, 5, 1, 6, 5, 1, 5, 5, 1, 5, 7, 8, 7, 7, 7, 6, 6,
       6, 7, 4, 4, 5, 3, 7, 7, 7, 4, 5, 8, 7, 7, 5, 7, 6, 8, 4, 8, 7, 7,
       7, 4, 7, 6, 8, 6, 1, 7, 9, 1, 8, 4, 6, 8, 4, 1, 4, 1, 9, 0, 1, 5,
       9, 7, 9, 6, 1, 5, 1, 7, 8, 7, 4, 3, 3, 2, 4, 6, 7, 3, 4, 3, 6, 1,
       2, 3, 6, 4, 6, 4, 8, 6, 1, 6, 4, 4, 3, 2, 1, 5, 2, 6, 1, 5, 1, 5,
       5, 4, 2, 4, 3, 4, 6, 1, 6, 8, 5, 7, 4, 9, 5, 3, 4, 6, 0, 6, 7, 3,
       6, 1, 8, 4, 0, 0, 5, 1, 6, 9, 1, 8, 6, 7, 7, 3, 2, 9, 3, 4, 6, 3,
       5, 6, 6, 4, 1, 9, 9, 4, 5, 1, 1, 4, 2, 1, 2, 3, 6, 4, 4, 6, 4, 5,
       1, 6, 1, 1, 4, 1, 2, 2, 3, 3, 6, 6, 6, 6, 1, 2, 6, 7, 4, 1, 4, 8,
       9, 6, 9, 7, 6, 7, 1, 6, 9, 6, 2, 9, 1, 1, 7, 1, 6, 9, 2, 4, 8, 0,
       9, 4, 9, 6, 8, 4, 6, 4, 6, 1, 1, 2, 6, 2, 4, 4, 4, 5, 5, 8, 2, 2,
       8, 6, 8, 6, 6, 6, 0, 8, 6, 1, 7, 7, 6, 0, 9, 7, 1, 8, 6, 2, 7, 6,
       7, 7, 9, 6, 6, 4, 7, 4, 7, 4, 9, 4, 6, 3, 9, 9, 8, 0, 4, 4, 9, 2,
       3, 8, 4, 9, 1, 3, 1, 7, 9, 6, 9, 6, 6, 3, 9, 6, 3, 3, 1, 6, 9, 4,
       7, 4, 1, 9, 5, 2, 5, 6, 2, 4, 4, 5, 5, 6, 4, 2, 6, 3, 1, 6, 5, 9,
       3, 5, 8, 6, 3, 2, 1, 5, 8, 6, 5, 1, 3, 2, 1, 1, 1, 9], dtype=int32)
simonw commented 1 year ago
>>> paths_and_titles = list(db.query('select path, title from til'))
>>> len(paths_and_titles), paths_and_titles[0]
(458, {'path': 'svg_dynamic-line-chart.md', 'title': 'Creating a dynamic line chart with SVG'})
>>> for cluster, p in zip(cluster_assignment, paths_and_titles):
...   p["cluster"] = int(cluster)
... 
>>> paths_and_titles[0]
{'path': 'svg_dynamic-line-chart.md', 'title': 'Creating a dynamic line chart with SVG', 'cluster': 8}
>>> sqlite_utils.Database("/tmp/clusters.db")["clusters"].insert_all(paths_and_titles, pk="path")
<Table clusters (path, title, cluster)>

Without the int(cluster) conversion you get binary data in the database, because each label is a NumPy int32 rather than a Python integer.
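
An alternative is to convert the whole array up front - NumPy's tolist() returns plain Python ints:

cluster_assignment = clustering_model.labels_.tolist()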

simonw commented 1 year ago

Got more interesting results with 20 clusters instead of 10.

I think I'll turn part of this into an llm-cluster plugin.

simonw commented 1 year ago

Tried naming them like this (the three setup lines assume chatgpt was a gpt-3.5-turbo model obtained via the llm Python API):

import llm
cluster_db = sqlite_utils.Database("/tmp/clusters.db")
chatgpt = llm.get_model("gpt-3.5-turbo")  # assumption: the model behind "chatgpt"
clusters = list(cluster_db.query("select group_concat(title, ', ') as titles from clusters3 group by cluster"))
for cluster in clusters:
    print(chatgpt.prompt(cluster['titles'], system="A short name for this cluster of articles").text())

Got this:

Python on macOS Catalina
Geospatial Data and Analysis with SQLite
Cloud Operations
SQLite Tips and Tricks
macOS Development Tools and Techniques
Docker and DevOps Toolbox
GitHub Actions and Git Operations
Python Packaging and Development
Bash Troubleshooting and Useful Commands
Python Article Cluster
JSON Data Manipulation
Tech Topics
"PostgreSQL in Django Admin: Full-Text Search, Read-Only Access, Timezone Display, Bulk Deletions, and More"
JavaScript Tooling and Techniques
GitHub GraphQL API with Python
Testing and Mocking Techniques in Python
"Developer Tool and Integration Tutorials"
Web Development Tips
Tech Tips and Recipes
SQLite Functions and Queries
simonw commented 1 year ago

So variable quality! "Tech Topics" is pretty bad.

One challenge here is that each prompt is separate, so it might come up with the same vague title more than once.

Maybe try and cram the whole lot in a single prompt?
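
Something along these lines might work (the prompt wording here is a guess):

# Hypothetical single-prompt variant: name every cluster in one call
combined = "\n\n".join(
    "Cluster {}: {}".format(i, c["titles"]) for i, c in enumerate(clusters)
)
print(chatgpt.prompt(
    combined,
    system="Suggest a distinct short name for each numbered cluster of article titles",
).text())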

simonw commented 1 year ago

The "Geospatial Data and Analysis with SQLite" one was good though - it consisted of these articles (I put an X next to the ones that didn't seem as good a fit):

simonw commented 1 year ago

Plugin: https://github.com/simonw/llm-cluster