opentensor / validators

Repository for bittensor validators
https://www.bittensor.com/
MIT License

Write embeddings to disk for offline analysis #130

Closed steffencruz closed 11 months ago

steffencruz commented 1 year ago

This PR appends batches of completions and their embeddings to a file. This enables offline analysis of the embeddings and saves time, since they do not need to be recalculated each time we run an analysis.

Embeddings are useful for topic modelling and drift detection. This will help us to monitor the effectiveness of relevance and diversity models.
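As a rough illustration of the drift-detection use case, one simple possibility is to compare the mean embedding of a reference batch against that of a later batch. This is only a sketch: the function names (centroid, cosine_distance, batch_drift) are hypothetical, embeddings are assumed to be plain lists of floats, and this is not necessarily how the validators measure drift.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def batch_drift(reference, current):
    """Hypothetical drift score: cosine distance between the batch centroids."""
    return cosine_distance(centroid(reference), centroid(current))
```

A score near 0 means the two batches point in a similar average direction, while larger values suggest the completions have shifted. Any threshold for flagging drift would have to be chosen empirically.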

This can be made more configurable by adding an embedding_path argument to the class; setting it to None would suppress the behaviour.
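A minimal sketch of how such an argument might look is shown here. The class name EmbeddingLogger and the method log_batch are hypothetical and do not come from the repository; only the embedding_path argument and the "None disables writing" behaviour come from the description above. The line format matches the experiment in this PR.

```python
from typing import Optional


class EmbeddingLogger:
    """Hypothetical helper that appends 'embedding: completion' lines to a file."""

    def __init__(self, embedding_path: Optional[str] = "embeddings.txt"):
        # embedding_path=None switches the writing off entirely
        self.embedding_path = embedding_path

    def log_batch(self, embeddings, completions) -> int:
        """Append one line per (embedding, completion) pair; return lines written."""
        if self.embedding_path is None:
            return 0
        written = 0
        with open(self.embedding_path, "a") as f:
            for embedding, completion in zip(embeddings, completions):
                # accept torch tensors (via .tolist()) or plain sequences of floats
                values = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
                f.write(f"{values}: {completion}\n")
                written += 1
        return written
```

Note that this line-based format assumes completions contain no newline characters; a real implementation would probably need to escape them or use a structured format.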

Simple Experiment

A unit test will also be added that does approximately the following:

In [79]: import ast
    ...: import re
    ...: import string
    ...: 
    ...: import pandas as pd
    ...: import torch
    ...: 
    ...: path = 'embeddings.txt'
    ...: batch_size = 10
    ...: emb_size = 50
    ...: max_seq_len = 100
    ...: vocab = list(string.ascii_lowercase) + [' ', '.', ',']
    ...: 
    ...: # create a synthetic batch of completions and embeddings
    ...: embeddings = [torch.randn(emb_size) for _ in range(batch_size)]
    ...: lengths = torch.randint(max_seq_len, (batch_size,)).tolist()
    ...: completions = [''.join(vocab[i] for i in torch.randint(len(vocab), (n,)).tolist()) for n in lengths]
    ...: 
    ...: # append the batch to the file, one 'embedding: completion' line per pair
    ...: # ('a' mode: rerunning this cell adds another batch to the same file)
    ...: with open(path, 'a') as f:
    ...:     for embedding, completion in zip(embeddings, completions):
    ...:         f.write(f'{embedding.tolist()}: {completion}\n')
    ...: 
    ...: # demonstrate that we can read the file back and access the embeddings
    ...: pattern = re.compile(r'(?P<embedding>\[.*\]):(?P<completion>.*)')
    ...: d = {}
    ...: with open(path, 'r') as f:
    ...:     for line in f:
    ...:         match = pattern.search(line)
    ...:         if match:
    ...:             # the captured completion keeps the space that follows ':'
    ...:             d[match['completion']] = match['embedding']
    ...: s = pd.Series(d)
    ...: # parse each list literal safely (instead of eval), one column per dimension
    ...: df = s.apply(ast.literal_eval).apply(pd.Series).astype(float)
    ...: df
Out[79]: 
                                                          0         1         2         3         4         5   ...        44        45        46        47        48        49
 zuuhvf.bz..tdpelpwjvcsrors kci.ar okbkkpixshkv...  0.741744 -0.862805  1.277946 -1.421647 -1.705466 -0.443937  ...  0.299054 -0.218979 -1.109969 -0.359506 -0.915400 -1.766922
 oz dronmvpdftjdgqlwzizul,iuwpmepbytlnqz,hg,vwk...  1.729253  1.164072  1.137700  0.152922 -1.188301 -1.088406  ...  0.655244  0.362267 -2.823253 -2.302453  1.399678  0.110596
 tgbcci.s,zwnypxykjz wccuklcsnorph,.cwfoh wb,qp...  1.276341  0.240622 -0.276085  0.800949 -1.574409  0.426971  ...  0.393489  1.085670 -0.247537 -1.103877 -1.031841  0.608018
 wbvatyi xdzp bhsrav,plggn.px hklj xn.taj. qwcn... -0.974218 -0.668266  0.780288 -0.639141  1.797099 -1.222686  ... -0.119840 -0.860228  0.537866  0.591747 -0.861042 -2.166075
 wjubnwxgduypwunkfzluxg.hmxfkmnjkwuxwkncuywn tb...  0.747176 -0.077088  3.004401  0.186372  0.743034 -2.106725  ...  0.476450 -0.776715 -1.472374  1.028687  0.094419  0.659627
 ysyk,vrklsneqned,.                                -1.823985 -1.596352 -0.603643 -0.080598 -1.067997  1.039300  ... -0.502950  0.134008 -0.891604 -1.050135  0.061257 -0.542316
 axhinjku,iainfntjao                                0.349668 -1.270654  0.628168  1.238131 -1.030950  0.672218  ...  0.373168  1.372561  1.012306 -0.487119 -1.473551  0.298252
 wapbpcabd.cphipk.ddq                               0.274228 -1.010845 -0.897133 -0.755404  0.760357  1.066223  ... -0.039380  0.847156 -1.087098  0.418657 -0.990457 -0.427856
 ahvvc haqqow rxytgdkwtkr                           0.107779 -0.355674 -0.025883  1.196853  1.755627  0.296679  ...  0.909813  0.367996 -0.555852  2.415046 -0.228443 -0.056342
 wcw,dvahx om                                       0.975540  1.764241  0.699774 -0.611210 -1.534957 -1.757873  ...  0.091626 -1.122119  0.702381 -0.661136  1.847190 -0.900383
 gopkgzvtcl bhaurei apjbduphixxuufswycl  uy gjk...  1.731822 -0.161552 -1.812288 -1.260306  0.302256 -0.087403  ... -0.724787  0.033582  1.097853  0.285895 -0.631340 -0.669583
 clhx.apgfjhvyijvfwtnpwt qtbh.bmu                   1.792319 -0.680207 -1.615576  0.392248 -0.132568  0.066515  ... -0.350620 -1.045586  0.156482 -0.734088 -0.497376  1.200735
 yhozxctozo ,gngxw                                  1.352631 -1.398814  1.025541  0.504703 -0.012212  0.542828  ...  0.362126  0.836864 -0.455345 -1.520990  1.643434  0.213910
  vi,enfb.poitrcqsbj fqptjxxwbutkibrrsvphk  g      -0.974508  0.320787 -0.999676  0.044449 -0.391696 -1.178463  ... -0.112004 -0.633312 -0.230595  0.050695 -1.821269  0.450727
 raynas,yihybmlhqjfzmggebubrre qaegmfptoxnzv t,...  0.110625  0.341678 -0.230905  1.866460 -0.275896 -0.780189  ...  1.396102 -0.766647 -0.998318  0.830169  0.679480  0.379022
 b,vi,                                             -0.713676  1.579101  1.004374 -0.124197 -0.999446  1.567706  ... -0.596651  0.274346  0.774035 -0.379948  0.040279  1.342850
 mb.eroknbis g o,ufpllytoiuyibwbadspootmvyunmkw...  0.990865  0.276213 -1.065938 -0.893349  0.152648  0.016756  ...  0.750412 -0.233488 -0.233033 -0.807460 -2.395447  0.680787
 jlwuwdefxlebjht megxh.opqbg,dpmyt,.aumry.uoohxg   -0.204542 -0.288725 -1.489331  0.502579  1.188192  0.114207  ...  0.234925 -0.106805  1.868982 -0.415197  0.096752 -0.687856
 lohsrxczpuf.grspbkejmvjolgt xmderojawyfxnkh sq... -0.162447 -0.041570  1.639209 -0.171020  0.220670 -0.002051  ...  0.308131 -1.116436  1.708857  0.906514 -0.607612 -0.490218
 gdyzuqslmrdrfo.lwzowiq.orujoyuycjw                -0.480004 -1.332737 -1.669961  0.035516  0.230732  1.481147  ...  1.335589  1.564368 -2.003112  1.048481 -0.808099 -0.124068

[20 rows x 50 columns]