vitalwarley / research

Create a new t-SNE plot for positive and negative pairs #44

Closed vitalwarley closed 11 months ago

vitalwarley commented 11 months ago

The goal is to evaluate the geometry of the embeddings for just two pair types: positive and negative.

Continuation of #29.

vitalwarley commented 11 months ago

I will need to retrain the model. Apparently I deleted it.

vitalwarley commented 11 months ago

For the record...

python Track1/train.py --batch_size 25 --sample Track1/sample0  --save_path Track1/model_track1.pth --epochs 80 --beta 0.08 --log_path Track1/log_rig1.txt --gpu 0
... ~ 1h
python Track1/find.py --batch_size 40 --sample Track1/sample0  --save_path Track1/model_track1.pth --log_path Track1/logs_rig1.txt --gpu 0
auc :  0.8611047681264501
threshold : 0.11043223738670349

Similar to last time.

vitalwarley commented 11 months ago

Instead of creating a new plot, I will look into how to create an interactive plot, where

vitalwarley commented 11 months ago

image image

It is not the interactive plot I wanted, but it may be better. Keeping the plot in the notebook is useful. Kin pairs are in blue, non-kin pairs in red. Only 5 families were selected. With more than that the plot becomes hard to read, as shown below for 10 random families. In both cases I condition on the families of face 1.

image

https://github.com/vitalwarley/research/assets/6365065/c3cb7f21-e807-49f3-852c-231bdc5b1177

It seems that the clusters within the pairs with true kinship converge to a single kinship type? Worth investigating.

The next step is to run more iterations with cuml.

vitalwarley commented 11 months ago
Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([  7,   7,   7, ..., 333, 986, 736]), array([  7,   7,   7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selected families for Face1 (5): [250 283 409 735 873]
Selected families for Face2 (186): [   7   22   40   44   53   63   65   71   74   87   91  109  112  114
  119  123  126  133  139  147  148  159  162  165  167  169  170  172
  176  182  183  199  200  205  220  226  233  236  238  240  243  245
  250  266  278  281  283  287  290  304  309  311  312  317  324  328
  330  333  342  344  351  358  360  370  384  386  390  398  407  409
  417  421  422  427  431  438  440  443  446  448  450  457  463  468
  470  481  487  488  490  500  505  510  511  513  516  520  522  530
  531  534  538  547  562  563  568  573  575  581  603  608  617  620
  621  627  632  633  644  649  652  660  663  665  666  667  669  674
  679  681  689  693  697  705  709  713  719  724  728  731  735  736
  750  752  755  758  766  769  785  791  797  800  809  815  826  831
  832  833  836  841  853  858  871  872  873  879  893  905  910  912
  914  916  917  919  921  925  927  930  931  939  943  970  982  986
  990  996  999 1004]
Mean individuals per family for Face1: 4.167048054919908
SD individuals per family for Face1: 80.33680254547787
Mean individuals per family for Face2: 3.6238805970149253
SD individuals per family for Face2: 27.21039870851381
Kinship mapping: {'bb': 0, 'fd': 1, 'fs': 2, 'gfgd': 3, 'gfgs': 4, 'gmgd': 5, 'gmgs': 6, 'md': 7, 'ms': 8, 'sibs': 9, 'ss': 10}
Count kinship relations for Face1: [651 316 576  72 110  75  82 325 479 605 351]
Count kinship relations for Face2: [651 316 576  72 110  75  82 325 479 605 351]
Setting up t-SNE...
TSNE()
[D] [18:24:30.134621] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [18:24:30.134658] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [18:24:30.134676] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 New n_neighbors = 97, learning_rate = 1214.0, exaggeration = 12.0
[D] [18:24:30.134712] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:69 Data size = (3642, 1024) with dim = 2 perplexity = 50.000000
[W] [18:24:30.134719] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [18:24:30.134728] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:107 Getting distances.
[D] [18:24:30.142550] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:142 Now normalizing distances so exp(D) doesn't explode.
[D] [18:24:30.142620] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:150 Searching for optimal perplexity via bisection search.
[D] [18:24:32.197394] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 [t-SNE] KL divergence: 0.5571604371070862
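
The warning in the log above fires because cuML picked 97 neighbors while the perplexity was 50, and the rule it states asks for at least 3 × perplexity. A minimal sketch of that check follows; the helper names are hypothetical, and only the numbers come from the log.

```python
def meets_neighbor_rule(n_neighbors, perplexity):
    """True when n_neighbors is at least 3 * perplexity, as the warning asks."""
    return n_neighbors >= 3 * perplexity


def max_perplexity(n_neighbors):
    """Largest integer perplexity that still satisfies the rule."""
    return n_neighbors // 3


# Values taken from the cuML log above.
logged_neighbors, logged_perplexity = 97, 50

rule_ok = meets_neighbor_rule(logged_neighbors, logged_perplexity)
largest_ok = max_perplexity(logged_neighbors)

print(rule_ok)      # False, since 97 < 3 * 50
print(largest_ok)   # 32
```

Under this rule, either the perplexity would need to drop to about 32 or the neighbor count would need to rise to 150 for the warning to go away.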

newplot

We can see that there are 5 families with true kinship (all coming from face 1), while face 2 contributes 186. This is interesting because it shows how the concatenated embeddings behave. A next evaluation would be to compare fusion by averaging against concatenation.
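
To make the two fusion options concrete, here is a small sketch with NumPy. It assumes each face embedding is 512-dimensional (inferred from the 1024-dimensional concatenated input in the log, not confirmed by the thread); `fuse_pairs` is a hypothetical helper and the data is random.

```python
import numpy as np


def fuse_pairs(face1, face2, mode="concat"):
    """Fuse two (n_pairs, dim) embedding arrays into one vector per pair."""
    face1 = np.asarray(face1, dtype=np.float32)
    face2 = np.asarray(face2, dtype=np.float32)
    if face1.shape != face2.shape:
        raise ValueError("face1 and face2 must have the same shape")
    if mode == "concat":
        # Keeps both embeddings side by side; the result depends on pair order.
        return np.concatenate([face1, face2], axis=1)
    if mode == "mean":
        # Averages the embeddings; the result is symmetric in the pair.
        return (face1 + face2) / 2.0
    raise ValueError(f"unknown fusion mode: {mode!r}")


# Toy stand-ins for real embeddings (assumed dimension: 512).
rng = np.random.default_rng(0)
face1 = rng.normal(size=(8, 512))
face2 = rng.normal(size=(8, 512))

concat = fuse_pairs(face1, face2, "concat")
mean = fuse_pairs(face1, face2, "mean")
print(concat.shape, mean.shape)
```

One difference worth keeping in mind when comparing the plots: concatenation treats (face 1, face 2) and (face 2, face 1) as different points, while the mean does not.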

I also added different markers for the kinship types. Some of them repeat because fewer symbols are available when we use 3 components. The frequencies are:

{'bb': 651,
 'sibs': 605,
 'fs': 576,
 'ms': 479,
 'ss': 351,
 'md': 325,
 'fd': 316,
 'gfgs': 110,
 'gmgs': 82,
 'gmgd': 75,
 'gfgd': 72}
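
A table like the one above, sorted by count, can be built with `collections.Counter`. This is only a sketch on toy labels; `kinship_frequency` is a hypothetical name and not necessarily how the notebook computes it.

```python
from collections import Counter


def kinship_frequency(labels):
    """Return {kinship: count}, ordered from most to least frequent."""
    return dict(Counter(labels).most_common())


# Toy labels; the real ones would come from the pair list.
labels = ["bb"] * 3 + ["sibs"] * 2 + ["gfgd"]
freq = kinship_frequency(labels)
print(freq)
```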

It may also be worth restricting the kinship types when filtering the embeddings.

Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([  7,   7,   7, ..., 333, 986, 736]), array([  7,   7,   7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selected families for Face1 (5): [250 283 409 735 873]
Selected families for Face2 (186): [   7   22   40   44   53   63   65   71   74   87   91  109  112  114
  119  123  126  133  139  147  148  159  162  165  167  169  170  172
  176  182  183  199  200  205  220  226  233  236  238  240  243  245
  250  266  278  281  283  287  290  304  309  311  312  317  324  328
  330  333  342  344  351  358  360  370  384  386  390  398  407  409
  417  421  422  427  431  438  440  443  446  448  450  457  463  468
  470  481  487  488  490  500  505  510  511  513  516  520  522  530
  531  534  538  547  562  563  568  573  575  581  603  608  617  620
  621  627  632  633  644  649  652  660  663  665  666  667  669  674
  679  681  689  693  697  705  709  713  719  724  728  731  735  736
  750  752  755  758  766  769  785  791  797  800  809  815  826  831
  832  833  836  841  853  858  871  872  873  879  893  905  910  912
  914  916  917  919  921  925  927  930  931  939  943  970  982  986
  990  996  999 1004]
Mean individuals per family for Face1: 4.167048054919908
SD individuals per family for Face1: 80.33680254547787
Mean individuals per family for Face2: 3.6238805970149253
SD individuals per family for Face2: 27.21039870851381
Kinship mapping: {'bb': 0, 'fd': 1, 'fs': 2, 'gfgd': 3, 'gfgs': 4, 'gmgd': 5, 'gmgs': 6, 'md': 7, 'ms': 8, 'sibs': 9, 'ss': 10}
Count kinship relations for Face1: [651 316 576  72 110  75  82 325 479 605 351]
Count kinship relations for Face2: [651 316 576  72 110  75  82 325 479 605 351]
Setting up t-SNE...
TSNE(init='random', n_components=3, n_iter=5000, perplexity=50, verbose=True)
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 3642 samples in 0.001s...
[t-SNE] Computed neighbors for 3642 samples in 0.257s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3642
[t-SNE] Computed conditional probabilities for sample 2000 / 3642
[t-SNE] Computed conditional probabilities for sample 3000 / 3642
[t-SNE] Computed conditional probabilities for sample 3642 / 3642
[t-SNE] Mean sigma: 12.238751
[t-SNE] KL divergence after 250 iterations with early exaggeration: 68.727356
[t-SNE] KL divergence after 2500 iterations: 0.923691

newplot_3d

vitalwarley commented 11 months ago
for perplexity in [20, 50, 100]:
    plot_tsne(n_components=2, fids=[250, 283, 409, 735, 873], kinships=['bb', 'sibs', 'ss'], perplexity=perplexity)
... # second plot, with perplexity=50
Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([  7,   7,   7, ..., 333, 986, 736]), array([  7,   7,   7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selecting families ['bb', 'sibs', 'ss'] from face1.
Selected 28458 embeddings.
Selected families for Face1 (4): [283 409 735 873]
Selected families for Face2 (137): [  7  53  71  74  87  91 109 112 114 119 123 139 147 162 165 170 172 176
 183 199 200 205 220 226 233 236 238 243 245 278 281 283 287 290 304 309
 311 312 317 330 333 342 344 358 360 390 398 409 417 421 422 427 431 438
 440 443 446 448 450 457 468 470 481 487 488 490 500 505 513 516 520 530
 531 547 563 568 573 575 581 617 620 621 633 644 649 652 660 665 669 674
 681 689 693 709 713 724 728 731 735 736 750 752 755 758 766 769 785 791
 797 815 826 831 832 833 836 841 858 872 873 879 893 910 912 914 919 921
 925 927 930 931 939 943 970 986 990 996 999]
Mean individuals per family for Face1: 1.8386727688787186
SD individuals per family for Face1: 38.24655430645041
Mean individuals per family for Face2: 1.607
SD individuals per family for Face2: 12.913580100034224
Kinship mapping: {'bb': 0, 'sibs': 1, 'ss': 2}
[0 0 0 ... 2 2 2]
Count kinship relations for Face1: [651 605 351]
Count kinship relations for Face2: [651 605 351]
Setting up t-SNE...

Note that we have only ~1600 embeddings coming from the specific (positive) families and from the kinship types bb, sibs, and ss.
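
The two-stage selection (by face 1 family, then by kinship type) can be sketched with boolean masks. This is an illustration on toy arrays shaped like the three arrays printed in the log (kinship label, face 1 family id, face 2 family id). `select_pairs` is a hypothetical helper, and treating "same family id" as a positive pair is an assumption.

```python
import numpy as np


def select_pairs(kin, fid1, fid2, fids=None, kinships=None):
    """Return (mask, is_kin) for pairs matching the family and kinship filters."""
    mask = np.ones(len(kin), dtype=bool)
    if fids is not None:
        # Condition on the family of face 1 only.
        mask &= np.isin(fid1, fids)
    if kinships is not None:
        mask &= np.isin(kin, kinships)
    # Assumption: a pair is positive when both faces share a family id.
    is_kin = fid1 == fid2
    return mask, is_kin


# Toy data, not the real pair list.
kin = np.array(["sibs", "sibs", "bb", "fd", "gmgd", "bb"])
fid1 = np.array([7, 250, 250, 283, 333, 409])
fid2 = np.array([7, 250, 443, 283, 443, 409])

mask, is_kin = select_pairs(
    kin, fid1, fid2, fids=[250, 283, 409], kinships=["bb", "sibs", "ss"]
)
print(mask.sum(), is_kin[mask])
```

The resulting `mask` would index the embedding matrix before t-SNE, and `is_kin[mask]` would drive the blue/red coloring.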

https://github.com/vitalwarley/research/assets/6365065/4536a67e-e68c-4f97-afe7-14c4c1b0c0f2

vitalwarley commented 11 months ago
Mean individuals per family for Face1: 1.9405034324942791
SD individuals per family for Face1: 34.261761684206576
Mean individuals per family for Face2: 1.6875621890547263
SD individuals per family for Face2: 12.304345166544856
Kinship mapping: {'fd': 0, 'fs': 1, 'md': 2, 'ms': 3}
Count kinship relations for Face1: [316 576 325 479]
Count kinship relations for Face2: [316 576 325 479]
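
A note on reading the mean and SD lines: the Face1 mean of 1.9405 equals 1696 / 874, that is, the number of selected pairs (316 + 576 + 325 + 479) divided by the largest Face1 family id plus one (873 + 1). That is consistent with counting over every possible id, including ids that never occur, for example via `np.bincount`. This is an assumption, since the notebook code is not shown here. The sketch below contrasts that with counting only the ids that are present.

```python
import numpy as np

# Toy family ids for face 1; 873 is the largest id, as in the log.
fid1 = np.array([250, 250, 250, 283, 873])

# One slot per id from 0 to max(fid1); most slots are zero.
dense = np.bincount(fid1)

# Counts only for the ids that actually occur.
_, present = np.unique(fid1, return_counts=True)

print(len(dense), dense.mean(), dense.std())
print(len(present), present.mean(), present.std())
```

If the dense version is what produced the log, the small mean and large SD mostly reflect the many empty ids rather than the selected families, and the counts are pairs rather than individuals.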

Similarly, the plot considering fd, fs, md, and ms

https://github.com/vitalwarley/research/assets/6365065/028e1fc0-6b19-46dc-8165-b2cd6b4b32cf

vitalwarley commented 11 months ago

Now, for gfgd, gfgs, gmgd, and gmgs, we get ~a surprise~ with 2 t-SNE components

image

whereas with 3 components we get ~something more legible~

https://github.com/vitalwarley/research/assets/6365065/54d016b9-7112-401d-a767-888470351c81

Both used a perplexity of 20, given the limited number of samples (<350).
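
A rough way to see why a small perplexity suits a small subset: using the 3 × perplexity neighbor rule quoted in the cuML warning earlier in this thread, one can estimate what share of the other samples each point treats as neighbors. The sketch assumes the grandparent subset has 72 + 110 + 75 + 82 = 339 pairs, as the earlier counts suggest; `neighbor_fraction` is a hypothetical helper.

```python
def neighbor_fraction(perplexity, n_samples):
    """Share of the other samples used as neighbors at about 3 * perplexity."""
    return min(1.0, 3 * perplexity / (n_samples - 1))


# Assumed subset size, taken from the gfgd, gfgs, gmgd, gmgs counts listed earlier.
n_samples = 72 + 110 + 75 + 82

fractions = {p: neighbor_fraction(p, n_samples) for p in (20, 50, 100)}
for p, frac in fractions.items():
    print(f"perplexity={p}: about {frac:.0%} of the other samples are neighbors")
```

At perplexity 100, nearly every point would count as a neighbor of every other, so local structure is unlikely to show; 20 keeps the neighborhoods comparatively local.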