I will need to retrain the model. Apparently I deleted it.
For the record...
python Track1/train.py --batch_size 25 --sample Track1/sample0 --save_path Track1/model_track1.pth --epochs 80 --beta 0.08 --log_path Track1/log_rig1.txt --gpu 0
... ~ 1h
python Track1/find.py --batch_size 40 --sample Track1/sample0 --save_path Track1/model_track1.pth --log_path Track1/logs_rig1.txt --gpu 0
auc : 0.8611047681264501
threshold : 0.11043223738670349
Similar to last time.
Instead of creating a new plot, I will look into building an interactive one, where ...
It is not the interactive plot I wanted, but perhaps better. It is useful to keep the plot in the notebook. Blue marks kin; red marks non-kin. We can see that only 5 families were selected. Any more than that and the plot gets rather illegible, as we see below for 10 random families. In both cases I condition on the face 1 families.
https://github.com/vitalwarley/research/assets/6365065/c3cb7f21-e807-49f3-852c-231bdc5b1177
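A minimal sketch of how such a scatter could be drawn, assuming the 2D t-SNE projection and a per-pair kin label are already available (all names below are illustrative, not the notebook's actual code):

import numpy as np
import plotly.express as px

# proj: hypothetical (N, 2) t-SNE projection; is_kin: hypothetical boolean label per pair
proj = np.random.randn(100, 2)
is_kin = np.random.rand(100) > 0.5

fig = px.scatter(
    x=proj[:, 0],
    y=proj[:, 1],
    color=np.where(is_kin, "kin", "non-kin"),
    color_discrete_map={"kin": "blue", "non-kin": "red"},  # blue = kin, red = non-kin
)
fig.show()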
It seems that the clusters within the truly related pairs converge to a single kinship type? Worth investigating.
The next step is to run more iterations with cuML.
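A minimal sketch of what iterating more with cuML could look like, assuming the concatenated pair embeddings are available as an (N, 1024) float32 array (the file path and variable names are assumptions):

import numpy as np
from cuml.manifold import TSNE  # GPU-accelerated t-SNE from RAPIDS cuML

embeddings = np.load("embeddings.npy").astype(np.float32)  # illustrative path

# cuML's t-SNE outputs 2 components; raise n_iter to let the layout converge further
tsne = TSNE(n_components=2, perplexity=50, n_iter=5000, verbose=True)
proj = tsne.fit_transform(embeddings)  # (N, 2) coordinates for plotting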
Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([ 7, 7, 7, ..., 333, 986, 736]), array([ 7, 7, 7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selected families for Face1 (5): [250 283 409 735 873]
Selected families for Face2 (186): [ 7 22 40 44 53 63 65 71 74 87 91 109 112 114
119 123 126 133 139 147 148 159 162 165 167 169 170 172
176 182 183 199 200 205 220 226 233 236 238 240 243 245
250 266 278 281 283 287 290 304 309 311 312 317 324 328
330 333 342 344 351 358 360 370 384 386 390 398 407 409
417 421 422 427 431 438 440 443 446 448 450 457 463 468
470 481 487 488 490 500 505 510 511 513 516 520 522 530
531 534 538 547 562 563 568 573 575 581 603 608 617 620
621 627 632 633 644 649 652 660 663 665 666 667 669 674
679 681 689 693 697 705 709 713 719 724 728 731 735 736
750 752 755 758 766 769 785 791 797 800 809 815 826 831
832 833 836 841 853 858 871 872 873 879 893 905 910 912
914 916 917 919 921 925 927 930 931 939 943 970 982 986
990 996 999 1004]
Mean individuals per family for Face1: 4.167048054919908
SD individuals per family for Face1: 80.33680254547787
Mean individuals per family for Face2: 3.6238805970149253
SD individuals per family for Face2: 27.21039870851381
Kinship mapping: {'bb': 0, 'fd': 1, 'fs': 2, 'gfgd': 3, 'gfgs': 4, 'gmgd': 5, 'gmgs': 6, 'md': 7, 'ms': 8, 'sibs': 9, 'ss': 10}
Count kinship relations for Face1: [651 316 576 72 110 75 82 325 479 605 351]
Count kinship relations for Face2: [651 316 576 72 110 75 82 325 479 605 351]
Setting up t-SNE...
TSNE()
[D] [18:24:30.134621] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [18:24:30.134658] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [18:24:30.134676] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 New n_neighbors = 97, learning_rate = 1214.0, exaggeration = 12.0
[D] [18:24:30.134712] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:69 Data size = (3642, 1024) with dim = 2 perplexity = 50.000000
[W] [18:24:30.134719] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [18:24:30.134728] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:107 Getting distances.
[D] [18:24:30.142550] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:142 Now normalizing distances so exp(D) doesn't explode.
[D] [18:24:30.142620] /__w/cuml/cuml/cpp/src/tsne/tsne_runner.cuh:150 Searching for optimal perplexity via bisection search.
[D] [18:24:32.197394] /__w/cuml/cuml/python/_skbuild/linux-x86_64-3.10/cmake-build/cuml/internals/logger.cxx:5234 [t-SNE] KL divergence: 0.5571604371070862
We can see that there are 5 families with true kinship (all coming from face 1), whereas from face 2 we have 186. This is interesting because it shows us how the concatenated embeddings behave. A next evaluation would be to compare fusing by averaging instead of concatenating.
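A quick sketch of the two fusion strategies, assuming each face contributes a 512-d embedding (only the concatenated size of 1024 appears in the log above, so the 512 is an assumption):

import numpy as np

# emb1, emb2: hypothetical per-face embeddings, one row per pair
emb1 = np.random.randn(3642, 512).astype(np.float32)
emb2 = np.random.randn(3642, 512).astype(np.float32)

concat_fusion = np.concatenate([emb1, emb2], axis=1)  # (N, 1024), as visualized above
mean_fusion = (emb1 + emb2) / 2.0                      # (N, 512), the alternative to compare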
I also added different markers for the kinship types. Some repeat because of the limited set of symbols when we use 3 components (a sketch of that assignment follows the list below). We have the following frequencies:
{'bb': 651,
'sibs': 605,
'fs': 576,
'ms': 479,
'ss': 351,
'md': 325,
'fd': 316,
'gfgs': 110,
'gmgs': 82,
'gmgd': 75,
'gfgd': 72}
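An illustrative sketch of how the markers could be assigned; the symbol list is an assumption (Plotly's 3D scatter, for instance, only offers 8 symbols), not the notebook's actual choice:

import itertools

# 11 kinship types cycled over 8 available 3D marker symbols, so some types repeat
kinship_types = ['bb', 'fd', 'fs', 'gfgd', 'gfgs', 'gmgd', 'gmgs', 'md', 'ms', 'sibs', 'ss']
symbols_3d = ['circle', 'circle-open', 'cross', 'diamond',
              'diamond-open', 'square', 'square-open', 'x']
kin_to_symbol = dict(zip(kinship_types, itertools.cycle(symbols_3d)))
# 'ms', 'sibs' and 'ss' end up reusing the first three symbols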
It might also be interesting to limit the kinship types when filtering the embeddings.
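A minimal sketch of that kind of filtering, with illustrative array names (the notebook presumably does something equivalent inside plot_tsne, given the kinships argument used further below):

import numpy as np

kinships = np.array(['sibs', 'bb', 'md', 'ss', 'gmgd'])    # per-pair labels (illustrative)
embeddings = np.random.randn(5, 1024).astype(np.float32)   # concatenated pair features

keep = ['bb', 'sibs', 'ss']                 # restrict to a subset of kinship types
mask = np.isin(kinships, keep)
filtered = embeddings[mask]                 # only pairs with the selected kinships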
Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([ 7, 7, 7, ..., 333, 986, 736]), array([ 7, 7, 7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selected families for Face1 (5): [250 283 409 735 873]
Selected families for Face2 (186): [ 7 22 40 44 53 63 65 71 74 87 91 109 112 114
119 123 126 133 139 147 148 159 162 165 167 169 170 172
176 182 183 199 200 205 220 226 233 236 238 240 243 245
250 266 278 281 283 287 290 304 309 311 312 317 324 328
330 333 342 344 351 358 360 370 384 386 390 398 407 409
417 421 422 427 431 438 440 443 446 448 450 457 463 468
470 481 487 488 490 500 505 510 511 513 516 520 522 530
531 534 538 547 562 563 568 573 575 581 603 608 617 620
621 627 632 633 644 649 652 660 663 665 666 667 669 674
679 681 689 693 697 705 709 713 719 724 728 731 735 736
750 752 755 758 766 769 785 791 797 800 809 815 826 831
832 833 836 841 853 858 871 872 873 879 893 905 910 912
914 916 917 919 921 925 927 930 931 939 943 970 982 986
990 996 999 1004]
Mean individuals per family for Face1: 4.167048054919908
SD individuals per family for Face1: 80.33680254547787
Mean individuals per family for Face2: 3.6238805970149253
SD individuals per family for Face2: 27.21039870851381
Kinship mapping: {'bb': 0, 'fd': 1, 'fs': 2, 'gfgd': 3, 'gfgs': 4, 'gmgd': 5, 'gmgs': 6, 'md': 7, 'ms': 8, 'sibs': 9, 'ss': 10}
Count kinship relations for Face1: [651 316 576 72 110 75 82 325 479 605 351]
Count kinship relations for Face2: [651 316 576 72 110 75 82 325 479 605 351]
Setting up t-SNE...
TSNE(init='random', n_components=3, n_iter=5000, perplexity=50, verbose=True)
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 3642 samples in 0.001s...
[t-SNE] Computed neighbors for 3642 samples in 0.257s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3642
[t-SNE] Computed conditional probabilities for sample 2000 / 3642
[t-SNE] Computed conditional probabilities for sample 3000 / 3642
[t-SNE] Computed conditional probabilities for sample 3642 / 3642
[t-SNE] Mean sigma: 12.238751
[t-SNE] KL divergence after 250 iterations with early exaggeration: 68.727356
[t-SNE] KL divergence after 2500 iterations: 0.923691
for perplexity in [20, 50, 100]:
    plot_tsne(n_components=2, fids=[250, 283, 409, 735, 873], kinships=['bb', 'sibs', 'ss'], perplexity=perplexity)
... # second plot, with perplexity=50
Setting up data...
(array(['sibs', 'sibs', 'sibs', ..., 'gmgd', 'gmgd', 'gmgd'], dtype='<U21'), array([ 7, 7, 7, ..., 333, 986, 736]), array([ 7, 7, 7, ..., 443, 199, 990]))
Selecting families [250, 283, 409, 735, 873] from face1.
Selected 3642 embeddings.
Selecting families ['bb', 'sibs', 'ss'] from face1.
Selected 28458 embeddings.
Selected families for Face1 (4): [283 409 735 873]
Selected families for Face2 (137): [ 7 53 71 74 87 91 109 112 114 119 123 139 147 162 165 170 172 176
183 199 200 205 220 226 233 236 238 243 245 278 281 283 287 290 304 309
311 312 317 330 333 342 344 358 360 390 398 409 417 421 422 427 431 438
440 443 446 448 450 457 468 470 481 487 488 490 500 505 513 516 520 530
531 547 563 568 573 575 581 617 620 621 633 644 649 652 660 665 669 674
681 689 693 709 713 724 728 731 735 736 750 752 755 758 766 769 785 791
797 815 826 831 832 833 836 841 858 872 873 879 893 910 912 914 919 921
925 927 930 931 939 943 970 986 990 996 999]
Mean individuals per family for Face1: 1.8386727688787186
SD individuals per family for Face1: 38.24655430645041
Mean individuals per family for Face2: 1.607
SD individuals per family for Face2: 12.913580100034224
Kinship mapping: {'bb': 0, 'sibs': 1, 'ss': 2}
[0 0 0 ... 2 2 2]
Count kinship relations for Face1: [651 605 351]
Count kinship relations for Face2: [651 605 351]
Setting up t-SNE...
Note that we have only ~1600 embeddings coming from the specific (positive) families and the kinship types bb, sibs, and ss.
https://github.com/vitalwarley/research/assets/6365065/4536a67e-e68c-4f97-afe7-14c4c1b0c0f2
Mean individuals per family for Face1: 1.9405034324942791
SD individuals per family for Face1: 34.261761684206576
Mean individuals per family for Face2: 1.6875621890547263
SD individuals per family for Face2: 12.304345166544856
Kinship mapping: {'fd': 0, 'fs': 1, 'md': 2, 'ms': 3}
Count kinship relations for Face1: [316 576 325 479]
Count kinship relations for Face2: [316 576 325 479]
Similarly, the plot considering fd, fs, md, and ms:
https://github.com/vitalwarley/research/assets/6365065/028e1fc0-6b19-46dc-8165-b2cd6b4b32cf
Now for gfgd, gfgs, gmgd, and gmgs we get ~a surprise~ with 2 t-SNE components,
while with 3 components we get ~something more legible~:
https://github.com/vitalwarley/research/assets/6365065/54d016b9-7112-401d-a767-888470351c81
Both used a perplexity of 20, given the limited number of samples (<350).
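Presumably the calls were along these lines, with the same plot_tsne signature used above (the exact arguments are an assumption):

for n_components in [2, 3]:
    plot_tsne(n_components=n_components, fids=[250, 283, 409, 735, 873], kinships=['gfgd', 'gfgs', 'gmgd', 'gmgs'], perplexity=20)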
The goal is to evaluate the geometry of the embeddings only for the two pair types: positive and negative.
Continuation of #29.