vitalwarley commented 11 months ago

41

https://arxiv.org/pdf/2006.01615

vitalwarley commented 11 months ago

You are about to run mtcf as a batch (9 trials)
  batch-size: [200, 400, 600]
  device: '0'
  end-lr: 0.0005
  l2-factor: 0.0002
  loss-log-step: 100
  num-epoch: [4, 8, 12]
  output-dir: exp
  root-dir: rfiw2021/Track1
  start-lr: 0.001
  train-dataset-path: rfiw2021/Track1/sample0/train_sort.txt
  val-dataset-path: rfiw2021/Track1/sample0/val_choose.txt
  weights: weights/ms1mv3_arcface_r100_fp16.pth
Continue? (Y/n)

Dispatched on RIG2. I'll share details on how I did it shortly.
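The "9 trials" in the prompt above come from the two list-valued flags (3 batch sizes times 3 epoch counts). A minimal sketch of that expansion in plain Python, assuming the batch is a simple Cartesian product over list-valued flags; `expand_grid` is a hypothetical helper for illustration, not part of the tool used here, and only a subset of the flags is copied:

```python
from itertools import product

# Flag values copied from the batch prompt above; list-valued flags form the grid.
flags = {
    "batch-size": [200, 400, 600],
    "num-epoch": [4, 8, 12],
    "start-lr": 0.001,
    "end-lr": 0.0005,
    "l2-factor": 0.0002,
}


def expand_grid(flags):
    """Return one dict per trial: the Cartesian product of all list-valued flags."""
    keys = list(flags)
    # Scalars are wrapped in a one-element list so they stay fixed in every trial.
    value_lists = [v if isinstance(v, list) else [v] for v in flags.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


trials = expand_grid(flags)
print(len(trials))  # number of trials in the batch
```

Each element of `trials` holds one concrete combination of `batch-size` and `num-epoch`, with the scalar flags repeated unchanged.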

vitalwarley commented 11 months ago

The experiments did not converge, probably because of the loss I was using (CrossEntropyLoss). I changed the code to use BCEWithLogits, which is in fact the correct loss according to the paper.
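For reference, a minimal sketch of what a BCE-with-logits loss computes, written in plain Python rather than with `torch.nn.BCEWithLogitsLoss` so it runs without dependencies. It assumes the model emits a single raw logit per pair and a target of 1 (kin) or 0 (non-kin); the function names are illustrative, not taken from the training code:

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw (pre-sigmoid) logit.

    Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))],
    rewritten so that exp() is never called on a large positive number.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-sample loss over a batch."""
    return sum(bce_with_logits(x, t) for x, t in zip(logits, targets)) / len(logits)


# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(bce_with_logits(4.0, 1.0), bce_with_logits(4.0, 0.0))
```

The sigmoid is folded into the loss, so the network output should not be passed through a sigmoid beforehand.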

vitalwarley commented 11 months ago

More epochs are needed.

You are about to stage trials for mtcf as a batch (9 trials)
  batch-size: [200, 400, 600]
  device: '0'
  end-lr: 0.0005
  l2-factor: 0.0002
  loss-log-step: 100
  num-epoch: [16, 20, 24]
  output-dir: exp
  root-dir: rfiw2021/Track1
  start-lr: 0.001
  train-dataset-path: rfiw2021/Track1/sample0/train_sort.txt
  val-dataset-path: rfiw2021/Track1/sample0/val_choose.txt
  weights: weights/ms1mv3_arcface_r100_fp16.pth

vitalwarley commented 11 months ago

Above, epoch_acc comes from the training set, while epoch_auc comes from the validation set used to choose the threshold on z (the probability of being kin or non-kin).
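A minimal sketch of how an AUC and a decision threshold can be derived from validation scores. The selection rule used here (maximizing TPR minus FPR, i.e. Youden's J) is an assumption for illustration; the thread does not state which rule the validation script applies, and the function names are hypothetical:

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) triples, one per distinct score, high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp, i = [], 0, 0, 0
    while i < len(ranked):
        thr = ranked[i][0]
        # Consume tied scores together so they share a single ROC point.
        while i < len(ranked) and ranked[i][0] == thr:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((thr, fp / neg, tp / pos))
    return points


def auc_and_threshold(scores, labels):
    """Trapezoidal AUC plus the threshold that maximizes TPR - FPR (assumed rule)."""
    points = roc_points(scores, labels)
    auc, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    best_thr = max(points, key=lambda p: p[2] - p[1])[0]
    return auc, best_thr


# Toy scores: a pair is predicted "kin" when its score is >= the chosen threshold.
auc, thr = auc_and_threshold([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc, thr)
```

The threshold is picked on validation data only and then reused unchanged on the test set.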

vitalwarley commented 11 months ago

Indeed, more epochs helped. So did a larger batch size, at the same epoch.

vitalwarley commented 11 months ago

You are about to stage trials for mtcf as a batch (9 trials)
  batch-size: [1024, 2048, 3072]
  device: '0'
  end-lr: 0.0005
  l2-factor: 0.0002
  loss-log-step: 100
  num-epoch: [50, 100, 150]
  output-dir: exp
  root-dir: rfiw2021/Track1
  start-lr: 0.001
  train-dataset-path: rfiw2021/Track1/sample0/train_sort.txt
  val-dataset-path: rfiw2021/Track1/sample0/val_choose.txt
  weights: weights/ms1mv3_arcface_r100_fp16.pth
Continue? (Y/n)

I think this will be enough to retrieve the best model and evaluate it on the remaining sets.

vitalwarley commented 11 months ago

There are still some experiments running, such as the one below.

However, they are unlikely to exceed 0.7. Since the authors do not report AUC, only accuracy on the test set, we cannot tell whether the reproduction was satisfactory in terms of results. On the other hand, in terms of architecture and hyperparameters, I believe it was not. I list the reasons below.

- I could not reproduce the dataset used for training. After augmenting the data with the strategies "Augmentation of positive pairs from the same generation" and "Augmentation of negative pairs for each kinship type, indirectly increasing gender and age variation in those pairs", I ended up with 271k samples. The authors cite 249k pairs in the training set.
  - 135k training samples after duplicating the same-generation pairs; 271k in the end, after adding one random negative sample per positive pair.
  - 6400 validation samples (val_choose) after duplicating the same-generation pairs; 12.8k in the end, after adding one random negative sample per positive pair.
  - 129k validation samples (val) after duplicating the same-generation pairs; 258k in the end, after adding one random negative sample per positive pair.
  - Interestingly, the authors cite 129k, which is my initial number. Perhaps SOTA2021 modified the original validation set?
- Even with a different dataset, I went ahead and ran experiments with the hyperparameters cited in the paper. See #41 for more details. One of the experiments follows below.
- The values for the contrast, brightness, and saturation transformations are not given, so I set them arbitrarily to 0.5. It was also unclear whether these transformations, together with random horizontal flipping, were applied all at once with probability 0.5, or whether each one was applied independently with probability 0.5. I decided to apply them all at once.
- I used r101 instead of r50, since I already had the model pre-trained on MS1Mv3.
- I assumed that the value 0.2 in the local expert equation is the negative_slope parameter of the leaky_relu activation function, as I did here.
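The two implementation assumptions listed above (applying all augmentations jointly with probability 0.5, and reading 0.2 as the `negative_slope` of `leaky_relu`) can be sketched in plain Python as follows. This is only an illustration of the two readings: the helper names and the augmentation labels are hypothetical, and the actual training code presumably uses the corresponding PyTorch/torchvision operations:

```python
import random


def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU on a scalar: identity for x >= 0, scaled by negative_slope otherwise."""
    return x if x >= 0 else negative_slope * x


# Labels only; jitter strength 0.5 was chosen arbitrarily in the reproduction.
AUGMENTATIONS = ("brightness", "contrast", "saturation", "horizontal_flip")


def pick_augmentations(rng, jointly=True, p=0.5):
    """Return the names of the augmentations to apply to one image."""
    if jointly:
        # One coin flip gates the whole group (the reading adopted above).
        return list(AUGMENTATIONS) if rng.random() < p else []
    # Alternative reading: an independent coin flip per augmentation.
    return [name for name in AUGMENTATIONS if rng.random() < p]


rng = random.Random(0)
print(leaky_relu(-1.0), pick_augmentations(rng), pick_augmentations(rng, jointly=False))
```

Under the joint reading an image is either left untouched or receives all four transformations; under the independent reading any subset can occur, so the two policies yield different augmentation distributions even with the same `p`.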

Several other possibilities could be tried with this strategy, but I don't feel it is worth the effort for now, given the gap between its AUC and that of SOTA2021.

vitalwarley commented 11 months ago

Results of the best model so far (one experiment is still running):

Validation: 0.685 AUC @ 0.155 (probability threshold)
Test: 0.614 ACC @ 0.155

The authors obtained 0.736 accuracy. Therefore, I was unable to reproduce their results.
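For clarity on how a test accuracy "@ 0.155" is obtained, here is a minimal sketch that applies a threshold fixed on validation data to predicted kin probabilities. The function name is hypothetical, and treating a score equal to the threshold as "kin" is an assumption:

```python
def accuracy_at_threshold(probs, labels, threshold=0.155):
    """Fraction of pairs classified correctly when prob >= threshold means 'kin'."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)


# Toy example: four pairs, one of which is a false positive at this threshold.
print(accuracy_at_threshold([0.2, 0.1, 0.5, 0.05], [1, 0, 0, 0]))
```

Because the threshold is frozen before touching the test set, the reported accuracy reflects both the ranking quality of the scores and how well the validation threshold transfers.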

Validation log

```
➜ ours git:(main) ✗ guild run mtcf:val root-dir=rfiw2021/Track1 dataset-path=rfiw2021/Track1/sample0/val.txt output-dir=exp weights=exp/best.pth operation:mtcf=`guild select -Fo mtcf -Sc --max 'epoch_auc'`
Refreshing flags...
WARNING: cannot import flags from train_fc.py: ModuleNotFoundError: No module named 'dataset' (run with guild --debug for details)
WARNING: cannot import flags from train_kv.py: ModuleNotFoundError: No module named 'dataset' (run with guild --debug for details)
You are about to run mtcf:val
  batch-size: 1024
  dataset-path: rfiw2021/Track1/sample0/val.txt
  device: '0'
  operation:mtcf: aa81ce57e78444e7880b5fa48cbeaa91
  output-dir: exp
  root-dir: rfiw2021/Track1
  weights: exp/best.pth
Continue? (Y/n)
Resolving file:weights
Resolving file:../rfiw2021/
Resolving file:models/insightface
Resolving operation:mtcf
Using run aa81ce57e78444e7880b5fa48cbeaa91 for operation:mtcf
2023-12-23 16:05:20.804376: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-23 16:05:20.806769: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 16:05:20.847495: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-23 16:05:20.847521: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-23 16:05:20.848638: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-23 16:05:20.855000: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 16:05:20.855205: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-23 16:05:21.648149: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(root_dir='rfiw2021/Track1', dataset_path='rfiw2021/Track1/sample0/val.txt', weights='exp/best.pth', output_dir=PosixPath('exp'), batch_size=1024, device='0', func=)
Current CUDA Device = 0
Device Name = NVIDIA GeForce RTX 3090
Loaded 129032 samples from rfiw2021/Track1/sample0/val.txt (with duplicated samples for same generation bb, ss, sibs).
Adding 1 negative samples per sample...
Added negative samples, now we have 258064 samples.
Validating... ██████████|253/253 [06:31<00:00, 1.55s/it]
auc: 0.685 | thresh: 0.155
```

Test log

```
➜ ours git:(main) ✗ guild run mtcf:test root-dir=rfiw2021/Track1 dataset-path=rfiw2021/Track1/sample0/test.txt output-dir=exp weights=exp/best.pth operation:mtcf=`guild select -Fo mtcf -Sc --max 'epoch_auc'` threshold=0.155
Refreshing flags...
WARNING: cannot import flags from train_fc.py: ModuleNotFoundError: No module named 'dataset' (run with guild --debug for details)
WARNING: cannot import flags from train_kv.py: ModuleNotFoundError: No module named 'dataset' (run with guild --debug for details)
You are about to run mtcf:test
  batch-size: 1024
  dataset-path: rfiw2021/Track1/sample0/test.txt
  device: '0'
  operation:mtcf: aa81ce57e78444e7880b5fa48cbeaa91
  output-dir: exp
  root-dir: rfiw2021/Track1
  threshold: 0.155
  weights: exp/best.pth
Continue? (Y/n)
Resolving file:weights
Resolving file:../rfiw2021/
Resolving file:models/insightface
Resolving operation:mtcf
Using run aa81ce57e78444e7880b5fa48cbeaa91 for operation:mtcf
2023-12-23 16:13:54.248687: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-23 16:13:54.251028: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 16:13:54.291826: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-23 16:13:54.291850: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-23 16:13:54.292940: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-23 16:13:54.299159: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 16:13:54.299352: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-23 16:13:55.072722: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Namespace(root_dir='rfiw2021/Track1', dataset_path='rfiw2021/Track1/sample0/test.txt', weights='exp/best.pth', threshold=0.155, output_dir=PosixPath('exp'), batch_size=1024, device='0', func=)
Current CUDA Device = 0
Device Name = NVIDIA GeForce RTX 3090
Loaded 39743 samples from rfiw2021/Track1/sample0/test.txt.
Validating... ██████████|39/39 [01:04<00:00, 1.65s/it]
acc: 0.614
```

vitalwarley / research

Reproduce "A Multi-Task Comparator Framework for Kinship Verification" #49

41