kausmees / GenoCAE

Convolutional autoencoder for genotype data
BSD 3-Clause "New" or "Revised" License

Use 'evaluate' without the `HO_superpopulation` file #24

Closed by richelbilderbeek 2 years ago

richelbilderbeek commented 2 years ago

This is a note to self: I cannot assign myself because I am not a Collaborator, so I assign myself in text :-)

richelbilderbeek commented 2 years ago

Now this fails with error (from this GHA log):

Traceback (most recent call last):
  File "/home/runner/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1777, in genfromtxt
    fhd = iter(fid)
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_gcae.py", line 1616, in <module>
    main()
  File "run_gcae.py", line 1561, in main
    write_f1_scores_to_csv(results_directory, "epoch_{0}".format(epoch), superpopulations_file, f1_scores_by_pop, coords_by_pop)
  File "/home/runner/work/GenoCAE/GenoCAE/utils/visualization.py", line 603, in write_f1_scores_to_csv
    superpop_pop_dict = get_superpop_pop_dict(superpopulations_file)
  File "/home/runner/work/GenoCAE/GenoCAE/utils/data_handler.py", line 869, in get_superpop_pop_dict
    pop_superpop_list = get_pop_superpop_list(pop_superpop_file)
  File "/home/runner/work/GenoCAE/GenoCAE/utils/data_handler.py", line 626, in get_pop_superpop_list
    pop_superpop_list = np.genfromtxt(file, usecols=(0,1), dtype=str, delimiter=",")
  File "/home/runner/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1779, in genfromtxt
    raise TypeError(
TypeError: fname must be a string, filehandle, list of strings, or generator. Got <class 'NoneType'> instead.

I suspect the problem is that I trained the CNN for too few epochs: all shapes were skipped, so a division by zero (zero successful shapes) occurred:

Imputing originally missing genotypes to most common value.
Evaluating epochs [1, 2, 3]
########################### epoch 1 ###########################
Too few for hull: 1
--  shape: (1, 2): skipping
Too few for hull: 10
--  shape: (1, 2): skipping
Too few for hull: 100
--  shape: (1, 2): skipping
Too few for hull: 101
# [more]
--  shape: (1, 2): skipping
Too few for hull: 97
--  shape: (1, 2): skipping
Too few for hull: 98
--  shape: (1, 2): skipping
Too few for hull: 99
--  shape: (1, 2): skipping
------ hull error : 0.0
------ f1 score with 3NN :0.1807228915662651
richelbilderbeek commented 2 years ago

Also 100 epochs is not sufficient. Trying 1000.

python3 run_gcae.py train --datadir example_tiny --data issue_6_bin --model_id M1  --epochs 1000 --save_interval 1  --train_opts_id ex3  --data_opts_id b_0_4 --pheno_model_id=p1 ; python3 run_gcae.py project --datadir example_tiny --data issue_6_bin --model_id M1 --train_opts_id ex3 --data_opts_id b_0_4 --pheno_model_id=p1 ; python3 run_gcae.py evaluate --metrics "hull_error,f1_score_3" --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.issue_6_bin.p1
richelbilderbeek commented 2 years ago

Hmmm, if I run for 1000 epochs with the HO_superpopulations file ...

python3 run_gcae.py train --datadir example_tiny --data issue_6_bin --model_id M1  --epochs 1000 --save_interval 1  --train_opts_id ex3  --data_opts_id b_0_4 --pheno_model_id=p1 ; python3 run_gcae.py project --datadir example_tiny --data issue_6_bin --model_id M1 --train_opts_id ex3 --data_opts_id b_0_4 --pheno_model_id=p1 ; python3 run_gcae.py evaluate --metrics "hull_error,f1_score_3" --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.issue_6_bin.p1

... I get a hint:

Too few for hull: Uzbek
--  shape: (1, 2): skipping
Too few for hull: Xibo
--  shape: (1, 2): skipping
Too few for hull: Yakut
--  shape: (1, 2): skipping
Too few for hull: Yemeni
--  shape: (1, 2): skipping
Too few for hull: Yi
--  shape: (1, 2): skipping
Too few for hull: Yoruba
--  shape: (1, 2): skipping
Too few for hull: Yukagir
--  shape: (1, 2): skipping
Too few for hull: Zapotec
--  shape: (1, 2): skipping
------ hull error : 0.0
------ f1 score with 3NN :0.1927710843373494

It means that there are too few individuals per population to build a hull. Aha, it is in the README.md:

hull_error: for every population p: define the convex hull created by the points of samples of p. calculate the fraction that other population's samples make up of all the points inside the hull. the hull error is the average of this over populations.
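To make the README definition concrete, here is a minimal sketch of the hull error for 2D coordinates. It is not GenoCAE's implementation: the function names are made up for illustration, points on the hull boundary are counted as inside, and returning 0.0 when every population is skipped is an assumption chosen to match the logged `hull error : 0.0`.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def hull_error(coords, labels):
    """Average, over populations, of the fraction of foreign samples inside each hull."""
    by_pop = defaultdict(list)
    for coord, label in zip(coords, labels):
        by_pop[label].append(tuple(coord))

    fractions = []
    for pop, pts in by_pop.items():
        hull = convex_hull(pts)
        if len(hull) < 3:
            # Too few (or collinear) points to span a hull: skip this population.
            continue
        n_inside = 0
        n_other = 0
        for coord, label in zip(coords, labels):
            if inside(hull, tuple(coord)):
                n_inside += 1
                if label != pop:
                    n_other += 1
        fractions.append(n_other / n_inside)

    # Assumption: report 0.0 when no population had enough points for a hull.
    return sum(fractions) / len(fractions) if fractions else 0.0
```

With one sample per population, as in the `shape: (1, 2)` lines above, every population is skipped and the sketch returns 0.0.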

richelbilderbeek commented 2 years ago

Nope, this has nothing to do with the hull error, as when I run evaluate with ...

python3 run_gcae.py evaluate --metrics "f1_score_3" --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.issue_6_bin.p1

I still get:

------ f1 score with 3NN :0.1927710843373494
Traceback (most recent call last):
  File "/home/richel/miniconda3/lib/python3.9/site-packages/numpy/lib/npyio.py", line 1796, in genfromtxt
    fhd = iter(fid)
TypeError: 'NoneType' object is not iterable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/richel/GitHubs/GenoCAE/run_gcae.py", line 1616, in <module>
    main()
  File "/home/richel/GitHubs/GenoCAE/run_gcae.py", line 1561, in main
    write_f1_scores_to_csv(results_directory, "epoch_{0}".format(epoch), superpopulations_file, f1_scores_by_pop, coords_by_pop)
  File "/home/richel/GitHubs/GenoCAE/utils/visualization.py", line 603, in write_f1_scores_to_csv
    superpop_pop_dict = get_superpop_pop_dict(superpopulations_file)
  File "/home/richel/GitHubs/GenoCAE/utils/data_handler.py", line 869, in get_superpop_pop_dict
    pop_superpop_list = get_pop_superpop_list(pop_superpop_file)
  File "/home/richel/GitHubs/GenoCAE/utils/data_handler.py", line 626, in get_pop_superpop_list
    pop_superpop_list = np.genfromtxt(file, usecols=(0,1), dtype=str, delimiter=",")
  File "/home/richel/miniconda3/lib/python3.9/site-packages/numpy/lib/npyio.py", line 1798, in genfromtxt
    raise TypeError(
TypeError: fname must be a string, filehandle, list of strings, or generator. Got <class 'NoneType'> instead.
richelbilderbeek commented 2 years ago

Ah, with the HO_superpop file it does work:

python3 run_gcae.py evaluate --metrics "f1_score_3" --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.issue_6_bin.p1       --superpops example_tiny/HO_superpopulations

Gives:

...
2021-12-09 17:51:55.681304: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-09 17:51:55.681334: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-12-09 17:51:58.451444: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-12-09 17:51:58.451471: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (N141CU): /proc/driver/nvidia/version does not exist
2021-12-09 17:51:58.451766: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow version 2.7.0

__ arguments __
train : False
datadir : example_tiny/
data : None
model_id : None
train_opts_id : None
data_opts_id : None
save_interval : None
epochs : None
resume_from : None
trainedmodeldir : None
pheno_model_id : None
project : False
superpops : example_tiny/HO_superpopulations
epoch : None
pdata : None
trainedmodelname : ae.M1.ex3.b_0_4.issue_6_bin.p1
plot : False
animate : False
evaluate : True
metrics : f1_score_3

__ data opts __
sparsifies : [0.0, 0.1, 0.2, 0.3, 0.4]
norm_opts : {'flip': False, 'missing_val': -1.0}
norm_mode : genotypewise01
impute_missing : True
validation_split : 0.2

__ train opts __
learning_rate : 0.00032
batch_size : 10
noise_std : 0.0032
n_samples : -1
loss : {'module': 'tf.keras.losses', 'class': 'CategoricalCrossentropy', 'args': {'from_logits': False}}
regularizer : {'reg_factor': 1e-07, 'module': 'tf.keras.regularizers', 'class': 'l2'}
lr_scheme : {'module': 'tf.keras.optimizers.schedules', 'class': 'ExponentialDecay', 'args': {'decay_rate': 0.96, 'decay_steps': 100, 'staircase': False}}


Imputing originally missing genotypes to most common value.
Evaluating epochs
########################### epoch 1 ###########################
------ f1 score with 3NN :0.1927710843373494
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_1.csv
########################### epoch 2 ###########################
------ f1 score with 3NN :0.1686746987951807
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_2.csv
########################### epoch 3 ###########################
------ f1 score with 3NN :0.1927710843373494
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_3.csv
[...]
########################### epoch 998 ###########################
------ f1 score with 3NN :0.1807228915662651
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_998.csv
########################### epoch 999 ###########################
------ f1 score with 3NN :0.1927710843373494
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_999.csv
########################### epoch 1000 ###########################
------ f1 score with 3NN :0.18674698795180722
writing f1 score per pop to /home/richel/GitHubs/GenoCAE/ae_out/ae.M1.ex3.b_0_4.issue_6_bin.p1/issue_6_bin/f1_scores_pops_epoch_1000.csv

richelbilderbeek commented 2 years ago

Cannot do this: you really need that superpop file :-)
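Since the tracebacks above end in `np.genfromtxt` receiving `None`, one way to make the requirement explicit would be to fail early with a readable message. The sketch below is hypothetical and is not GenoCAE code: `read_pop_superpop_pairs` is an invented name, and it uses the standard `csv` module rather than NumPy. It only mirrors what the traceback shows, namely a comma-delimited file whose first two columns are read.

```python
import csv
import os


def read_pop_superpop_pairs(superpops_path):
    """Hypothetical helper: return (population, superpopulation) pairs from a CSV file."""
    if superpops_path is None:
        # This is the situation that produced the TypeError in the tracebacks above.
        raise ValueError(
            "f1 scores per population need a superpopulations file; "
            "pass one with --superpops"
        )
    if not os.path.isfile(superpops_path):
        raise FileNotFoundError("superpopulations file not found: " + superpops_path)

    with open(superpops_path, newline="") as handle:
        # Keep only the first two columns, skipping rows that are too short.
        return [(row[0], row[1]) for row in csv.reader(handle) if len(row) >= 2]
```

Called with `None`, this raises a `ValueError` that names the `--superpops` option instead of surfacing NumPy's `fname must be a string ...` message.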