Convolutional autoencoder for genotype data, as described in [1]
An interactive version of the dimensionality reduction visualization from the paper can be found here.
(examples for linux)
python >= 3.6
python3-dev
pip3
$ cd GenoCAE/
$ pip3 install -r requirements.txt
docker build -t gcae/genocae:build -f docker/build.dockerfile .
$ docker run -it --rm -v ${PWD}:/workspace gcae/genocae:build python3 run_gcae.py --help
If you have a Docker with GPU support.
$ docker run -it --gpus=all --rm -v ${PWD}:/workspace gcae/genocae:build python3 run_gcae.py --help
The training and evaluation of models is wrapped by a command-line interface (CLI)
$ cd GenoCAE/
$ python3 run_gcae.py --help
GenoCAE.
Usage:
run_gcae.py train --datadir=<name> --data=<name> --model_id=<name> --train_opts_id=<name> --data_opts_id=<name> --epochs=<num> [--resume_from=<num> --trainedmodeldir=<name> --patience=<num> --save_interval=<num> --start_saving_from=<num> ]
run_gcae.py project --datadir=<name> [ --data=<name> --model_id=<name> --train_opts_id=<name> --data_opts_id=<name> --superpops=<name> --epoch=<num> --trainedmodeldir=<name> --pdata=<name> --trainedmodelname=<name>]
run_gcae.py plot --datadir=<name> [ --data=<name> --model_id=<name> --train_opts_id=<name> --data_opts_id=<name> --superpops=<name> --epoch=<num> --trainedmodeldir=<name> --pdata=<name> --trainedmodelname=<name>]
run_gcae.py animate --datadir=<name> [ --data=<name> --model_id=<name> --train_opts_id=<name> --data_opts_id=<name> --superpops=<name> --epoch=<num> --trainedmodeldir=<name> --pdata=<name> --trainedmodelname=<name>]
run_gcae.py evaluate --datadir=<name> --metrics=<name> [ --data=<name> --model_id=<name> --train_opts_id=<name> --data_opts_id=<name> --superpops=<name> --epoch=<num> --trainedmodeldir=<name> --pdata=<name> --trainedmodelname=<name>]
Options:
-h --help show this screen
--datadir=<name> directory where sample data is stored. if not absolute: assumed relative to GenoCAE/ directory. DEFAULT: data/
--data=<name> file prefix, not including path, of the data files (EIGENSTRAT of PLINK format)
--trainedmodeldir=<name> base path where to save model training directories. if not absolute: assumed relative to GenoCAE/ directory. DEFAULT: ae_out/
--model_id=<name> model id, corresponding to a file models/model_id.json
--train_opts_id=<name>train options id, corresponding to a file train_opts/train_opts_id.json
--data_opts_id=<name> data options id, corresponding to a file data_opts/data_opts_id.json
--epochs<num> number of epochs to train
--resume_from<num> saved epoch to resume training from. set to -1 for latest saved epoch. DEFAULT: None (don't resume)
--save_interval<num> epoch intervals at which to save state of model. DEFAULT: None (don't save)
--start_saving_from<num> number of epochs to train before starting to save model state. DEFAULT: 0.
--trainedmodelname=<name> name of the model training directory to fetch saved model state from when project/plot/evaluating
--pdata=<name> file prefix, not including path, of data to project/plot/evaluate. if not specified, assumed to be the same the model was trained on.
--epoch<num> epoch at which to project/plot/evaluate data. DEFAULT: all saved epochs
--superpops<name> path+filename of file mapping populations to superpopulations. used to color populations of the same superpopulation in similar colors in plotting. if not absolute path: assumed relative to GenoCAE/ directory.
--metrics=<name> the metric(s) to evaluate, e.g. hull_error of f1 score. can pass a list with multiple metrics, e.g. "f1_score_3,f1_score_5". DEFAULT: f1_score_3
--patience=<num> stop training after this number of epochs without improving lowest validation. DEFAULT: None
The main commands are:
Once projected data has been written to disk, the following commands can be used:
See the Examples section below for examples of running these commands.
Setup requires defining the following (using the CLI options):
Defines the actual samples and genotype data to use. Passed to the CLI with option --data
Accepted data formats are
A small example data set HumanOrigins249_tiny is in example_tiny/ with 249 samples and 9259 SNPs. This can be used for local testing.
These options affect how data is input to the model. Passed to the CLI with option --data_opts, specifying a json file in the directory data_opts/
Example: (data_opts/b_0_4.json)
{ "norm_mode" : "genotypewise01",
"norm_opts" : {"flip": false, "missing_val":-1.0},
"impute_missing" : true,
"validation_split" : 0.2,
"sparsifies" : [0.0, 0.1, 0.2, 0.3, 0.4]
}
norm_mode: what normalization mode to use (genotypewise01, smartPCAstyle or standard)
norm_opts: additional normalization options:
impute_missing: if true, genotypes that are missing in the original data are set to the most frequent genotype per marker. if false, then the genotypes that were originally missing are ignored when calculating the loss and genotype concordance. NOTE if this is set to False, then missing_val should be given a value that cannot occur in the data after normalization, so the model can correctly identify which genotypes were originally missing.
validation_split: fraction of samples to use as validation set
Optional:
genotypewise01: normalize genotypes to range [0-1] by mapping 0,1,2 -> 0.0,0.5,1.0
smartPCAstyle: subtract mean and divide with an estimate of the std of population allele frequency (more info in the EIGENSTRAT paper), results in data that is centered and has close to unit variance
standard: subtract mean and divide with std
This defines a model architecture. Passed to the CLI with option --model_id, specifying a json file in the directory models/. The json file defines what layers the model should have, in what order, convolution, stride, pool size etc.
Example model: (models/M1.json)
{"layers":
[
{"class": "Conv1D", "module":"tf.keras.layers", "args": {"filters":8, "kernel_size":5, "padding":"same", "activation":"elu", "strides":1}},
{"class": "BatchNormalization", "module":"tf.keras.layers", "args": {}},
{"class": "ResidualBlock2", "module":"utils.layers", "args": {"filters":8, "kernel_size":5}},
{"class": "MaxPool1D", "module":"tf.keras.layers", "args": {"pool_size":5, "strides":2, "padding":"same"}},
{"class": "Conv1D", "module":"tf.keras.layers", "args": {"filters":8, "kernel_size":5, "padding":"same", "activation":"elu"}},
{"class": "BatchNormalization", "module":"tf.keras.layers", "args": {}},
{"class": "Flatten", "module":"tf.keras.layers", "args": {}},
{"class": "Dropout", "module":"tf.keras.layers", "args": {"rate":0.01}},
{"class": "Dense", "module":"tf.keras.layers", "args": {"units":75}},
{"class": "Dropout", "module":"tf.keras.layers", "args": {"rate":0.01}},
{"class": "Dense", "module":"tf.keras.layers", "args": {"units":75, "activation":"elu"}},
{"class": "Dense", "module":"tf.keras.layers", "encoding" : true, "args": {"units":2, "name":"encoded"}},
{"class": "Dense", "module":"tf.keras.layers", "args": {"units":75, "activation":"elu"}},
{"class": "Dropout", "module":"tf.keras.layers", "args": {"rate":0.01}},
{"class": "Dense", "module":"tf.keras.layers", "args": {"units":75, "activation":"elu"}},
{"class": "Dropout", "module":"tf.keras.layers", "args": {"rate":0.01}},
{"class": "Dense", "module":"tf.keras.layers", "args": {"units":"ns[1]*8"}},
{"class": "Reshape", "module":"tf.keras.layers", "args": {"target_shape":"(ns[1],8)", "name":"i_msvar"}},
{"class": "Conv1D", "module":"tf.keras.layers", "args": {"filters":8, "kernel_size":5, "padding":"same", "activation":"elu"}},
{"class": "BatchNormalization", "module":"tf.keras.layers", "args": {}},
{"class": "Reshape", "module":"tf.keras.layers", "args": {"target_shape":"(ns[1],1,8)"}},
{"class": "UpSampling2D", "module":"tf.keras.layers", "args": {"size":"(2,1)"}},
{"class": "Reshape", "module":"tf.keras.layers", "args": {"target_shape":"(ns[1]*2,8)"}},
{"class": "ResidualBlock2", "module":"utils.layers", "args": {"filters":8, "kernel_size":5}},
{"class": "Conv1D", "module":"tf.keras.layers", "args": {"filters":8, "kernel_size":5, "padding":"same", "activation":"elu", "name":"nms"}},
{"class": "BatchNormalization", "module":"tf.keras.layers", "args": {}},
{"class": "Conv1D", "module":"tf.keras.layers", "args": {"filters":1, "kernel_size":1, "padding":"same"}},
{"class": "Flatten", "module":"tf.keras.layers", "args": {"name":"logits"}}
]
}
Corresponds to the below architecture:
The layers are ordered top-bottom as they are left-right in the autoencoder.
The middle layer should be given the name "encoded", this indicates which layer is the latent representation, or encoding.
Marker-specific variables: In order to facilitate the learning of global patterns in the input data, the there is the option to add so-called marker-specific variables. These contsist of one variable per marker that is updated during the optimization process, allowing the model to capture marker-specific behavior. The two sets of marker-specific variables are illustrated in the figure in red and green. The red set of variables is also concatenated to the model input at every stage of the training process.
Giving a layer the name "i_msvar" denotes where the marker-specific variable that is red in the figure should be concatenated
Giving a layer the name "nms" denotes where the marker-specific variable that is green in the figure should be concatenated
The variable "ns" is a list that contains the size of the length dimension of the data (the dimension that corresponds to the length of the genetic sequence). When a layer that modifies this (e.g. maxpooling) in the encoder is added, the new length is added to ns, so that it can be used to reconstruct the lengths in the decoder (e.g. when upsamling). The values in ns will thus depend on the length of the input data. For the data HumanOrigins249_tiny which is used in the Examples section below, ns=[9259,4630]. When a layer argument such as "units" or "target_shape" that should be an int is specified as a string, it will be evaluated as a python expression (and should therefore be a valid expression that evaluates to an int).
These options affect the training of the model. Passed to the CLI with option --train_opts, specifying a json file in the directory train_opts/
Example with mean squared error loss: (train_opts/ex1.json)
{
"learning_rate": 0.000275,
"batch_size": 100,
"noise_std": 0.005,
"n_samples": -1,
"loss": {
"module": "tf.keras.losses",
"class": "MeanSquaredError"
},
"regularizer": {
"reg_factor": 5e-08,
"module": "tf.keras.regularizers",
"class": "l2"
},
}
Example with binary cross-entropy loss: (train_opts/ex2.json)
{
"learning_rate": 0.000275,
"batch_size": 100,
"noise_std": 0.01,
"n_samples": -1,
"loss": {
"module": "tf.keras.losses",
"class": "BinaryCrossentropy",
"args": {
"from_logits": true
}
},
"regularizer": {
"reg_factor": 1.5e-07,
"module": "tf.keras.regularizers",
"class": "l2"
}
}
Example with categorical cross-entropy loss and a learning rate scheme: (train_opts/ex3.json)
{
"learning_rate": 3.2e-02,
"batch_size": 100,
"noise_std": 0.0032,
"n_samples": -1,
"loss": {
"module": "tf.keras.losses",
"class": "CategoricalCrossentropy",
"args": {
"from_logits": false}},
"regularizer": {
"reg_factor": 1.0e-07,
"module": "tf.keras.regularizers",
"class": "l2"
},
"lr_scheme": {
"module": "tf.keras.optimizers.schedules",
"class": "ExponentialDecay",
"args": {
"decay_rate": 0.98,
"decay_steps": 100,
"staircase": false}}
}
optional:
if the loss is sigmoid_cross_entropy_with_logits, the normalized genotypes should be in range (0,1) - so normalized using e.g. genotypewise01
if the loss is mean_squared_error, the normalized genotypes can be in any range, can use smartPCAstyle normalization
normalization mode is specified in the data_opts file
The training procedure saves the current state of the model in the directory weights/ in the training directory. Since these files can take up a significant amount of space, there are arguments to control how this is done:
If specified, the model state is saved every save_interval epochs.
If a validation set is used, whenever a minimum validation loss is encountered, the weights are saved, and the weights of the previously saved lowest validation epoch are deleted.
For both of the above: the argument start_saving_from specifies a number of epochs to train before starting to save model state.
When stopping training (either because the specified number of epochs has been reached, or because patience number of epochs have passed without reaching a new lowest validation loss), the current model state is saved.
Command to train a model on the example data :
$ cd GenoCAE/
$ python3 run_gcae.py train --datadir example_tiny/ --data HumanOrigins249_tiny --model_id M1 --epochs 20 --save_interval 2 --train_opts_id ex3 --data_opts_id b_0_4
This creates a model training directory: ae_out/ae.M1.ex3.b_0_4.HumanOrigins249_tiny/ with subdirectories
The following files are also created:
You can install tensorboard to use their suite of web tools for inspecting TensorFlow runs.
Tensorboard can be started using:
$ tensorboard --logdir ae_out/ae.M1.ex3.b_0_4.HumanOrigins249_tiny/
it will be displayed on localhost:6006 in the browser
The saved model weights in a model training directory are used to reload the model at each epoch, and project a given data set (= calcaulate the encoding / latent representation). The entire given data set is projected, no validation set is defined.
The data set to project is specified using the pdata argument. If not specified, the same set as was used for training is assumed.
$ cd GenoCAE/
$ python3 run_gcae.py project --datadir example_tiny/ --data HumanOrigins249_tiny --model_id M1 --train_opts_id ex3 --data_opts_id b_0_4 --superpops example_tiny/HO_superpopulations
This creates a directory named after the projected data containing:
When projecting/plotting/evaluating: the location to look for a trained model can either be specified with the same options given to the train commamd (model_id, data_opts_id, etc.) OR by giving the entire directory name with the --trainedmodelname argument. e.g.
$ cd GenoCAE/
$ python3 run_gcae.py project --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.HumanOrigins249_tiny --superpops example_tiny/HO_superpopulations
The encoded data per epoch that is stored in the file encoded_data.h5 created by the project command is plotted (generating the same plots as the project command).
$ cd GenoCAE/
$ python3 run_gcae.py plot --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.HumanOrigins249_tiny --superpops example_tiny/HO_superpopulations
An animation visualizing the dimensionality reduction over the saved epochs encoded_data.h5 is created.
$ cd GenoCAE/
$ python3 run_gcae.py animate --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.HumanOrigins249_tiny --superpops example_tiny/HO_superpopulations
The encoded data per epoch that is stored in the file encoded_data.h5 created by the project command is used to calculate various metrics. The metrics are passed using the --metrics option. Currently implemented metrics are hull_error and f1_score_k, both of which are a measure of how well the encoding clusters populations.
$ cd GenoCAE/
$ python3 run_gcae.py evaluate --metrics "hull_error,f1_score_3" --datadir example_tiny/ --trainedmodelname ae.M1.ex3.b_0_4.HumanOrigins249_tiny --superpops example_tiny/HO_superpopulations
For each metric, a csv and pdf plot of the metric (averaged over all populations) per epoch is created.
For f1-scores, a csv file with the per-population f1-score is also created, one file per epoch.
metrics:
hull_error: for every population p: define the convex hull created by the points of samples of p. calculate the fraction that other population's samples make up of all the points inside the hull. the hull error is the average of this over populations.
f1_score_k: define a k-NN model for population classification based on the dimensionality reduction. get the f1-score of the classification model, micro-averaged over populations.
Dimensionality reduction:
Legend:
See tips for some rules of thumb for model architecture settings and hyperparameter tuning that might be a good place to start for using GCAE on your own data.