Closed sophienguyen01 closed 11 months ago
Hi Sophie,
Similar to my reply in https://github.com/google/deepvariant/issues/698#issuecomment-1711046219, it is best to shuffle the TFRecords across all the samples. If you have bad samples, skip those. If you want to see the differences among samples, train on them individually, but that will produce multiple per-sample-biased models. What you really want is one good model that represents all your samples, because that is the model that will be presented to future samples. Thus you want as many good, representative samples as possible to train that model, which is why you shuffle across all of them.
The `make_examples` script only takes one BAM file via the `--reads` parameter, and you don't need to merge multiple BAM files into one, since the shuffling happens afterwards on the generated TFRecords. So the process is roughly as follows:
1) Run `make_examples` on the training set BAMs, and run `make_examples` on the validation set BAMs; both will generate TFRecords.
2) Shuffle the TFRecords for the training set, then separately shuffle the ones for the validation set.
3) Run `model_train` on the shuffled training set data.
4) Run `model_eval` on the shuffled validation set data to evaluate the generated checkpoints; this will produce the `best_checkpoint.txt` and `best_checkpoint.metrics` files.
5) Pick the best model listed in the `best_checkpoint.txt` file.
6) Test the best model with `run_deepvariant` by passing it to the `--customized_model` parameter and setting the `--reads` parameter to the test data BAM file.
7) Benchmark by comparing your VCF with your truth set VCF via `hap.py`, and check the metrics against a baseline appropriate to your study.
8) Then, if you want, you could train/retrain a (new or previous) model by tuning the modeling parameters and/or updating the training data - e.g. to make it flexible enough to capture more variety in the input data - both of which might improve the model under different conditions.
As Maria mentioned previously, training is done on chromosomes 1-19, evaluation on 21-22, and testing on 20. In general, training is done on one set of data, evaluation on another to pick the best model, and finally the best model is tested on the held-out test data.
Let me know if I should expand on anything.
Hope it helps, Paul
Hi @sophienguyen01 ,
if you want to generate examples from multiple samples, you'll need to run `make_examples` separately multiple times. Then you can use the shuffling step to put all the examples from the multiple samples into the same dataset for training.
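Conceptually, the global shuffle looks like the following minimal sketch in plain Python with toy records (in practice you would shuffle the actual TFRecord shards, e.g. with DeepVariant's shuffling script, rather than Python lists):

```python
import random

def global_shuffle(per_sample_examples, num_shards, seed=42):
    """Pool examples from all samples, shuffle them once, then re-shard.

    per_sample_examples: dict mapping sample name -> list of examples.
    Returns a list of `num_shards` lists (the shuffled output shards).
    """
    pooled = [ex for examples in per_sample_examples.values() for ex in examples]
    rng = random.Random(seed)
    rng.shuffle(pooled)  # one shuffle across ALL samples, not per sample
    return [pooled[i::num_shards] for i in range(num_shards)]

# Toy illustration with three "samples" of different sizes (coverages):
samples = {
    "HG001": [("HG001", i) for i in range(5)],
    "HG002": [("HG002", i) for i in range(3)],
    "HG003": [("HG003", i) for i in range(4)],
}
shards = global_shuffle(samples, num_shards=2)
```

The point is that every output shard ends up with a random mix of examples from all samples, which is what the training step should see.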
Thank you all for your suggestions. I went ahead and shuffled across my training examples. I also ran `model_train` and `model_eval` as in the tutorial, but the evaluation result on HG003_chr20 does not look good at all. The F1 score dropped dramatically (0 for one trained model and almost 0 for another).
So far, I kept the same values for the parameters in the `model_train` step:

```
--model_name="inception_v3" \
--number_of_steps=50000 \
--save_interval_secs=300 \
--batch_size=32 \
--learning_rate=0.0005 \
```

I know that changing these values will affect the evaluation results. I went through the DeepVariant paper but couldn't find any information on what parameters were used during training. Which values of the training parameters do you suggest I try?
Hi Sophie,
So tuning a model is both an art and a science. Yes, accuracy can be lower initially, as in the following result described by Pi-Chuan in another post:
Thus no two models share the same training parameters, since the underlying data is not the same and the goals are usually not equal. Optimizing a model is therefore a journey of discovery performed via hyperparameter tuning. For example, Google uses Vizier, but the idea generally falls into one of five camps:
Here is a link to an article that provides a nice summary of them - with another describing them more visually - and a link to another nice article describing what happens during hyperparameter tuning. There are other approaches, but they become niche and, sometimes depending on the data, private. Usually this tuning is performed automatically as shown here, but you can define the search space yourself, as in this simple example - though there are more parameters you can select from:
1) Learning rate usually shrinks as you get closer to optimal accuracy, so that you do not overshoot the local minimum in the hyperplane. If you are starting from an untrained model, you want to learn as much as possible with the default value of 0.064, but the closer you get to optimal, the more you want to reduce it toward something like 0.0005. If, say, the learning rate decreases exponentially with accuracy - meaning you want to tweak the model less as you become more accurate - then it would be something like $learning\_rate = (1-(e^{accuracy-1})^\alpha)/\gamma$, where $\alpha = 5$ and $\gamma = 0.1$, giving a smoothly decaying curve. Then you use that equation (or your own) to update the learning rate with each iteration of model training.
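As a minimal sketch, that schedule can be written directly in Python (the constants $\alpha = 5$ and $\gamma = 0.1$ are just the illustrative values from the formula above, not recommended settings):

```python
import math

def learning_rate(accuracy, alpha=5.0, gamma=0.1):
    """Example learning-rate schedule that decays as accuracy approaches 1.

    Implements lr = (1 - (e^(accuracy-1))^alpha) / gamma.
    """
    return (1.0 - math.exp(accuracy - 1.0) ** alpha) / gamma

# The rate shrinks as the model becomes more accurate,
# and reaches 0 at perfect accuracy:
for acc in (0.5, 0.9, 0.99, 1.0):
    print(f"accuracy={acc:.2f} -> lr={learning_rate(acc):.4f}")
```

In a real training loop you would recompute the rate from the latest evaluation accuracy before each round of training.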
2) For batch size, you can have a discrete range like `batch_sizes = [16, 32, 64]` to select from.
Then for each iteration, you look at the metrics and decide what to tweak, given the model you want to start from. That is, you run through all the batch sizes and see which one performs best; then you fix that one, and go through different learning rates based on the accuracy of the resulting models. If there are other parameters you want to play with, you empirically determine how they interact with the tuning of the model in reaching optimal accuracy.
What does this mean? It means you have to empirically try a lot of combinations, going through many iterations until you find the optimal model representing your data. Again, keep in mind this is generally geared toward diploid germline variant calling - which still requires some tuning - but you would need to play with the tuning more if your training and validation data vary a lot.
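That sequential sweep can be sketched as a simple pair of loops. Here `train_and_evaluate` is a hypothetical stand-in for a real `model_train` + `model_eval` run, with a purely illustrative scoring surface:

```python
import math

def train_and_evaluate(batch_size, learning_rate):
    """Stand-in for a real model_train + model_eval run: returns a mock F1-like
    score. In practice this would launch training and read the metrics file."""
    # Illustrative surface peaking near batch_size=32, learning_rate=5e-4.
    return math.exp(-((math.log2(batch_size) - 5) ** 2)
                    - (math.log10(learning_rate) + 3.3) ** 2)

# Step 1: sweep batch sizes at a fixed learning rate, keep the best.
batch_sizes = [16, 32, 64]
best_bs = max(batch_sizes, key=lambda bs: train_and_evaluate(bs, 5e-4))

# Step 2: with that batch size fixed, sweep learning rates.
learning_rates = [0.064, 0.005, 0.0005]
best_lr = max(learning_rates, key=lambda lr: train_and_evaluate(best_bs, lr))

print(f"best batch_size={best_bs}, best learning_rate={best_lr}")
```

A real sweep would replace the mock function with actual training runs, which is exactly why each "iteration" is expensive and why tools like Vizier automate the search.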
Hope it helps, Paul
Thanks Paul for this very insightful comment,
I went ahead and redid my retraining, and I see improvements in the accuracy score. That helps a lot.
My next step is to use the retrained model on our internal sample. I want to convert the .ckpt to a .pb file. Here is the command I got from the tutorial:

```
sudo docker run \
  -v $PWD:/input \
  -v $PWD:/output \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/freeze_graph \
  --checkpoint /input/model.ckpt \
  --example_info_json /input/model.ckpt.example_info.json \
  --output /output/model.pb
```

What is the `model.ckpt.example_info.json` here? I don't see the JSON file being created when I run `model_train`, which is the step that generates the `.ckpt` file.
Hi Sophie,
This is great news! Yes, usually that is the expected behavior after several rounds of training. Let me answer each aspect separately:
1) Given a trained model, you can specify a custom checkpoint file via `--customized_model`, as shown under the Pick a model section. It is probably easiest to pick the one specified in the `best_checkpoint.txt` file.
2) Regarding `.pb` files, these are serialized binary protocol-buffer versions of the model with additional details, known as TensorFlow's SavedModel format. When performing variant calling, DeepVariant can only use `.ckpt` model files - not `.pb` files - as specified under the Convert model.ckpt to model.pb section. May I ask why you need the ProtoBuf-formatted file?
3) Regarding the `model.ckpt.example_info.json` file, it is used to validate the checkpointed model's JSON file against the JSON file generated for the example files, matching a few specs such as channels and shape (tensor image dimensions), to ensure the data and model match before variant calling. It should have been automatically copied over by the `copy_over_example_info_json()` function. Basically it contains the DeepVariant version, the tensor image shape (rows, columns, channels), and the enumerated channels. There are different ones for different technologies; below is a list of the contents for each instrument:

```
{"version": "1.5.0", "shape": [100, 199, 9], "channels": [1, 2, 3, 4, 5, 6, 7, 9, 10]}
{"version": "1.5.0", "shape": [100, 199, 9], "channels": [1, 2, 3, 4, 5, 6, 7, 9, 10]}
{"version": "1.5.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
{"version": "1.5.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
{"version": "1.5.0", "shape": [100, 221, 6], "channels": [1, 2, 3, 4, 5, 6]}
```

Feel free to generate one, or copy it from the following locations, but it should be based on the model technology you trained on (meaning they have to match exactly):
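As a rough sketch of the kind of check described above (the field names follow the JSON contents shown; this is not DeepVariant's actual validation code):

```python
import json

def compatible(model_info, examples_info):
    """Return True if the model's example_info matches the examples' specs.
    Compares the tensor shape and the enumerated channel list."""
    return (model_info["shape"] == examples_info["shape"]
            and model_info["channels"] == examples_info["channels"])

seven_channel = json.loads('{"version": "1.5.0", "shape": [100, 221, 7], '
                           '"channels": [1, 2, 3, 4, 5, 6, 19]}')
six_channel = {"version": "1.5.0", "shape": [100, 221, 6],
               "channels": [1, 2, 3, 4, 5, 6]}

print(compatible(seven_channel, seven_channel))  # model matches its own examples
print(compatible(seven_channel, six_channel))    # 7-channel model, 6-channel examples
```

A mismatch like the second case is exactly the situation the validation is there to catch before variant calling starts.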
Hope it helps, Paul
Hi @pgrosu ,
Thank you for your answer. This is exactly what I was looking for. I need to convert into a `.pb` file because I'm running DeepVariant using the NVIDIA Parabricks container, which includes DeepVariant version 1.5.0. There is a parameter that takes in a model file, but it has to be in `.pb` format.
Hi Sophie,
I'm not sure these are files you can use with the `--pb-model-file` parameter of ParaBricks, as shown here (in case that might be what you are pursuing):
I think that parameter stands for ParaBricks model files, not ProtoBuf-serialized SavedModel files. For example, if you take one of the files from the compressed file here:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/parabricks_deepvariant_models/files
You will notice from the embedded strings of each file how different they are:
For the `deepvariant.eng` file:

```
$ strings deepvariant.eng | head -n 10
ptrt
<Y&:;
<X%8<
;SCh
fct=
s%>[j|?
8)=ql >
<x6X=Q
*s9=
#&>Q;
$
```

For the `saved_model.pb` file:

```
$ strings saved_model.pb | head -n 10
AddV2
type:
input
reduction_indices"
Tidx
output
keep_dims
bool
Tidx
type
$
```
They do not look like the same type of structured file. Were you planning to use the `--pb-model-file` parameter of ParaBricks with a TensorFlow SavedModel file?
Thanks, Paul
Thanks Paul for pointing this out,
The reason I used Parabricks was to add optional parameters and run DeepVariant in one command. However, based on what you said, I will run the 3 steps (`make_examples`, `call_variants`, and `postprocess_variants`) separately so I can add the optional parameters in the `make_examples` step.
While running the `call_variants` step, it generates a lot of errors. One of the errors points out `input depth must be evenly divisible by filter depth: 6 vs 7`. What does it mean?
I also attached the error log: error.log
@sophienguyen01 I wonder if this is because the number of channels you produced in `make_examples` is inconsistent with the model you used for `call_variants`.
For WGS and WES, the model has 7 channels. So, when you run `make_examples`, you need to add `--channels=insert_size`. If you use the `run_deepvariant` command with WGS or WES as the `--model_type`, this flag is added for you in the make_examples step. However, if you're running `make_examples` separately, you need to remember to specify it yourself.
Hi @pichuan ,
That may be why, because I didn't provide the `--channels` parameter. What is the default value for `insert_size`?
The flag to add is literally `--channels=insert_size`.
The code that added it in run_deepvariant is https://github.com/google/deepvariant/blob/r1.5/scripts/run_deepvariant.py#L245
You can see https://google.github.io/deepvariant/posts/2022-06-09-adding-custom-channels/ for more information.
Closing issue due to inactivity. Feel free to open if you have other questions.
Hi,
I want to create training examples for multiple samples (specifically HG001 to HG007). Each sample has different coverages.
For the command below, do I have to repeat it for each of the .BAM files I have, or can I pass multiple `--reads` values for .BAM files of the same sample at different coverages? What would be a good approach here?

```
( time seq 0 $((N_SHARDS-1)) | \
  parallel --halt 2 --joblog "${LOG_DIR}/log" --res "${LOG_DIR}" \
  sudo nvidia-docker run \
    -v ${HOME}:${HOME} \
    google/deepvariant:"${BIN_VERSION}-gpu" \
    /opt/deepvariant/bin/make_examples \
    --mode training \
    --ref "${REF}" \
    --reads "${BAM_CHR1}" \
    --examples "${OUTPUT_DIR}/training_set.with_label.tfrecord@${N_SHARDS}.gz" \
    --truth_variants "${TRUTH_VCF}" \
    --confident_regions "${TRUTH_BED}" \
    --task {} \
    --regions "'chr1'" \
) >"${LOG_DIR}/training_set.with_label.make_examples.log" 2>&1
```

Also, in the shuffling step, what is the correct way to shuffle training examples: shuffling all examples of each sample file separately (e.g. HG001), or shuffling all examples of all sample files at once?
Thanks