google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

training on multiple samples #706

Closed · sophienguyen01 closed this 11 months ago

sophienguyen01 commented 1 year ago

Hi,

I want to create training examples for multiple samples (specifically HG001 to HG007). Each sample has different coverages.

For the command below, do I have to repeat it for each .BAM file I have, or can I pass multiple --reads .BAM files of the same sample with different coverages? What would be a good approach here?

```bash
( time seq 0 $((N_SHARDS-1)) | \
  parallel --halt 2 --joblog "${LOG_DIR}/log" --res "${LOG_DIR}" \
    sudo nvidia-docker run \
      -v ${HOME}:${HOME} \
      google/deepvariant:"${BIN_VERSION}-gpu" \
      /opt/deepvariant/bin/make_examples \
      --mode training \
      --ref "${REF}" \
      --reads "${BAM_CHR1}" \
      --examples "${OUTPUT_DIR}/training_set.with_label.tfrecord@${N_SHARDS}.gz" \
      --truth_variants "${TRUTH_VCF}" \
      --confident_regions "${TRUTH_BED}" \
      --task {} \
      --regions "'chr1'" \
) >"${LOG_DIR}/training_set.with_label.make_examples.log" 2>&1
```

Also, in the shuffling step, what is the correct way to shuffle training examples: shuffling the examples of each sample file separately (e.g. HG001), or shuffling the examples of all sample files together?

Thanks

pgrosu commented 1 year ago

Hi Sophie,

Similarly to my reply in https://github.com/google/deepvariant/issues/698#issuecomment-1711046219, it is best to shuffle all the TFRecords across all the samples. If you have bad samples, skip those. If you want to see the differences among samples, train them individually, but that will generate multiple per-sample-biased models. What you really want is one good representative model of all your samples, as that is the model that will be presented to future samples. Thus you want as many good representative samples as possible to train that model, which is why you want to shuffle across all your samples.

The make_examples script only takes one BAM file via the --reads parameter, and you don't need to merge multiple BAM files into one, as the shuffling happens afterwards on the generated TFRecords. So the process is roughly as follows:

1) Run make_examples on the training set BAMs, and run make_examples on the validation set BAMs -- both of which will generate TFRecords.
2) Shuffle the TFRecords for the training set, then shuffle the ones for the validation set separately (see the sketch after this list).
3) Run model_train on the shuffled training set data.
4) Run model_eval on the shuffled validation set data to evaluate the generated checkpoints, which will produce best_checkpoint.txt and best_checkpoint.metrics files.
5) Pick the best model listed in the best_checkpoint.txt file.
6) Test the best model with run_deepvariant by providing it to the --customized_model parameter, and setting the --reads parameter to the test data BAM file.
7) Benchmark by comparing your VCF with your truth set VCF via hap.py, and check the metrics against a baseline appropriate to your study.
8) Then, if you want, you could train/retrain a (new or previous) model by tuning the modeling parameters and/or updating the training data - i.e. if you want to make it more flexible to capture more variety in the input data - both of which might improve the model under different conditions.
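As a rough illustration of step 2, DeepVariant ships a shuffle_tfrecords_beam.py tool (used in the training case study); a minimal sketch of shuffling the TFRecords of several samples into one training dataset might look like this - the glob pattern, output paths, and dataset name are placeholders you would adapt:

```bash
# Hedged sketch: shuffle TFRecords from all samples (HG001..HG007) together.
# Flags follow the DeepVariant training case study; paths are placeholders.
time python3 tools/shuffle_tfrecords_beam.py \
  --input_pattern_list="${OUTPUT_DIR}/HG00?.training_set.with_label.tfrecord-?????-of-000??.gz" \
  --output_pattern_prefix="${OUTPUT_DIR}/training_set.with_label.shuffled" \
  --output_dataset_name="HG001_to_HG007" \
  --output_dataset_config_pbtxt="${OUTPUT_DIR}/training_set.dataset_config.pbtxt" \
  --runner=DirectRunner
```

The key point is that --input_pattern_list matches the TFRecords of every sample at once, so the shuffle mixes examples across samples rather than within each one.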

As Maria mentioned previously, training is done on chromosomes 1-19, evaluation on 21-22, and testing on 20. In general, all training is done on one set of data, evaluation on another to pick the best model, and the best model is finally tested on the held-out test data.
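For the benchmarking in step 7, a minimal hap.py sketch on the held-out test chromosome could look like the following - the Docker image and all file paths are assumptions you would replace with your own:

```bash
# Hedged sketch: compare calls on the test chromosome (chr20) against truth.
sudo docker run -v ${PWD}:/data jmcdani20/hap.py:v0.3.12 \
  /opt/hap.py/bin/hap.py \
    /data/HG003_truth.vcf.gz \
    /data/test_calls.vcf.gz \
    -f /data/HG003_confident.bed \
    -r /data/reference.fa \
    -o /data/happy_output \
    -l chr20
```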

Let me know if I should expand on anything.

Hope it helps, Paul

pichuan commented 1 year ago

Hi @sophienguyen01 ,

If you want to generate examples from multiple samples, you'll need to run make_examples separately multiple times. Then you can use the shuffling step to put all the examples from the multiple samples into the same dataset for training.
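For instance, a minimal sketch of running make_examples once per sample might look like this - one BAM per sample is assumed, and BAM_DIR, TRUTH_DIR, and the per-sample file naming are placeholders:

```bash
# Hedged sketch: one make_examples run per sample; the per-sample outputs
# are later shuffled together into a single training dataset.
for SAMPLE in HG001 HG002 HG003 HG004 HG005 HG006 HG007; do
  time seq 0 $((N_SHARDS-1)) | \
    parallel --halt 2 --joblog "${LOG_DIR}/${SAMPLE}.log" \
      sudo nvidia-docker run \
        -v ${HOME}:${HOME} \
        google/deepvariant:"${BIN_VERSION}-gpu" \
        /opt/deepvariant/bin/make_examples \
        --mode training \
        --ref "${REF}" \
        --reads "${BAM_DIR}/${SAMPLE}.bam" \
        --examples "${OUTPUT_DIR}/${SAMPLE}.training_set.with_label.tfrecord@${N_SHARDS}.gz" \
        --truth_variants "${TRUTH_DIR}/${SAMPLE}.vcf.gz" \
        --confident_regions "${TRUTH_DIR}/${SAMPLE}.bed" \
        --task {} \
        --regions "'chr1'"
done
```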

sophienguyen01 commented 1 year ago

Thank you all for your suggestions. I went ahead and shuffled across my training examples. I also ran model_train and model_eval as in the tutorial, but the evaluation result on HG003_chr20 does not look good at all. The F1 score dropped dramatically (0 for one trained model and almost 0 for another).

So far, I kept the same values for the parameters in the model_train step:

```
--model_name="inception_v3" \
--number_of_steps=50000 \
--save_interval_secs=300 \
--batch_size=32 \
--learning_rate=0.0005 \
```

I know that changing these values will affect the evaluation results. I went through the DeepVariant paper but couldn't find any information on which parameters were used during training. Which values of the training parameters do you suggest I try?

pgrosu commented 1 year ago

Hi Sophie,

So tuning a model is both an art and a science. Yes, accuracy can be lower initially, as in the following result described by Pi-Chuan in another post:

[image: evaluation result from the referenced post]

Thus no two models share the same training parameters, since the underlying data is not the same and the goals are usually not equal. The approach for optimizing a model is therefore a journey of discovery performed via hyperparameter tuning. For example, Google uses Vizier, but the idea falls into one of five general camps:

Here is a link to an article that provides a nice summary of them - with another describing them more visually - and a link to another nice article describing what happens during hyperparameter tuning. There are other ways, but they become niche and, depending on the data, sometimes private. Usually this tuning is performed automatically as shown here, but you can generate the search space yourself as in this simple example - though there are more parameters you can select from:

1) The learning rate usually changes as you get closer to optimal accuracy, so that you do not overshoot the local minimum in the hyperplane. If you are starting from an untrained model, you want to learn as much as possible with the default value of 0.064, but the closer you get to optimal, the more you want to shrink it toward something like 0.0005. If, let's say, the learning rate decreases exponentially with accuracy - meaning you want to tweak the model less as you become more accurate - then it would be something like $\text{learning\_rate} = (1-(e^{\text{accuracy}-1})^\alpha)/\gamma$, where $\alpha = 5$ and $\gamma = 0.1$, resulting in a chart like this:

[image: plot of the learning-rate schedule above]

Then you use that equation (or your own) to update the learning rate with each iteration of model training.
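As a quick numeric check of that schedule (my own sketch, not DeepVariant code), you can evaluate the formula with awk:

```bash
# lr(accuracy) = (1 - (e^(accuracy-1))^alpha) / gamma, with alpha=5, gamma=0.1
lr() { awk -v acc="$1" 'BEGIN { a = 5; g = 0.1; print (1 - exp(acc - 1)^a) / g }'; }
lr 0.90   # ~3.93  -> still far from optimal, learn aggressively
lr 0.99   # ~0.49  -> close to optimal, tweak gently
```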

2) For the batch size, you can select from a discrete range like batch_sizes = [16, 32, 64].

Then for each iteration, you look at the metrics and select what to tweak, given the model you want to start from. That is, you run through all the batch sizes and see which one performs best. Then you keep that one and go through different learning rates based on the accuracy of the resulting models. If you have other parameters you want to play with, then you empirically determine how they interact with the tuning of the model for reaching optimal accuracy.

What does this mean? It means you have to empirically try a lot of combinations, going through many iterations until you find the optimal model representing your data (a sketch of such a sweep follows below). Again, keep in mind that this is generally geared towards diploid germline variant calling - which still requires some tuning - but you would need to play with the tuning more if your training and validation data vary a lot.
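For illustration, a minimal grid-search sketch over batch size and learning rate might look like the following - the model_train flags mirror the ones you quoted earlier in this thread, while ${OUTPUT_DIR}, ${TRAINING_DIR}, and the value grids are assumptions you would set yourself:

```bash
# Hedged sketch: sweep batch size and learning rate, one model_train run each.
for bs in 16 32 64; do
  for lr in 0.064 0.008 0.0005; do
    sudo docker run -v ${HOME}:${HOME} google/deepvariant:"${BIN_VERSION}-gpu" \
      /opt/deepvariant/bin/model_train \
        --dataset_config_pbtxt="${OUTPUT_DIR}/training_set.dataset_config.pbtxt" \
        --train_dir="${TRAINING_DIR}/bs${bs}_lr${lr}" \
        --model_name="inception_v3" \
        --number_of_steps=50000 \
        --save_interval_secs=300 \
        --batch_size=${bs} \
        --learning_rate=${lr}
  done
done
# Then run model_eval on each ${TRAINING_DIR}/bs*_lr* directory and compare
# the best_checkpoint.metrics files to pick the winning combination.
```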

Hope it helps, Paul

sophienguyen01 commented 12 months ago

Thanks Paul for this very insightful comment,

I went ahead and redid my retraining, and I see improvements in the accuracy score. That helps a lot.

Now my next step is to use the retrained model to run on our internal sample. I want to convert the .ckpt to a .pb file. Here is the command I got from the tutorial:

```bash
sudo docker run \
  -v $PWD:/input \
  -v $PWD:/output \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/freeze_graph \
  --checkpoint /input/model.ckpt \
  --example_info_json /input/model.ckpt.example_info.json \
  --output /output/model.pb
```

What is the model.ckpt.example_info.json here? I don't see the JSON file being created when I run model_train, which is the step that generates the .ckpt files.

pgrosu commented 12 months ago

Hi Sophie,

This is great news! Yes, usually that is the expected behavior after several rounds of training. Let me answer each aspect separately:

1) So given a trained model, you can specify a custom checkpoint file via --customized_model, as shown under the Pick a model section. It would probably be easiest to pick the one that is specified in the best_checkpoint.txt file.
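For example, a minimal sketch of doing that - the paths, BIN_VERSION, and the WGS model type are assumptions you would adapt:

```bash
# Hedged sketch: call variants on the test BAM with the best retrained checkpoint.
BEST_CKPT=$(cat "${TRAINING_DIR}/best_checkpoint.txt")
sudo docker run -v ${HOME}:${HOME} google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --customized_model="${BEST_CKPT}" \
    --ref="${REF}" \
    --reads="${TEST_BAM}" \
    --output_vcf="${OUTPUT_DIR}/test_calls.vcf.gz" \
    --num_shards=$(nproc)
```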

2) Regarding .pb files: these are serialized binary ProtoBuf versions of the model with additional details, known as TensorFlow's SavedModel format. When performing variant calling, DeepVariant can only use .ckpt model files - not .pb files - as noted under the Convert model.ckpt to model.pb section. May I ask why you might need the ProtoBuf formatted file?

3) Regarding the model.ckpt.example_info.json file: it is used to validate the checkpointed model's JSON file against the JSON file generated for the examples files, matching a few specs such as channels and shape (tensor image dimensions), to ensure that data and model match before performing variant calling. It should have been automatically copied over by the copy_over_example_info_json() function. Basically it contains the version of DeepVariant, the tensor image shape (rows, columns, channels), and the enumerated channels. There are different ones for different technologies, and below is a list of the contents for each instrument:

ONT R10.4
{"version": "1.5.0", "shape": [100, 199, 9], "channels": [1, 2, 3, 4, 5, 6, 7, 9, 10]}
PacBio
{"version": "1.5.0", "shape": [100, 199, 9], "channels": [1, 2, 3, 4, 5, 6, 7, 9, 10]}
WES
{"version": "1.5.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
WGS
{"version": "1.5.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
Hybrid PacBio/Illumina
{"version": "1.5.0", "shape": [100, 221, 6], "channels": [1, 2, 3, 4, 5, 6]}

Feel free to generate one, or copy it from the following locations (see the sketch after this list), but it must be based on the model technology you trained on (meaning the specs have to match exactly):

ONT R10.4

https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-ont_r104/model.ckpt.example_info.json

PacBio

https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-pacbio_standard/model.ckpt.example_info.json

WES

https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-wes_standard/model.ckpt.example_info.json

WGS

https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-wgs_standard/model.ckpt.example_info.json

Hybrid PacBio/Illumina

https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-hybrid_standard/model.ckpt.example_info.json
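For example, a minimal sketch of copying the WGS one next to a retrained checkpoint - the checkpoint name model.ckpt-50000 is a placeholder, and the file must be renamed to match your actual checkpoint prefix:

```bash
# Hedged sketch: fetch the WGS example_info.json and name it after your checkpoint.
curl -o "${TRAINING_DIR}/model.ckpt-50000.example_info.json" \
  "https://storage.googleapis.com/deepvariant/models/DeepVariant/1.5.0/DeepVariant-inception_v3-1.5.0+data-wgs_standard/model.ckpt.example_info.json"
```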

Hope it helps, Paul

sophienguyen01 commented 12 months ago

Hi @pgrosu ,

Thank you for your answer. This is exactly what I was looking for. I need to convert to a .pb file because I'm running DeepVariant using the NVIDIA Parabricks container, which also includes DeepVariant version 1.5.0. There is a parameter that takes a model file, but it has to be in .pb format.

pgrosu commented 12 months ago

Hi Sophie,

I'm not sure these are files you can use with the --pb-model-file parameter of ParaBricks, as shown here (in case that might be what you are pursuing):

https://docs.nvidia.com/clara/parabricks/4.0.0/documentation/tooldocs/man_deepvariant.html#models-for-additional-gpus

I think that parameter stands for ParaBricks model files, not ProtoBuf serialized SavedModel files. For example, if you take one of the files from the compressed archive here:

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/parabricks_deepvariant_models/files

You will notice from the embedded strings of each file how different they are:

The ParaBricks deepvariant.eng file

```
$ strings deepvariant.eng | head -n 10
ptrt
<Y&:;
<X%8<
;SCh
fct=
s%>[j|?
8)=ql   >
<x6X=Q
*s9=
#&>Q;
$
```

A TensorFlow saved_model.pb file

```
$ strings saved_model.pb | head -n 10
AddV2
type:
input
reduction_indices"
Tidx
output
        keep_dims
bool
Tidx
type
$
```

They do not look like the same type of structured file. Were you planning to use the --pb-model-file parameter of ParaBricks with a TensorFlow SavedModel file?

Thanks, Paul

sophienguyen01 commented 12 months ago

Thanks Paul for pointing this out,

The reason I used Parabricks was to add optional parameters and run DeepVariant in one command. However, based on what you said, I will run the 3 steps (make_examples, call_variants, and postprocess_variants) separately so I can add optional parameters in the make_examples step.

While running the call_variants step, I get a lot of errors. One of the errors points out that input depth must be evenly divisible by filter depth: 6 vs 7. What does this mean?

I also attached the error log (error.log).

pichuan commented 12 months ago

@sophienguyen01 I wonder if this is because the number of channels you produced in make_examples is inconsistent with the model you used for call_variants.

For WGS and WES, the model has 7 channels. So, when you run make_examples, you need to add --channels=insert_size. If you use the run_deepvariant command with WGS or WES as the --model_type, this is added for you directly in the make_examples step. However, if you're running make_examples separately, you need to remember to specify it on your own, as sketched below.
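A minimal sketch of what that might look like in a standalone make_examples call - the paths and shard setup are placeholders:

```bash
# Hedged sketch: standalone make_examples in calling mode with the extra
# insert_size channel, so the examples match a 7-channel WGS/WES model.
/opt/deepvariant/bin/make_examples \
  --mode calling \
  --ref "${REF}" \
  --reads "${BAM}" \
  --examples "${OUTPUT_DIR}/examples.tfrecord@${N_SHARDS}.gz" \
  --channels=insert_size \
  --task {}
```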

sophienguyen01 commented 12 months ago

Hi @pichuan ,

That may be why, because I didn't provide the --channels parameter. What is the default value for insert_size?

pichuan commented 12 months ago

The flag to add is literally --channels=insert_size.

The code that added it in run_deepvariant is https://github.com/google/deepvariant/blob/r1.5/scripts/run_deepvariant.py#L245

You can see https://google.github.io/deepvariant/posts/2022-06-09-adding-custom-channels/ for more information.

pichuan commented 11 months ago

Closing issue due to inactivity. Feel free to open if you have other questions.