@GuillaumeHolley,
Thank you so much! These are great points. I'll work on them this week to fix the documentation.
For your question:
In the DeepVariant training tutorial, Step 5 (DeepVariant training), the input of make_examples is the unphased BAM files. Shouldn't it be the phased BAM files we created on Step 3 instead?
Yes, you are right, the phased BAM should be used for training.
@kishwarshafin
I am taking the opportunity, while this issue is open, to ask another question :) I am at the step where I want to shuffle my examples for the DeepVariant training. Unfortunately, I am running out of memory regardless of the input (training or test set). I tried two machines: one with 2 GPUs but "only" 128 GB of RAM, the other with no GPUs but 300 GB of RAM. Both run out of memory. It seems related to this issue, but using any cloud for this is not an option for me. For the models currently in place in PEPPER, did you run this locally or in the cloud as well?
Thanks, Guillaume
I have managed to perform the shuffling on the test set by using a non-GPU machine with 700 GB of RAM (peak usage was 605 GB). I will try the same with the training set, but given that the input files are much larger, I am not too hopeful.
@GuillaumeHolley, hm, yep, I think you have a bit too much data for the shuffling to be easy locally. Please let us know if you are able to shuffle the data locally.
@kishwarshafin I haven't been able to shuffle the training data. The problem is that the script loads all the TFRecords into memory in order to shuffle them; there is no disk-streaming option. Once read (uncompressed) into memory, the TFRecords use far more RAM than their compressed counterparts on disk, by a factor of 100 or so. Since the cloud is not an option for me, I am rewriting the script to make multiple passes over the disk instead of a single read into memory. We will see how it goes.
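In case it helps anyone reading along, here is a minimal sketch of that idea: scatter the records into random buckets on disk in a first pass, then shuffle one bucket at a time in memory. The file names, bucket count, and compression settings are placeholders of mine, not the actual script.

```python
import glob
import random

import tensorflow as tf

INPUT_PATTERN = "examples/*.tfrecord.gz"  # assumed input layout
N_BUCKETS = 64                            # tune so one bucket fits in RAM
OPTIONS = tf.io.TFRecordOptions(compression_type="GZIP")

# Pass 1: stream every record once, appending it to a randomly chosen bucket.
writers = [tf.io.TFRecordWriter(f"bucket_{i:03d}.tfrecord.gz", OPTIONS)
           for i in range(N_BUCKETS)]
for path in glob.glob(INPUT_PATTERN):
    for record in tf.data.TFRecordDataset(path, compression_type="GZIP"):
        writers[random.randrange(N_BUCKETS)].write(record.numpy())
for w in writers:
    w.close()

# Pass 2: load one bucket at a time, shuffle it in memory, append to output.
# Random bucket assignment plus an in-bucket shuffle yields a uniformly
# random permutation of the records, but only one bucket is ever in RAM.
with tf.io.TFRecordWriter("shuffled.tfrecord.gz", OPTIONS) as out:
    for i in range(N_BUCKETS):
        bucket = [r.numpy() for r in tf.data.TFRecordDataset(
            f"bucket_{i:03d}.tfrecord.gz", compression_type="GZIP")]
        random.shuffle(bucket)
        for record in bucket:
            out.write(record)
```

With N_BUCKETS buckets, peak memory is roughly 1/N_BUCKETS of the uncompressed data, at the cost of reading and writing everything twice.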
@GuillaumeHolley, that's a great solution. If you get it working and are willing to share the code, please let me know and I'll update our documentation to point to your script.
Sure, sounds great. Fair warning though: I am not a great Python programmer and I don't want to spend too much time on this, so I am aiming for something that just "works"; the algorithm and code can be greatly improved. Right now, I have something that takes as much time as the Beam version but uses 80 GB of RAM versus 605 GB on my test set. I am trying it on my training set; the goal is to have it working in under 24 h using less than 300 GB of RAM.
Hi @kishwarshafin,
I think this is my last question regarding the training :) In the DeepVariant training tutorial, Section "Train model", towards the end it says: "It is highly recommended to use --start_from_checkpoint option as that reduces the total consumed time for convergence." Would you still recommend this if the data I am training on are Illumina-corrected ONT data (correction with Ratatosk)?
@GuillaumeHolley , yes, strongly recommend that you start from one of our ONT models.
Thanks @kishwarshafin, training is in progress.
I have created a GitHub repository, TFrecordShuffler, for my script that shuffles TFRecords locally using more time but much less memory. Shuffling 125 GB of records took 46 h (wall-clock and CPU) and 150 GB of RAM.
Hi @kishwarshafin,
I have now completed the DeepVariant process and it all works great, so I am closing this issue. Thank you again for your work and help.
Maybe one last suggestion for when you edit the docs: the line `# Put the step number instead of **** so we can keep track of the performance of each model` is inserted in the wrong command block. Right now it is in the final `hap.py` evaluation command block, while it should be in the one before it, the `run_pepper_margin_deepvariant call_variant` command block.
Hi,
Thank you very much for the detailed tutorial on how to train PEPPER-Margin-DeepVariant. In general, I thought everything was clear and well written. I am finished with training PEPPER-SNP and PEPPER-HP. Currently, I am making my way down the DeepVariant training tutorial. I have one question about this tutorial:

In the DeepVariant training tutorial, Step 5 (DeepVariant training), the input of `make_examples` is the unphased BAM files. Shouldn't it be the phased BAM files we created on Step 3 instead?

I also have a few minor comments, feel free to ignore them :)
- In the PEPPER-SNP training tutorial, the training uses HG003 (`-s HG003`) while the evaluation is on chr20 of HG002. Same thing for PEPPER-HP training.
- The `downsample_fraction` for `samtools view -s` will be incorrect if the estimated coverage of the BAM file is larger than 100 (rare but can happen) or if the required downsampled fraction is lower than 10 (lower than 10 would be useless I imagine); see the sketch after this list.
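For illustration, here is a hypothetical way to compute the `-s` argument that handles both edge cases (coverage above 100x and fractions below 0.10); the function name and the example values are mine, not taken from the tutorial:

```python
def downsample_arg(estimated_coverage: float, target_coverage: float, seed: int = 0) -> str:
    """Build the FLOAT argument for `samtools view -s`: the integer part
    seeds the RNG and the fractional part is the proportion of reads kept."""
    fraction = target_coverage / estimated_coverage
    if not 0.0 < fraction < 1.0:
        raise ValueError("BAM is already at or below the target coverage")
    # format(...) gives e.g. "0.0735"; drop the leading "0" and prepend the
    # seed, so all four digits survive even when the fraction is below 0.10.
    return f"{seed}{format(fraction, '.4f')[1:]}"

# Downsampling an ~85x BAM to a ladder of target coverages (values illustrative):
for target in (10, 30, 50, 70):
    print(f"samtools view -s {downsample_arg(85.0, target)} -b in.bam > {target}x.bam")
```

Thank you!
Guillaume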