kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License

Training questions #123

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi,

Thank you very much for the detailed tutorial on how to train PEPPER-Margin-DeepVariant. In general, I thought everything was clear and well written. I have finished training PEPPER-SNP and PEPPER-HP, and I am currently making my way through the DeepVariant training tutorial. I have one question about this tutorial:

In the DeepVariant training tutorial, Step 5 (DeepVariant training), the input of make_examples is the unphased BAM files. Shouldn't it be the phased BAM files we created in Step 3 instead?

I also have a few minor comments, feel free to ignore them :)

  1. In the PEPPER-SNP training tutorial, Step 4 (Evaluating a trained model), the given sample is HG003 (-s HG003) while the evaluation is on chr20 of HG002. The same applies to the PEPPER-HP training tutorial.
  2. The computation of downsample_fraction for samtools view -s will be incorrect if the estimated coverage of the BAM file is larger than 100x (rare, but it can happen) or if the required downsample fraction is lower than 10% (a fraction that low would probably be useless anyway, I imagine); see the sketch after this list.
  3. One more suggestion: not all steps require a GPU, so it would be practical to note for every step whether a GPU machine is required. I think this is useful when working on a compute cluster where the number of GPU machines is limited while the number of non-GPU machines is "large".
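To make point 2 concrete, here is a minimal sketch (in Python; the function name, coverages, seed, and file names are purely illustrative and not taken from the tutorial) of computing the samtools view -s argument in a way that works for any estimated coverage:

```python
# Hypothetical sketch, not the tutorial's code: build the argument for
# `samtools view -s SEED.FRACTION` from an estimated and a target coverage.
def downsample_arg(estimated_coverage: float, target_coverage: float, seed: int = 42) -> str:
    if target_coverage >= estimated_coverage:
        raise ValueError("BAM is already at or below the target coverage")
    fraction = target_coverage / estimated_coverage  # e.g. 30 / 120 -> 0.25
    # samtools uses the integer part as the RNG seed and the decimal part as the
    # fraction of reads to keep, so "42.2500" means seed 42, keep ~25% of reads.
    return f"{seed + fraction:.4f}"

# Example: downsample a 120x BAM to roughly 30x (paths are placeholders).
print(f"samtools view -s {downsample_arg(120, 30)} -b input.bam > downsampled_30x.bam")
```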

Thank you! Guillaume

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

Thank you so much! These are great points. I'll work on these this week to fix the documentation.

For your question:

In the DeepVariant training tutorial, Step 5 (DeepVariant training), the input of make_examples is the unphased BAM files. Shouldn't it be the phased BAM files we created in Step 3 instead?

Yes, you are right, it is the phased BAM that should be used for training.

GuillaumeHolley commented 2 years ago

@kishwarshafin

Since this issue is still open, I'll take the opportunity to ask another question :) I am at the step where I want to shuffle my examples for the DeepVariant training. Unfortunately, I am running out of memory regardless of the input (training or test set). I tried two machines: one with 2 GPUs but "only" 128 GB of RAM, the other with no GPUs but 300 GB of RAM. Both ran out of memory. It seems related to this issue, but using the cloud for this is not an option for me. For the models currently shipped with PEPPER, did you run this step locally or in the cloud as well?

Thanks, Guillaume

GuillaumeHolley commented 2 years ago

I have managed to perform the shuffling on the test set by using a non-GPU machine with 700 GB of RAM (peak RAM usage was 605 GB). I will try the same with the training set, but given that the input files are much larger, I am not too hopeful.

kishwarshafin commented 2 years ago

@GuillaumeHolley , hm, yep, I think you have a bit too much data for this to be easy locally. Please let us know if you are able to shuffle the data locally.

GuillaumeHolley commented 2 years ago

@kishwarshafin I haven't been able to shuffle the training data. The problem is that the script loads all the TFRecords into memory in order to shuffle them; there is no disk-streaming option. Once read into memory (uncompressed), the TFRecords use a lot more RAM than their compressed on-disk counterparts, by a factor of 100 or so. Since the cloud is not an option for me, I am trying to rewrite the script to use multiple reads from disk rather than a single one. We will see how it goes.
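For reference, the general approach I am aiming for looks roughly like the sketch below (this is only the idea, not my actual script; the shard count, file names, and compression settings are assumptions): a first pass streams every record from disk and scatters it into one of many small temporary shards, and a second pass shuffles each shard in memory, so peak RAM is proportional to the largest shard rather than the full dataset.

```python
# Rough sketch of a two-pass, disk-based TFRecord shuffle (not the actual
# TFrecordShuffler code; paths and parameters are illustrative).
import glob
import random
import tensorflow as tf

def shuffle_tfrecords(input_glob: str, output_path: str, num_shards: int = 256, seed: int = 0) -> None:
    rng = random.Random(seed)
    # Pass 1: stream records from disk and append each one to a random shard.
    writers = [tf.io.TFRecordWriter(f"shard_{i:04d}.tfrecord.gz", options="GZIP")
               for i in range(num_shards)]
    for path in sorted(glob.glob(input_glob)):
        for record in tf.data.TFRecordDataset(path, compression_type="GZIP"):
            writers[rng.randrange(num_shards)].write(record.numpy())
    for writer in writers:
        writer.close()
    # Pass 2: each shard is small enough to shuffle in memory, then concatenate.
    with tf.io.TFRecordWriter(output_path, options="GZIP") as out:
        for shard in sorted(glob.glob("shard_*.tfrecord.gz")):
            records = [r.numpy() for r in
                       tf.data.TFRecordDataset(shard, compression_type="GZIP")]
            rng.shuffle(records)
            for record in records:
                out.write(record)

# Example call (file names are placeholders):
# shuffle_tfrecords("make_examples.tfrecord-*.gz", "training_set.shuffled.tfrecord.gz")
```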

kishwarshafin commented 2 years ago

@GuillaumeHolley , that's a great solution. If you get it working and are willing to share the code, please let me know and I'll update our documentation so it points to your script.

GuillaumeHolley commented 2 years ago

Sure, sounds great. Fair warning though: I am not a great Python programmer and I don't want to spend too much time on this, so I am just aiming for something that "works"; both the algorithm and the code can be greatly improved. Right now, I have something that takes about as much time as the Beam version but uses 80 GB of RAM instead of 605 GB on my test set. I am trying it on my training set; the goal is to have it run with less than 300 GB of RAM in under 24 hours.

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

I think this is my last question regarding the training :) In the DeepVariant training tutorial, Section "Train model", towards the end it says "It is highly recommended to use --start_from_checkpoint option as that reduces the total consumed time for convergence." Would you still recommend this if the data I am training on is Illumina-corrected ONT data (corrected with Ratatosk)?

kishwarshafin commented 2 years ago

@GuillaumeHolley , yes, I strongly recommend that you start from one of our ONT models.

GuillaumeHolley commented 2 years ago

Thanks @kishwarshafin, training is in progress.

I have created a GitHub repository, TFrecordShuffler, for my script that shuffles TFRecords locally, trading more time for much less memory. Shuffling 125 GB of records took 46 hours (wall-clock and CPU) using 150 GB of RAM.

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

I have now completed the DeepVariant training and it all works great, so I am closing this issue. Thank you again for your work and help.

One last suggestion for when you edit the docs: the line "# Put the step number instead of **** so we can keep track of the performance of each model" is inserted in the wrong command block. Right now it is in the final hap.py evaluation command block, while it should be in the one before it, the run_pepper_margin_deepvariant call_variant command block.