google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Unclear instructions to run deepvariant on a cluster #474

Closed HamiltonG closed 3 years ago

HamiltonG commented 3 years ago

Hi there,

I've been trying to figure out how to actually run deepvariant in a cluster environment, but thus far the instructions seem a little cryptic to me. Is there perhaps a step-by-step guide to running deepvariant on a cluster, for instance one with a PBS scheduler?

MariaNattestad commented 3 years ago

Hi @HamiltonG

The one-step script whose usage is shown in https://github.com/google/deepvariant#how-to-run-deepvariant will work on a cluster. Just note that giving it many threads (something like 64) will help it run faster. Our case study metrics are from runs with 64 CPU cores and no GPU, so those numbers should give you an idea of whether that works for your purposes.

For the simple run_deepvariant case, if Docker isn't available to you on your cluster, the same container and commands can be used with Singularity. I think this is what most people do when running on a cluster.
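As a rough illustration of the one-step case, here is a minimal sketch of a PBS job script that wraps run_deepvariant in Singularity. It assumes PBS Pro-style directives; the walltime, DATA_DIR, SIF and the file names are placeholders rather than values from this thread, and the run_or_print helper is only there so that the script prints the command unless RUN=1 is set.

```shell
#!/bin/sh
#PBS -N deepvariant
#PBS -l select=1:ncpus=64
#PBS -l walltime=24:00:00
# Sketch only. Assumptions: PBS Pro-style directives; the walltime and
# every path below are placeholders to adapt to your cluster.

NCPUS="${NCPUS:-64}"                        # threads given to run_deepvariant
DATA_DIR="${DATA_DIR:-/path/to/data}"       # hypothetical directory with inputs and outputs
SIF="${SIF:-${DATA_DIR}/deepvariant.sif}"   # hypothetical Singularity image path

# Print the command by default; execute it only when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

run_deepvariant_job() {
  # -B makes DATA_DIR visible inside the container under the same path.
  run_or_print singularity run -B "${DATA_DIR}:${DATA_DIR}" "${SIF}" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="${DATA_DIR}/reference.fa" \
    --reads="${DATA_DIR}/sample.bam" \
    --output_vcf="${DATA_DIR}/sample.vcf.gz" \
    --num_shards="${NCPUS}"
}

run_deepvariant_job
```

Running the script as-is only prints the command so it can be inspected first; submitting it with RUN=1 set would execute it. All inputs and outputs are given as absolute paths under the bound directory so that the container can see them.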

If you really want to optimize a process to run DeepVariant many times, it can be worth running the 3 stages separately and giving them different resources because make_examples wants many CPUs, call_variants runs faster on GPUs, and postprocess_variants really just needs 1 CPU. The external solutions do variations of this plus their own special sauce.
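One possible way to give the three stages different resources under PBS is to submit them as separate jobs chained with dependencies. The sketch below assumes PBS Pro-style "select" syntax; the ncpus/ngpus values for call_variants and the per-stage job scripts (make_examples.pbs and so on) are hypothetical and would have to be written for your cluster. It prints the qsub commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: one PBS job per DeepVariant stage, each waiting on the previous one.
# Assumptions: PBS Pro-style resource requests; the <stage>.pbs job scripts
# are hypothetical and are not part of DeepVariant.

# Map each stage to a resource request.
stage_request() {
  case "$1" in
    make_examples)        echo "select=1:ncpus=64" ;;         # wants many CPUs
    call_variants)        echo "select=1:ncpus=8:ngpus=1" ;;  # faster on a GPU
    postprocess_variants) echo "select=1:ncpus=1" ;;          # one CPU is enough
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

submit_chain() {
  prev=""
  for stage in make_examples call_variants postprocess_variants; do
    set -- qsub -l "$(stage_request "$stage")"
    if [ -n "$prev" ]; then
      # Start only after the previous stage finished successfully.
      set -- "$@" -W "depend=afterok:${prev}"
    fi
    set -- "$@" "${stage}.pbs"
    if [ "${RUN:-0}" = "1" ]; then
      prev="$("$@")"             # qsub prints the id of the new job
    else
      echo "$@"                  # dry run: only show the command
      prev="<${stage}_job_id>"   # placeholder id for the printed chain
    fi
  done
}

submit_chain
```

The resource names (for example ngpus) differ between sites and schedulers, so the strings in stage_request are the part most likely to need changing.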

I hope that helps answer your question,
Maria

HamiltonG commented 3 years ago

Hi Maria,

Thank you for your suggestions. I am getting closer to running but I have not quite succeeded yet.

Here is where I am at the moment:

Let's say my 'deepvariant_v1.0.0.sif' file is sitting in path_a, my reference sequence in path_b, and my PacBio BAM in path_c.

Could you advise on the singularity command to execute this job?

Below is what I've tried, but I must be missing or misunderstanding a few key elements.

module load chpc/singularity

singularity run -B /mnt/lustre3p/groups/CBBI0843:/mnt/lustre3p/groups/CBBI0843 /mnt/lustre3p/groups/CBBI0843/deepvariant_v1.0.0.sif \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=PACBIO --ref="References/Panu3.0_X_Y_Mito.fa" \
  --reads=/mnt/lustre3p/groups/CBBI0843/IB_Sequel_IIe_data/r64187e_20210614_132743/3_C01/m64187e_210617_061402.reads.bam \
  --output_vcf=/mnt/lustre3p/users/hganesan/210614_Cell3_MT18_output.vcf.gz \
  --output_gvcf=/mnt/lustre3p/users/hganesan/210614_Cell3_MT18_output.g.vcf.gz \

Thank you for your insights.

Kind regards,

MariaNattestad commented 3 years ago

It looks like you're not mounting any directories in your Singularity command.

See the FAQ for how to debug that: https://github.com/google/deepvariant/blob/r1.1/docs/FAQ.md#why-cant-it-find-one-of-the-input-files-eg-could-not-open -- "Why can't it find one of the input files? E.g., "Could not open""

If that doesn't work, can you include the error messages too?
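One way to narrow down a "Could not open" error is to compare each path in the command with the directories passed to -B. The following is a minimal sketch of such a check, using the paths from the command quoted in this thread; the is_bound helper is hypothetical and only does string matching. Singularity may also mount the home or working directory automatically depending on site configuration, so a flagged path is something to verify, not a guaranteed failure.

```shell
#!/bin/sh
# Sketch: flag paths that are relative or that lie outside every -B directory.
# is_bound is a hypothetical helper; it compares strings and does not touch
# the filesystem or query Singularity.

# Usage: is_bound PATH BIND_DIR...
is_bound() {
  path="$1"; shift
  case "$path" in
    /*) ;;            # absolute path: compare against the bind directories
    *) return 1 ;;    # relative path: depends on the container's working directory
  esac
  for dir in "$@"; do
    case "$path" in
      "${dir%/}"/*) return 0 ;;
    esac
  done
  return 1
}

BIND=/mnt/lustre3p/groups/CBBI0843

for p in \
  "References/Panu3.0_X_Y_Mito.fa" \
  "/mnt/lustre3p/groups/CBBI0843/IB_Sequel_IIe_data/r64187e_20210614_132743/3_C01/m64187e_210617_061402.reads.bam" \
  "/mnt/lustre3p/users/hganesan/210614_Cell3_MT18_output.vcf.gz"
do
  if is_bound "$p" "$BIND"; then
    echo "ok     $p"
  else
    echo "check  $p"
  fi
done
```

With the single bind directory above, the relative --ref path and the output path under /mnt/lustre3p/users are flagged for checking, while the BAM path is covered. Passing additional directories to is_bound mirrors adding further -B options to the singularity command.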

MariaNattestad commented 3 years ago

I'll close this since we're continuing the conversation over email.