Checkpoint "Model files do not exist" when testing custom model

helizabeth1103 commented 3 months ago

Hello, I trained a customized model, and am now trying to test it. However, when I try to run it, it says that the model files in the checkpoint do not exist.

Here is the command I tried to run:

    module load apptainer

    apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/run_deepvariant \
      --model_type WGS \
      --customized_model "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/2fullindividualmodel/checkpoints/ckpt-14902" \
      --ref "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/Bactrocera_dorsalis_rearing_male_mt_chr_unpl.fasta" \
      --reads "${filesdir}_mapped/${sample}.bam" \
      --output_vcf "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/modeltestout/2fullindividualmodeltest/${sample}.vcf.gz"

Here are the contents of the checkpoints folder for this training:

    drwxr-s--- 3 haley.arnold proj-pbarc 4.0K Jun 29 01:06 ..
    drwxr-s--- 3 haley.arnold proj-pbarc 4.0K Jul  1 22:49 .
    drwxr-s--- 3 haley.arnold proj-pbarc 4.0K Jul 21 23:11 ckpt-14902
    -rw-r----- 1 haley.arnold proj-pbarc  54K Aug  6 22:51 ckpt-7451.index
    -rw-r----- 1 haley.arnold proj-pbarc 250M Aug  6 22:51 ckpt-7451.data-00000-of-00001
    -rw-r----- 1 haley.arnold proj-pbarc  54K Aug  6 22:51 ckpt-14902.index
    -rw-r----- 1 haley.arnold proj-pbarc 250M Aug  6 22:51 ckpt-14902.data-00000-of-00001
    -rw-r----- 1 haley.arnold proj-pbarc  266 Aug  6 22:51 checkpoint

And finally, here are the contents of ckpt-14902:

    total 7.6M
    drwxr-s--- 3 haley.arnold proj-pbarc 4.0K Jul  1 22:49 ..
    drwxr-s--- 2 haley.arnold proj-pbarc 4.0K Jul  1 22:49 variables
    drwxr-s--- 3 haley.arnold proj-pbarc 4.0K Jul 21 23:11 .
    -rw-r----- 1 haley.arnold proj-pbarc 6.9M Aug  6 22:51 saved_model.pb
    -rw-r----- 1 haley.arnold proj-pbarc 677K Aug  6 22:51 keras_metadata.pb
    -rw-r----- 1 haley.arnold proj-pbarc   55 Aug  6 22:51 fingerprint.pb
    -rw-r----- 1 haley.arnold proj-pbarc   80 Aug  6 22:51 example_info.json

Here is the error log file:

    2024-08-09 20:05:25.101938: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    I0809 20:05:40.093672 139993880950592 run_deepvariant.py:519] Re-using the directory for intermediate results in /tmp/tmp4wzl_5p3
    Traceback (most recent call last):
      File "/opt/deepvariant/bin/run_deepvariant.py", line 722, in <module>
        app.run(main)
      File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
        sys.exit(main(argv))
      File "/opt/deepvariant/bin/run_deepvariant.py", line 693, in main
        commands_logfiles = create_all_commands_and_logfiles(intermediate_results_dir)
      File "/opt/deepvariant/bin/run_deepvariant.py", line 572, in create_all_commands_and_logfiles
        check_flags()
      File "/opt/deepvariant/bin/run_deepvariant.py", line 544, in check_flags
        raise RuntimeError(
    RuntimeError: The model files /90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/2fullindividualmodel/checkpoints/ckpt-14902* do not exist. Potentially relevant issue: https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md#why-cant-it-find-one-of-the-input-files-eg-could-not-open

Can someone please help me figure out what's going on? The link provided shows a different set of files than the ones I have. Am I missing files? Is something upstream not functioning properly? I have trained models before, with the same kinds of output, and was able to test them. What am I missing?

Thank you for your help!

Best, Haley Arnold

pichuan commented 3 months ago

Hi @helizabeth1103, the logic for this check is in https://github.com/google/deepvariant/blob/r1.6.1/scripts/run_deepvariant.py#L529-L559

Can you double check that you have this file:

/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/2fullindividualmodel/checkpoints/ckpt-14902/saved_model.pb?

If you have that file, then this should be true:

    use_saved_model = tf.io.gfile.exists(
        _CUSTOMIZED_MODEL.value
    ) and tf.io.gfile.exists(f'{_CUSTOMIZED_MODEL.value}/saved_model.pb')

And then:

    if use_saved_model:
      logging.info('Using saved model: %s', str(use_saved_model))

You should then see the "Using saved model" message in the log.
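The condition quoted above can be reproduced outside of run_deepvariant to see which branch a given path would take. The following is a minimal sketch, not the DeepVariant implementation: it uses os.path in place of tf.io.gfile, the helper name detect_model_format is hypothetical, and the checkpoint test (both the .index and .data-00000-of-00001 files present) is an assumption based on the file names in the directory listing earlier in this thread.

```python
import os


def detect_model_format(customized_model):
    """Classify a --customized_model path (hypothetical helper, not part of DeepVariant).

    Returns 'saved_model', 'checkpoint', or 'missing'.
    """
    # Same condition as the quoted snippet, with os.path standing in for tf.io.gfile.
    if os.path.exists(customized_model) and os.path.exists(
        os.path.join(customized_model, 'saved_model.pb')
    ):
        return 'saved_model'
    # Assumption: a checkpoint prefix is usable when both of these files exist.
    suffixes = ('.index', '.data-00000-of-00001')
    if all(os.path.exists(customized_model + suffix) for suffix in suffixes):
        return 'checkpoint'
    return 'missing'
```

Run it in the same environment as run_deepvariant so that it sees the same filesystem. A result of 'missing' for a path that exists on the host would suggest the path is not visible from where the check runs.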

kishwarshafin commented 3 months ago

@helizabeth1103, closing this due to inactivity. Please feel free to reopen if you need further help. It looks like you have both checkpoints in:

/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/2fullindividualmodel/checkpoints/

and a saved model. We are just trying to understand which one you are trying to use. Please reply with the outputs so we can understand the issue better.
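One way to collect the requested outputs is to list everything that matches the model path followed by a wildcard, which is the pattern the error message reports as missing. This is only a sketch under assumptions: describe_model_path is a hypothetical helper, it uses only the Python standard library, and it is not part of DeepVariant.

```python
import glob
import os
from typing import List


def describe_model_path(customized_model: str) -> List[str]:
    """Describe every file or directory matching '<customized_model>*' (hypothetical helper)."""
    lines = []
    # glob.escape keeps special characters in the path itself from acting as wildcards.
    for path in sorted(glob.glob(glob.escape(customized_model) + '*')):
        kind = 'dir' if os.path.isdir(path) else 'file'
        readable = os.access(path, os.R_OK)
        lines.append(f'{kind}\t{path}\treadable={readable}')
        # For a readable directory (a possible saved model), also list its entries.
        if kind == 'dir' and readable:
            for name in sorted(os.listdir(path)):
                lines.append(f'  contains\t{name}')
    if not lines:
        lines.append(f'nothing matches {customized_model}*')
    return lines
```

Printing '\n'.join(describe_model_path(path)) for the value passed to --customized_model, from inside the same container, gives a compact report that can be pasted into a reply.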

Checkpoint "Model files do not exist" when testing custom model #866