motional / nuplan-devkit

The devkit of the nuPlan dataset.
https://www.nuplan.org

Question for Submitting the ml planner from #306 #308

Closed changliucoding closed 1 year ago

changliucoding commented 1 year ago

Hi, this is a follow-up to the question in #306. I did exactly as you told us in #302, but the submission still failed. What should I put in checkpoint_path? I checked the files in my Docker image, and my model.ckpt is indeed in nuplan-devkit. I don't know what I missed. Thanks!

abbyxxn commented 1 year ago

I have the same problem. My submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:4ea52d5d-51b1-4faf-9331-e89f24ae6b53"} and I get:

Merged stderr:
validation_challenge99.log
2023-05-20 02:12:53,117 : ERROR : Planner initialization failed!
2023-05-20 02:12:58,389 : ERROR : Planner initialization failed!

I ran docker-compose up --build and it succeeded on my local server.

changliucoding commented 1 year ago

I seem to have the same problem and don't know how to fix it.

patk-motional commented 1 year ago

@abbyxxn,

This is the error during initialization:

Traceback (most recent call last):
--
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
planners = build_planners(self._planner_config, None)
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 70, in build_planners
planner = cache.get(name, _build_planner(config, scenario))
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 24, in _build_planner
if is_target_type(planner_cfg, MLPlanner):
File "/nuplan_devkit/nuplan/planning/script/builders/utils/utils_type.py", line 23, in is_target_type
return bool(_locate(cfg._target_) == target_type)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/utils.py", line 577, in _locate
raise ImportError(
ImportError: Encountered error: `No module named 'transformer4planning.submission'` when loading module 'transformer4planning.submission.planner.ControlTFPlanner'
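The ImportError above means the planner's `_target_` could not be imported inside the submission container. One way to catch this before submitting is to try the same kind of lookup in the image's Python environment. The following is a minimal sketch under that assumption; `can_import_target` is a hypothetical helper written for illustration, not part of nuplan-devkit or Hydra:

```python
import importlib


def can_import_target(target: str) -> bool:
    """Return True if a dotted target such as 'pkg.module.ClassName' resolves.

    Hypothetical helper for illustration only; it mimics the kind of lookup
    that failed in the traceback above and is not the devkit's own code.
    """
    module_path, _, attr_name = target.rpartition(".")
    if not module_path:
        # A bare name without a module part cannot be located.
        return False
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # The package is missing from the image or not on the Python path.
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # Standard-library targets are used only so the example runs anywhere.
    print(can_import_target("collections.OrderedDict"))
    print(can_import_target("no_such_package_xyz.planner.MyPlanner"))
```

Running such a check inside the built submission image (rather than on the host) is what matters, since a package installed locally may not have been copied or installed into the container.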

patk-motional commented 1 year ago

@changliucoding,

What is your team name and time of submission? I can look up the detailed logs for you

changliucoding commented 1 year ago

Hi, my team name is changdrive. Thank you very much!

Fan-Yixuan commented 1 year ago

Hi @patk-motional, same problem here. The submitted file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16281:6095e0da-394d-4c89-8e37-159cc30b67fc"}. Could you please find the detailed logs for me? Thanks a lot.

MMz000 commented 1 year ago

Hi @patk-motional, we have also encountered a similar problem. We followed the tutorial and used the COPY command in the Dockerfile to copy the ckpt file. However, when running docker-compose up --build, we received a file-not-found error.

To troubleshoot this, we added an "ls" command to check if the ckpt file is present in the nuplan-devkit folder. The result showed that the file does exist.

It is worth noting that we were able to successfully run the simple planner.

Here are some of our configurations.

Dockerfile & Dockerfile.submission

c8a4707f2822f11bb4db2b3906b36c3e 0f16147d990525546660d61fc3fcf85c

entrypoint_submission.sh & entrypoint_simulation.sh

b0cae972890bf67597b971389af512d3 a41a5638215c6eb87c5cbcc36d7f75d1

ml_planner.yaml

9ccd4d1a70a02b1fe5ac4b7087d1add9

Here is the output of the ls command executed in the entrypoint_simulation.sh script

5b2278d48901d0d132521d578a284003

Following that, the command docker-compose up --build encountered the following error:

6bbe05beb318d0e3ca38121032f7c256

gianmarco-motional commented 1 year ago

@MMz000, the ls command runs in the wrong container (the simulation one). You should not modify that entrypoint anyway, as we will use our own version on our servers.

The error occurs because you are looking for /nuplan-devkit/ub_ours.ckpt, while the file is copied to nuplan_devkit/ub_ours.ckpt (note the hyphen vs. underscore in the devkit name).
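A hyphen/underscore mismatch like this is easy to miss by eye. As an illustration only, the sketch below walks a path component by component and, where a component does not exist, looks for a sibling whose name differs only in `-` vs `_`. The function name `find_existing_variant` is hypothetical and not part of the devkit:

```python
from pathlib import Path
from typing import Optional


def _normalize(name: str) -> str:
    # Treat '-' and '_' as the same character when comparing names.
    return name.replace("-", "_")


def find_existing_variant(path: str) -> Optional[Path]:
    """Return an existing path equal to `path` up to '-'/'_' differences, else None."""
    target = Path(path)
    current = Path(target.anchor) if target.is_absolute() else Path(".")
    parts = target.parts[1:] if target.is_absolute() else target.parts
    for part in parts:
        exact = current / part
        if exact.exists():
            # Prefer the exact name when it is present.
            current = exact
            continue
        if not current.is_dir():
            return None
        matches = [p for p in current.iterdir() if _normalize(p.name) == _normalize(part)]
        if not matches:
            return None
        current = matches[0]
    return current
```

Called with the path from the config, it returns the path that actually exists in the image (if any), which makes the mismatch visible immediately.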

patk-motional commented 1 year ago

@Fan-Yixuan,

There are no detailed logs for you as your container failed to start.

patk-motional commented 1 year ago

@changliucoding,

Your hydra config isn't set up properly:

Traceback (most recent call last):
--
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
planners = build_planners(self._planner_config, None)
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in build_planners
return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in <listcomp>
return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 25, in _build_planner
torch_module_wrapper = build_torch_module_wrapper(planner_cfg.model_config)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 357, in __getattr__
self._format_and_raise(key=key, value=None, cause=e)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
_raise(ex, cause)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 445, in _get_impl
return self._resolve_with_default(
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 58, in _resolve_with_default
raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: planner.ml_planner.model_config
full_key: planner.ml_planner.model_config
object_type=dict
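The MissingMandatoryValue error above is raised when a config key still holds OmegaConf's `???` marker at the moment it is read. As a rough illustration, the standard-library sketch below scans a nested dictionary for that marker and reports the dotted keys. It is a simplified stand-in, not OmegaConf's own logic; `find_missing_keys` and the example paths are hypothetical:

```python
from typing import Any, List, Mapping

MISSING = "???"  # marker OmegaConf uses for a mandatory value that was never set


def find_missing_keys(cfg: Mapping[str, Any], prefix: str = "") -> List[str]:
    """Return the dotted keys whose value is still the '???' marker."""
    missing: List[str] = []
    for key, value in cfg.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested config groups.
            missing.extend(find_missing_keys(value, full_key))
        elif value == MISSING:
            missing.append(full_key)
    return missing


if __name__ == "__main__":
    # Illustrative config shaped like the one in the error message.
    cfg = {
        "planner": {
            "ml_planner": {
                "model_config": "???",
                "checkpoint_path": "/path/to/model.ckpt",
            }
        }
    }
    print(find_missing_keys(cfg))
```

Any key this reports has to be supplied, either in the YAML file or as a command-line override, before the planner is built.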

abbyxxn commented 1 year ago

My submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:92e91991-f365-46b5-9493-acfa2dd57e96"} and I get:

Merged stderr:
validation_challenge99.log
2023-05-24 06:37:08,919 : ERROR : Trajectory computation service failed!
2023-05-24 06:37:08,919 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
2023-05-24 06:37:46,718 : ERROR : Trajectory computation service failed!
2023-05-24 06:37:46,718 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED

I ran docker-compose up --build and it succeeded on my local server. Could I see the detailed error message?

patk-motional commented 1 year ago

Hi @abbyxxn,

I only see logs for the initialization stage. This usually indicates that your planner timed out in the first iteration. Have you profiled your planner locally?

INFO:nuplan.submission.submission_planner:Server starting...
--
INFO:nuplan.submission.submission_planner:Server started!
2023-05-24 06:36:44.757755: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-24 06:36:44.893808: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-24 06:36:47.446118: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-24 06:36:47.446229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-24 06:36:47.446244: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Pretrained GPT nonautofrom /nuplan_devkit/test-loss0.25
INFO:nuplan.submission.challenge_servicers:Initialization request received..
INFO:root:Planner initialized!
count:  1 True
/nuplan_devkit/nuplan/common/maps/nuplan_map/utils.py:413: RuntimeWarning: invalid value encountered in cast
return elements.iloc[np.where(elements[column_label].to_numpy().astype(int) == int(desired_value))]
/opt/conda/envs/nuplan/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Pretrained GPT nonautofrom /nuplan_devkit/test-loss0.25
INFO:nuplan.submission.challenge_servicers:Initialization request received..
INFO:root:Planner initialized!
count:  2 True
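Profiling the planner locally can be as simple as timing each call and flagging the ones that exceed a budget. The sketch below assumes a hypothetical callable standing in for the planner's per-iteration computation; the budget value is a placeholder to be replaced with the challenge's actual limit, which is not stated in this thread:

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def profile_calls(
    fn: Callable[[Any], Any], inputs: Iterable[Any], budget_s: float
) -> Tuple[List[float], List[int]]:
    """Time fn(item) for each item; return all durations and the indices over budget."""
    durations: List[float] = []
    over_budget: List[int] = []
    for index, item in enumerate(inputs):
        start = time.perf_counter()
        fn(item)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        if elapsed > budget_s:
            over_budget.append(index)
    return durations, over_budget


if __name__ == "__main__":
    # time.sleep stands in for a planner step; 0.02 s is an arbitrary budget.
    durations, slow = profile_calls(time.sleep, [0.0, 0.05], budget_s=0.02)
    print(len(durations), slow)
```

It is worth looking at the first call separately, since model loading or other warm-up work can make the first iteration much slower than the rest.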

abbyxxn commented 1 year ago

Thank you for your very useful reply! My new submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:6f8860cd-d1c7-4e80-92b6-5dde3889c687"} and I get:

Merged stderr:
validation_challenge99.log
2023-05-24 17:15:08,181 : ERROR : Trajectory computation service failed!
2023-05-24 17:15:08,181 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
2023-05-24 17:15:51,685 : ERROR : Trajectory computation service failed!
2023-05-24 17:15:51,686 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED

May I know the detailed error message?

gianmarco-motional commented 1 year ago

@abbyxxn do you mean logs from your container? I see many prints like this (this one is the last one; most are around 0.89 in time consumed):

count:  24 True
--
(224, 224, 109) (224, 224, 109) (10, 4) 22
time after ratser build 0.8149843215942383
time after gpt 0.943516731262207
time consumed 1.1793630123138428

tinkei commented 1 year ago

Can I ask about the error logs for my submissions for team NaNny as well? It's an ml_planner using raster_model with the default resnet50 backbone. Both the trained model and the huggingface/timm cache are copied to the image. I keep getting this error without any details that I could act on:

Merged stderr:
validation_challenge99.log
2023-05-25 10:15:49,113 : ERROR : Planner initialization failed!
2023-05-25 10:16:29,764 : ERROR : Planner initialization failed!

I tried the following syntax:


Submitted at May 25, 2023 11:50:58 AM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:271ffe90-3fad-460b-90ed-cd662567e140"}

ml_planner.yaml:

model_config: ???
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner planner.ml_planner.model_config=\${model}

Submitted at May 25, 2023 1:26:01 PM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:6107a367-7691-41f0-a588-13d249104f19"}

ml_planner.yaml:

model_config: ???
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner planner.ml_planner.model_config=raster_model planner.ml_planner.checkpoint_path=/nuplan_devkit/best_model.ckpt

Submitted at May 25, 2023 2:33:31 PM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:fa16eba8-d729-4730-ae7a-685ccb56db4a"}

ml_planner.yaml:

model_config: ${model}
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner

patk-motional commented 1 year ago

Please check this link https://evalai.s3.amazonaws.com/media/submission_files/submission_283754/236ccfc8-a510-4940-96aa-ce3a0a0f8a2a.txt

tinkei commented 1 year ago

Thank you for your prompt response. However, this is an old submission from last night, before I included the huggingface/timm cache in Dockerfile.submission. The submissions I cited above are still failing this afternoon.

patk-motional commented 1 year ago

Traceback (most recent call last):
--
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
planners = build_planners(self._planner_config, None)
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in build_planners
return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in <listcomp>
return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 25, in _build_planner
torch_module_wrapper = build_torch_module_wrapper(planner_cfg.model_config)
File "/nuplan_devkit/nuplan/planning/script/builders/model_builder.py", line 19, in build_torch_module_wrapper
model = instantiate(cfg)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 180, in instantiate
return instantiate_node(config, *args, recursive=_recursive_, convert=_convert_)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 249, in instantiate_node
return _call_target(target, *args, **kwargs)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 64, in _call_target
raise type(e)(
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 62, in _call_target
return target(*args, **kwargs)
File "/nuplan_devkit/nuplan/planning/training/modeling/models/raster_model.py", line 62, in __init__
self._model = timm.create_model(model_name, pretrained=pretrained, num_classes=0, in_chans=num_input_channels)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_factory.py", line 114, in create_model
model = create_fn(
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/resnet.py", line 1276, in resnet50
return _create_resnet('resnet50', pretrained, **dict(model_args, **kwargs))
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/resnet.py", line 547, in _create_resnet
return build_model_with_cfg(ResNet, variant, pretrained, **kwargs)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_builder.py", line 393, in build_model_with_cfg
load_pretrained(
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_builder.py", line 186, in load_pretrained
state_dict = load_state_dict_from_hf(pretrained_loc)
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_hub.py", line 183, in load_state_dict_from_hf
return safetensors.torch.load_file(cached_safe_file, device="cpu")
File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/safetensors/torch.py", line 261, in load_file
result[k] = f.get_tensor(k)
AttributeError: Error instantiating 'nuplan.planning.training.modeling.models.raster_model.RasterModel' : module 'torch' has no attribute 'frombuffer'

This was from your latest submission
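The AttributeError in this traceback comes from safetensors calling `torch.frombuffer`, which the PyTorch installed in the image does not provide. A quick way to check for this kind of mismatch inside the container is to test whether the module exposes the attribute. The helper below is a hypothetical, standard-library-only sketch meant for top-level module names:

```python
import importlib
import importlib.util
from typing import Optional


def module_has_attr(module_name: str, attr: str) -> Optional[bool]:
    """Return None if the module is not installed, else whether it exposes `attr`.

    Intended for top-level module names such as 'torch'.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return hasattr(module, attr)


if __name__ == "__main__":
    # In the submission image one would check ("torch", "frombuffer");
    # the math module is used here only so the example runs anywhere.
    print(module_has_attr("math", "sqrt"))
    print(module_has_attr("math", "frombuffer"))
```

If the check returns False for `torch` and `frombuffer`, the installed PyTorch and the safetensors/timm versions in the image are not compatible with each other.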

tinkei commented 1 year ago

@patk-motional Would you mind sharing the error logs of my two latest submissions? In one of them I downgraded timm, and in the other I didn't even use timm, but somehow they are still failing.

gianmarco-motional commented 1 year ago

@tinkei Can you share your Dockerfile.submission and entrypoint_submission.sh? If you prefer, send me a DM on Slack: https://join.slack.com/t/opendrivelab/shared_invite/zt-1uhny7uci-T5~otGGdwUtGo8L1j0~NUA

gianmarco-motional commented 1 year ago

@tinkei one problem is definitely that you commented out this line in entrypoint_submission.sh: # [ -d "/mnt/data" ] && cp -r /mnt/data/nuplan-v1.1/maps/* $NUPLAN_MAPS_ROOT

sindhu-pr commented 1 year ago

@patk-motional I am getting the exact same error: https://github.com/motional/nuplan-devkit/issues/308#issue-1715459085. My submission details are: {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-18335:6151b08b-04a0-4484-af5a-ed404b9a600d"}. If you could share the detailed logs, it would be helpful for my team.

patk-motional commented 1 year ago

Can you share your submission id?

sindhu-pr commented 1 year ago

284337

Thanks

patk-motional commented 1 year ago

I've pushed to your stderr

sindhu-pr commented 1 year ago

Hi, the following is the error:


Could not override 'planner.ml_planner.checkpoint_path'.
To append to your config use +planner.ml_planner.checkpoint_path=/model.ckpt
Key 'checkpoint_path' is not in struct
    full_key: planner.ml_planner.checkpoint_path
    object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Where do you think we need to make changes to resolve this error? And where should we set HYDRA_FULL_ERROR=1?

In entrypoint_submission.sh, we have the following line:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=<OUR MODEL> planner=ml_planner planner.ml_planner.model_config=\${model} planner.ml_planner.checkpoint_path="${NUPLAN_HOME}/<chkpoint file name>"
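On the HYDRA_FULL_ERROR question: it is an environment variable, so it has to be set for the process that launches the Hydra application (in this setup, that would be wherever run_submission_planner.py is started). The sketch below shows the principle from Python using only the standard library; `run_with_full_hydra_errors` is a hypothetical helper, and the child command here is a stand-in that just echoes the variable:

```python
import os
import subprocess
import sys
from typing import Sequence


def run_with_full_hydra_errors(command: Sequence[str]) -> subprocess.CompletedProcess:
    """Run `command` with HYDRA_FULL_ERROR=1 added to the inherited environment."""
    env = dict(os.environ, HYDRA_FULL_ERROR="1")
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process; a real call would launch the planner script instead.
    result = run_with_full_hydra_errors(
        [sys.executable, "-c", "import os; print(os.environ['HYDRA_FULL_ERROR'])"]
    )
    print(result.stdout.strip())
```

Separately, the Hydra message itself points at the override syntax: the key `checkpoint_path` is not present in the `ml_planner` config being loaded, so a plain override is rejected and Hydra suggests the `+` prefix to append the key.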

tinkei commented 1 year ago

@tinkei one problem is definitely you commenting out this line: # [ -d "/mnt/data" ] && cp -r /mnt/data/nuplan-v1.1/maps/* $NUPLAN_MAPS_ROOT in entrypoint_submission.sh

Thanks! Everything is working now! I had commented it out a few submissions earlier to debug an issue with my local docker-compose, but forgot to revert it afterwards. Nice catch!

tinkei commented 1 year ago

@sindhu-pr ${NUPLAN_HOME} is /nuplan_devkit in Dockerfile.submission, but your hydra config is pointing to /model.ckpt. I guess for some reason ${NUPLAN_HOME} evaluated to an empty string?
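If an empty ${NUPLAN_HOME} is indeed the cause, failing loudly when the variable is unset would surface it right away instead of silently producing /model.ckpt. A minimal sketch of that idea, with a hypothetical helper name and POSIX-style paths assumed because the container runs Linux:

```python
import os
import posixpath
from typing import Mapping, Optional


def checkpoint_path_from_env(filename: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build '<NUPLAN_HOME>/<filename>', raising if NUPLAN_HOME is unset or empty."""
    env = os.environ if env is None else env
    home = env.get("NUPLAN_HOME", "")
    if not home:
        raise RuntimeError(
            f"NUPLAN_HOME is unset or empty; the path would resolve to '/{filename}'"
        )
    return posixpath.join(home, filename)


if __name__ == "__main__":
    # Explicit mapping used so the example does not depend on the real environment.
    print(checkpoint_path_from_env("model.ckpt", {"NUPLAN_HOME": "/nuplan_devkit"}))
```

Printing the resolved path at container start-up gives a quick confirmation that it matches where the checkpoint was copied.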

XZHSTAX commented 1 year ago

Thank you for your very useful reply! I have new submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:6f8860cd-d1c7-4e80-92b6-5dde3889c687"} and meet Merged stderr: validation_challenge99.log 2023-05-24 17:15:08,181 : ERROR : Trajectory computation service failed! 2023-05-24 17:15:08,181 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED 2023-05-24 17:15:51,685 : ERROR : Trajectory computation service failed! 2023-05-24 17:15:51,686 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED

May I know the detailed error message?

I ran into this problem too. I profiled my planner successfully locally but still got this error.

Then I noticed #298. I use nuplan-devkit v1.1, whose Dockerfile.submission is still the old version. I followed #298 to update Dockerfile.submission manually, then resubmitted and got Finished.

So there must be a problem if you use the old version of Dockerfile.submission. You can fix it by following #298.