Error (exit code 1): Inference failed

shuishida commented 2 years ago

Hello - I am trying to upload this behavioural-cloning-baseline without any modifications to AICrowd to see if it works, but I keep getting this error "inference failed" without further error descriptions.

I have followed the instructions in https://github.com/minerllabs/basalt_2022_competition_submission_template/ but changed the GitHub repository to this one instead of the submission template. When I tried the same with the submission template it worked, so this probably has something to do with the differences between the submission template and this repo.

Could you direct me please to how I can diagnose this problem further?

Miffyli commented 2 years ago

Hey. Did you train and upload the trained models as part of the submission? To submit this baseline solution, you need to train the models locally, add them to the git repository, upload them to the server and then do the submission.

shuishida commented 2 years ago

Ohh, that explains it! Thank you.

shuishida commented 2 years ago

Do I understand correctly though, that during the actual competition, we are not allowed to do this? (the trained models take up 280MB each, which is way over the 30MB limit, so we have to do the training on the competition server right?) How can I submit this behavioural cloning baseline like how I would submit in the actual competition?

Miffyli commented 2 years ago

Ah, so you need to (and can) upload a big model as part of your submission: this is run on the server right away and the result is uploaded to the leaderboard. Then, during the training phase (after submissions close), we will remove all big files and presubmitted models, run the training code instead to train the models, and test those models to see if the resulting behaviour is similar to what it was when you originally submitted.

shuishida commented 2 years ago

I see! Thank you for the clarification!

shuishida commented 2 years ago

Wait - I can't push these large files via git because they exceed GitHub's file size limit of 100MB. What should I do?

Miffyli commented 2 years ago

You need to use git LFS for that. See these instructions.
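Before setting up LFS, it can help to see which files in the repository are actually over the limit. The snippet below is only a rough sketch, not part of the baseline: the helper name `find_large_files` and the constant `GITHUB_LIMIT_BYTES` are made up for illustration, and the 100MB figure is the limit mentioned above. Files it reports are candidates for `git lfs track` before committing.

```python
import os

# Assumption: the 100MB per-file limit mentioned in this thread.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root=".", limit=GITHUB_LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files larger than `limit`.

    Hypothetical helper for illustration; it skips the .git directory
    and symbolic links.
    """
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit:
                large.append((path, size))
    return sorted(large)


if __name__ == "__main__":
    for path, size in find_large_files():
        print(f"{size / 1024 ** 2:8.1f} MB  {path}")
```

Run from the repository root, this prints one line per oversized file; the trained models of roughly 280MB each mentioned earlier would show up here.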

shuishida commented 2 years ago

Hello again! I was able to upload the files via LFS, and the BuildVillage challenge succeeded, but the other tasks failed. I saw that there was an update to this behavioural cloning baseline repo yesterday (where there was a change in the model used for the BuildVillage task) so I pulled that change, and that made all tasks fail. Judging from this, I was speculating that the 2x.model is not available in the data directory of the server, while the foundation-model-1x.model is. Is this the case?

Do I understand correctly that I don't need to git add these VPT models, since they would be provided in the evaluation server?

Related to this, can we use data/VPT_models/... in the code, or are we supposed to do os.path.join(os.environ.get('MINERL_DATA_ROOT'), "VPT_models", "...") ? In the examples given, the data directory seems to be hardcoded as data/VPT_models/... but in the README it looks like the data directory on the server is at a different path, given as MINERL_DATA_ROOT so I was confused.

shuishida commented 2 years ago

By the way, I tried replacing all instances of 2x.model with foundation-model-1x.model and the inference worked:)

shuishida commented 2 years ago

Ah - I guess the failure was because I trained my models a couple of days ago when the train.py was using the foundation-model-1x.model as a base, but since then there was an update and now it is assumed that the weights are for 2x.model but I still had weights for the foundation model. :P

shuishida commented 2 years ago

So I guess the only question I have is:

Can we use data/VPT_models/... in the code, or are we supposed to do os.path.join(os.environ.get('MINERL_DATA_ROOT'), "VPT_models", "...") ? In the examples given, the data directory seems to be hardcoded as data/VPT_models/... but in the README it looks like the data directory on the server is at a different path, given as MINERL_DATA_ROOT so I was confused.

shuishida commented 2 years ago

Oh! And upload failed with the error message Upload: Error (exit code 1): open /mainctrfs/outputs/score.json: no such file or directory

Miffyli commented 2 years ago

Yup the code was updated! Sorry for the confusion ^^. I should have maybe defaulted it to the original behaviour. My mistake.

> Can we use data/VPT_models/... in the code

I would recommend using the MINERL_DATA_ROOT variable: it allows us on the backend side to change the data location if for some reason we need to. Buuuut generally you can assume data is the location where things are stored.
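One way to follow this advice while keeping local runs working is to read MINERL_DATA_ROOT and fall back to data when it is not set. This is only a sketch: the helper name `vpt_model_path` is hypothetical and not part of the baseline code. Note that os.path.join(os.environ.get('MINERL_DATA_ROOT'), ...) raises a TypeError when the variable is unset, because os.environ.get then returns None.

```python
import os


def vpt_model_path(filename, default_root="data"):
    """Build the path to a file under VPT_models.

    Hypothetical helper: uses MINERL_DATA_ROOT when it is set and
    non-empty, otherwise falls back to `default_root` ("data").
    """
    root = os.environ.get("MINERL_DATA_ROOT") or default_root
    return os.path.join(root, "VPT_models", filename)


if __name__ == "__main__":
    print(vpt_model_path("foundation-model-1x.model"))
```

With the variable unset this resolves to the same data/VPT_models/... location the examples hardcode, so the same code should work both locally and on a server that sets a different root.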

> Upload: Error (exit code 1): open /mainctrfs/outputs/score.json: no such file or directory

Hmm, where/when did this happen? I do not recognize the error.

shuishida commented 2 years ago

Thank you so much for your responses - really appreciate all your hard work!!

This specific error happened after the inference stage and during the upload stage. At some point it said upload in progress and I was able to see the green circles under the "Evaluation Status", but then it failed with this error and the green circles also disappeared.

Miffyli commented 2 years ago

@shuishida Ah, this might have been a random crash which occurred as we were tuning the system... Feel free to try submitting again, although for a debug submission, it did pass; it just failed at the final internal steps 🙃 . If you can share a link to the submission issue page I can dig deeper into things.

(PS: we generally answer faster on our Discord server, but I will continue to monitor these github pages!)

minerllabs / basalt-2022-behavioural-cloning-baseline

Error (exit code 1): Inference failed #4