trackmania-rl / tmrl

Reinforcement Learning for real-time applications - host of the TrackMania Roborace League
https://pypi.org/project/tmrl
MIT License

Facing issues while running `python -m tmrl --test` #73

Closed Harish-ioc closed 6 months ago

Harish-ioc commented 1 year ago
C:\Windows\System32>python -m tmrl --test
INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:root:Namespace(server=False, trainer=False, worker=False, test=True, benchmark=False, record_reward=False, check_env=False, no_wandb=False, config={})
INFO:root:11/06/23 16:05:52 server IP: 127.0.0.1
C:\Users\haris\anaconda3\Lib\site-packages\gymnasium\core.py:311: UserWarning: WARN: env.default_action to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.default_action` for environment variables or `env.get_wrapper_attr('default_action')` that will search the reminding wrappers.
  logger.warn(
Exception in thread Thread-2 (__client_thread):
Traceback (most recent call last):
  File "C:\Users\haris\anaconda3\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\haris\anaconda3\Lib\threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\custom\utils\tools.py", line 41, in __client_thread
    s.connect((self._host, self._port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\__main__.py", line 82, in <module>
    main(arguments)
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\__main__.py", line 41, in main
    rw.run_episodes(10000)
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\networking.py", line 648, in run_episodes
    self.run_episode(max_samples_per_episode, train=train)
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\networking.py", line 663, in run_episode
    obs, info = self.reset(collect_samples=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\networking.py", line 561, in reset
    new_obs, info = self.env.reset()
                    ^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\gymnasium\core.py", line 467, in reset
    return self.env.reset(seed=seed, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\gymnasium\core.py", line 515, in reset
    obs, info = self.env.reset(seed=seed, options=options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\gymnasium\wrappers\order_enforcing.py", line 61, in reset
    return self.env.reset(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\rtgym\envs\real_time_env.py", line 514, in reset
    elt, info = self.interface.reset(seed=seed, options=options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\custom\custom_gym_interfaces.py", line 157, in reset
    data, img = self.grab_data_and_img()
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\custom\custom_gym_interfaces.py", line 134, in grab_data_and_img
    data = self.client.retrieve_data()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\haris\anaconda3\Lib\site-packages\tmrl\custom\utils\tools.py", line 72, in retrieve_data
    assert t_now - t_start < timeout, f"OpenPlanet stopped sending data since more than {timeout}s."
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: OpenPlanet stopped sending data since more than 10.0s.
yannbouteiller commented 1 year ago

Hi, hmm, it looks like your OpenPlanet stopped communicating with tmrl for more than 10 seconds for some reason. When this happens, the environment throws an exception to avoid corrupting the replay buffer with meaningless samples in case OpenPlanet does eventually respond after more than 10 seconds.

In which situation did you encounter this exception?

yannbouteiller commented 11 months ago

Closing for inactivity, please feel free to reopen if you encounter a similar issue.

PorkDevMode commented 8 months ago

Hey, I have literally the exact same issue. I tried running TrackMania and cmd as administrator.

C:\Windows\system32>python -m tmrl --test
INFO:root:03/16/24 18:50:41 server IP: 127.0.0.1
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python38\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\custom\utils\tools.py", line 41, in __client_thread
    s.connect((self._host, self._port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\__main__.py", line 84, in <module>
    main(arguments)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\__main__.py", line 43, in main
    rw.run_episodes(10000)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\networking.py", line 670, in run_episodes
    self.run_episode(max_samples_per_episode, train=train)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\networking.py", line 688, in run_episode
    obs, info = self.reset(collect_samples=False)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\networking.py", line 571, in reset
    new_obs, info = self.env.reset()
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\gymnasium\core.py", line 467, in reset
    return self.env.reset(seed=seed, options=options)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\gymnasium\wrappers\order_enforcing.py", line 61, in reset
    return self.env.reset(**kwargs)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\rtgym\envs\real_time_env.py", line 514, in reset
    elt, info = self.interface.reset(seed=seed, options=options)
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\custom\custom_gym_interfaces.py", line 148, in reset
    data, img = self.grab_data_and_img()
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\custom\custom_gym_interfaces.py", line 125, in grab_data_and_img
    data = self.client.retrieve_data()
  File "C:\Users\Jojo\AppData\Roaming\Python\Python38\site-packages\tmrl\custom\utils\tools.py", line 72, in retrieve_data
    assert t_now - t_start < timeout, f"OpenPlanet stopped sending data since more than {timeout}s."
AssertionError: OpenPlanet stopped sending data since more than 10.0s.

yannbouteiller commented 8 months ago

Hi @PorkDevMode, does this happen after a while, or are you unable to run the AI at all?

I see this in your traceback:

ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

This seems to indicate that the localhost TCP connection with OpenPlanet could not be established.

PorkDevMode commented 8 months ago

nope just happens every time

yannbouteiller commented 8 months ago

Did you double-check that the OpenPlanet script is running properly? If it is, some other app is probably using port 9000. Sadly, at the moment there is no way of customizing this port other than changing the OpenPlanet script directly.

david-baez-bravo commented 6 months ago

Hello, I am trying to test the pre-trained AI. However, the AI never moves, and I am instead greeted with the timeout message.

INFO:root:05/27/24 23:59:38 server IP: 127.0.0.1
Traceback (most recent call last):
  File "C:\Users\davba\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\davba\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\__main__.py", line 88, in <module>
    main(arguments)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\__main__.py", line 43, in main
    rw.run_episodes(10000)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\networking.py", line 674, in run_episodes
    self.run_episode(max_samples_per_episode, train=train)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\networking.py", line 693, in run_episode
    obs, info = self.reset(collect_samples=False)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\networking.py", line 572, in reset
    new_obs, info = self.env.reset()
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\gymnasium\core.py", line 467, in reset
    return self.env.reset(seed=seed, options=options)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\gymnasium\wrappers\order_enforcing.py", line 61, in reset
    return self.env.reset(**kwargs)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\rtgym\envs\real_time_env.py", line 514, in reset
    elt, info = self.interface.reset(seed=seed, options=options)
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\custom\custom_gym_interfaces.py", line 148, in reset
    data, img = self.grab_data_and_img()
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\custom\custom_gym_interfaces.py", line 125, in grab_data_and_img
    data = self.client.retrieve_data()
  File "C:\Users\davba\PycharmProjects\pythonProject3\venv\lib\site-packages\tmrl\custom\utils\tools.py", line 72, in retrieve_data
    assert t_now - t_start < timeout, f"OpenPlanet stopped sending data since more than {timeout}s."
AssertionError: OpenPlanet stopped sending data since more than 10.0s.

User insights and some troubleshooting I tried:

I don't know if my PC has bad specs, but I don't think that's the problem, since it runs TrackMania smoothly at any frame rate. I have lowered my FPS to 30, but the car never moves. My specs are a 3060, a Ryzen 7, and 16 GB of RAM. The curious thing is that my PC does not even seem to try to run the script, as evidenced by its fans not whirring up.

I believe the problem is most likely that the AI never receives information from TrackMania. I have some doubts about how the grab data plugin should be set up. The setup instructions indicated that the plugins folder from the tmrl data should be copied into the plugins folder of the OpenPlanet directory. However, this didn't make much sense, since it would mean putting a plugins folder inside another plugins folder. The OpenPlanet menu can't load the plugins folder, which led me to believe I should copy only the contents of the folder, which I did. However, even after restarting, unloading and reloading the grab data plugin, and running my terminal and/or TrackMania as administrator, the AI still never moves.

Any insight is appreciated, thank you.

yannbouteiller commented 6 months ago

Hi,

First, it is usually not required to copy this plugin folder manually on Windows. When you install tmrl with pip or execute `python -m tmrl --install` after deleting the TmrlData folder, it should be copied automatically into the OpenPlanetNext folder, unless the OpenPlanetNext folder lives at an exotic location on your machine (i.e., not in your home directory).

You don't want to copy the Plugins folder itself into the OpenPlanetNext/Plugins folder. Instead, you want to copy the content of the Plugins folder into OpenPlanetNext/Plugins, as you correctly guessed.

If you receive this timeout right away, it means that tmrl did not receive anything from the tmrl_grabdata plugin. This communication is handled by this python class. It is instantiated when the TrackMania environment is created, and then the retrieve_data() method is called later. Because you don't receive anything on localhost:9000, the retrieve_data() call hangs for 10s until the timeout error is triggered.
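To make the mechanism above concrete, here is a minimal Python sketch of a client that connects in a background thread and whose retrieve_data() waits until a timeout expires. This is an illustration under assumptions, not the actual tmrl class: the name GrabDataClientSketch, the 4096-byte read size, and the polling interval are all hypothetical.

```python
import socket
import threading
import time


class GrabDataClientSketch:
    """Hypothetical stand-in for the tmrl client class (not the real code)."""

    def __init__(self, host="127.0.0.1", port=9000):
        self._host, self._port = host, port
        self._lock = threading.Lock()
        self._latest = None  # stays None until the server sends something
        threading.Thread(target=self._client_thread, daemon=True).start()

    def _client_thread(self):
        # If nothing listens on host:port, ConnectionRefusedError is raised
        # here, inside the background thread, not in the caller.
        with socket.create_connection((self._host, self._port)) as s:
            while True:
                chunk = s.recv(4096)  # assumed read size
                if not chunk:
                    return  # server closed the connection
                with self._lock:
                    self._latest = chunk

    def retrieve_data(self, timeout=10.0, sleep_if_empty=0.01):
        # Poll until data is available; fail once the timeout has elapsed.
        t_start = time.monotonic()
        while True:
            with self._lock:
                if self._latest is not None:
                    return self._latest
            assert time.monotonic() - t_start < timeout, \
                f"no data received for more than {timeout}s"
            time.sleep(sleep_if_empty)
```

With this structure, a refused connection shows up as an exception in the thread, while the main program only fails later with the timeout assertion, which matches the two-part tracebacks posted earlier in this thread.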

Usually, when the tmrl_grabdata script is not properly running in OpenPlanet, you get another error saying that the localhost connection could not be established because the target machine actively refused it (this error should be thrown by the connect() call in the __client_thread python thread). If you don't get this error, the python TrackMania environment was able to connect to something at IP 127.0.0.1 (localhost) on port 9000. It expects this to be the tmrl_grabdata OpenPlanet plugin, but it does not verify that it is indeed talking to tmrl_grabdata rather than to some random program that happens to have a listening server open on 127.0.0.1:9000 on your machine.

So, one hypothesis is that something other than the tmrl_grabdata plugin is listening at address 127.0.0.1:9000 on your machine.
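One way to tell these cases apart from the Python side is a small standalone probe. The sketch below is not part of tmrl; the function name diagnose and its return labels are made up for illustration. It distinguishes "nothing is listening" from "something accepted the connection but sends nothing".

```python
import socket


def diagnose(host="127.0.0.1", port=9000, wait=2.0):
    """Return 'refused', 'silent', 'sending' or 'closed' for host:port."""
    try:
        s = socket.create_connection((host, port), timeout=wait)
    except ConnectionRefusedError:
        return "refused"  # no listener at all (e.g. plugin not running)
    with s:
        s.settimeout(wait)
        try:
            data = s.recv(4096)
        except socket.timeout:
            return "silent"  # connected, but no bytes arrived within `wait`
        except ConnectionResetError:
            return "closed"
        return "sending" if data else "closed"
```

Run it while TrackMania is open and python -m tmrl --test is not running, since the plugin may only serve one client at a time (an assumption). A "silent" result is consistent with either a foreign program on the port or a plugin that never gets past its accept stage.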

One other hypothesis is that the TrackMania python environment correctly connected to the tmrl_grabdata plugin, but then the plugin crashed for some reason. To check this, can you reload the tmrl_grabdata plugin in the OpenPlanet menu, then open the logs in the OpenPlanet menu, check that it says something like "waiting for incoming connection", and check if you get an error in these logs when launching python -m tmrl --test?

david-baez-bravo commented 6 months ago

Hello

I got the message you mentioned, but also an additional deprecation warning. See the pasted image:

image

No errors appeared in the log after running python -m tmrl --test.

I also tried checking port 9000 using the netstat -an | grep 9000 command whilst running the python -m tmrl --test command and got the following output:

TCP 127.0.0.1:9000 0.0.0.0:0 LISTENING (appears even when not running the test)
TCP 127.0.0.1:9000 127.0.0.1:54921 ESTABLISHED
TCP 127.0.0.1:54921 127.0.0.1:9000 ESTABLISHED

I don't know much about ports, or whether this information is even useful. However, from your response, it looks like the program is probably connecting to that default listening server by mistake. Let me know if there are any other commands I should try, particularly any that would tell me which program the listening server belongs to.

yannbouteiller commented 6 months ago

Hmm, it looks like there is indeed a server listening on this port. According to this page, you can run netstat -ano -p tcp to see which PID is doing that, and then match it with the original program in the Task Manager.
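For convenience, the same lookup can be scripted. The sketch below assumes the usual five-column layout of Windows netstat output (Proto, Local Address, Foreign Address, State, PID) and an English "LISTENING" state label; both are assumptions that may not hold on every system, and the function names are hypothetical.

```python
import subprocess
from typing import Set


def listening_pids(netstat_output: str, port: int = 9000) -> Set[int]:
    """Extract PIDs of TCP listeners on `port` from `netstat -ano -p tcp` text."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto, Local Address, Foreign Address, State, PID
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        if fields[1].rsplit(":", 1)[-1] == str(port):
            pids.add(int(fields[4]))
    return pids


def find_port_owner(port: int = 9000) -> Set[int]:
    """Run netstat (Windows) and return the PIDs listening on `port`."""
    result = subprocess.run(["netstat", "-ano", "-p", "tcp"],
                            capture_output=True, text=True, check=True)
    return listening_pids(result.stdout, port)
```

Calling find_port_owner() returns a set of PIDs that can then be looked up in the Details tab of the Task Manager.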

That being said, since the current fixed-port strategy (port 9000) is causing this type of issue, we should come up with a better version of the tmrl_grabdata plugin that looks for an available port instead of forcing port 9000. I am not entirely sure how to do this with AngelScript, but that shouldn't be too hard.
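As an illustration of the idea only, here is what "try the preferred port, otherwise take any free one" looks like in Python. This is not AngelScript and not the plugin's code; open_listener is a hypothetical helper, and whether OpenPlanet's Net::Socket offers an equivalent of binding to port 0 is not something this sketch assumes.

```python
import socket


def open_listener(preferred_port=9000, host="127.0.0.1"):
    """Listen on preferred_port if possible, else on an OS-chosen free port."""
    for port in (preferred_port, 0):  # port 0 asks the OS for any free port
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
        except OSError:
            s.close()  # port already taken, try the next candidate
            continue
        s.listen(1)
        return s, s.getsockname()[1]  # report the port actually in use
    raise OSError("could not bind to any port")
```

The design cost is that the client no longer knows the port in advance, so the chosen port would have to be communicated somehow, for example through a small file or a configuration entry that both sides read.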

david-baez-bravo commented 6 months ago

I looked up the PID and got the result shown in the next image:

image

If I manage to find an alternative solution, which I believe would be uninstalling Ubisoft Connect and finding some other way to run TrackMania, I will post it here.

Thanks for all the help.

yannbouteiller commented 6 months ago

Wait, UPlay is using this port? 😂

This would be notoriously unfortunate. Maybe it is just the OpenPlanet plugin that is running in the background?

PS: It is normal that the server runs even when not launching python -m tmrl --test, as it runs within OpenPlanet, not within python. What would not be normal would be if it kept running while TrackMania is not.

So I believe your issue is somewhere else. Do you have club access to modify the plugin by any chance?

(Only club members can run unsigned plugins, and it takes me several days/weeks to get each new version signed by the OpenPlanet team)

david-baez-bravo commented 6 months ago

It is quite ironic that Uplay is using the port 😂

It is possible that it's OpenPlanet. However, I can no longer check by uninstalling the plugin, since I've just left and won't be returning until after summer.

However I will try to use tmrl on another computer. I believe mine is cursed. You may very well see me return with the same or some other issue.

Thank you for all your help troubleshooting.

yannbouteiller commented 6 months ago

So, to give more info, if you unzip the openplanet plugin you will find that the code starts like this:

// This plugin ships with the TMRL framework.
// It sends game data to the default TrackMania Gym environments.

// send the content of buf over socket sock:
bool send_memory_buffer(Net::Socket@ sock, MemoryBuffer@ buf)
{
    if (!sock.Write(buf))
    {
        // If this fails, the socket might not be open. Something is wrong!
        print("INFO: Disconnected, could not send data.");
        return false;
    }
    return true;
}

// cast val to a float when necessary and append it to buf:
void append_float(MemoryBuffer@ buf, float val)
{
    buf.Write(val);
}

// entry point:
void Main()
{
    while(true)
    {
        // open localhost TCP connection on port 9000:
        auto sock_serv = Net::Socket();
        if (!sock_serv.Listen("127.0.0.1", 9000))
        {
            print("Could not initiate server socket.");
            return;
        }
        print(Time::Now + ": Waiting for incoming connection...");

        while(!sock_serv.CanRead())
        {
            yield();
        }
        print("Socket can read");
        auto sock = sock_serv.Accept();

        print(Time::Now + ": Accepted incoming connection.");

        while (!sock.CanWrite())
        {
            yield();
        }
        print("Socket can write");
        print(Time::Now + ": Connected!");

                (...)

You were stuck at the "Waiting for incoming connection..." stage for some reason, which seems to imply that tmrl never connected to the server. But if it were the case, you would have seen a "connection refused" error on the python side I believe. Maybe this is related to this deprecation warning about .CanRead()?

david-baez-bravo commented 6 months ago

Yes this is definitely it. Whichever library the CanRead function is from must have updated and made CanRead obsolete/deprecated. The code will probably run on an older version of this library.

I don’t know if one can tell pip to install a specific older version.

Another solution would be to update the script so that it uses one of the two new options: Available() or IsReady, whichever works.

yannbouteiller commented 6 months ago

The tmrl library has not changed its way of communicating with OpenPlanet for a while. Only adapting the OpenPlanet plugin to use these other functions, as the deprecation warning suggests, may solve this.

I just tested on my PC right now with the new version of OpenPlanet, and I get the same issue, so it is definitely this new OpenPlanet version that is the problem.

yannbouteiller commented 6 months ago

Plugins.zip

Here is a hot fix for club members. It requires enabling the developer signature mode in OpenPlanet.

I'll try to get this signed and make a new tmrl release ASAP.

yannbouteiller commented 6 months ago

OK, the OpenPlanet team was super reactive and signed it last night. I have updated the resources file.

If you encounter this problem:

And you should be good :)