biggoffer opened this issue 3 years ago
Hi @biggoffer, does `pip install -U modin[dask]` solve the issue?
Let me check...
Nope, it's stuck in a loop.
@biggoffer this is an issue with the Dask compute engine. Are you comfortable using Ray? It's experimental on Windows but should still work: `pip install modin[ray]`
`modin[ray]` worked after uninstalling 32-bit Python 3.8 and doing a clean install of Python 3.7, but now I'm getting this:
```
C:\Users\adnan\AppData\Local\Programs\Python\Python37\python.exe D:/Python/aprogram/Callion/callion3.py
2021-04-22 00:51:34,595 INFO services.py:1173 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: sort_values defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
```
@biggoffer I've added a small bit of formatting to your messages here, I hope you don't mind.
As for the last message: this is just a warning that `sort_values` is not yet fully optimized in Modin (so it's not as performant as it should be), but it should not be a blocker.
Hello,
I had the same issue, and after following the guide I downloaded Ray. Right after that, I selected Ray like this: `os.environ["MODIN_ENGINE"] = "ray"` and I got this error. Can you help me?
@EscleineDaher I suggest you check your `http_proxy`/`https_proxy` environment variables. Ray uses HTTP to interconnect with some of its internal services, and it honours the proxy variables while connecting.
Is it not possible to get rid of this part and just stay offline?
I don't exactly understand what you mean here.
What I meant was that you can try unsetting `http_proxy` and `https_proxy` (e.g. `export http_proxy=` in bash) before running Modin.
Ray itself won't go to the Internet unless you ask it to, but if I set proxy variables I observe a stack trace which looks very similar to yours.
I don't understand what you mean. I write `export http_proxy` before `os.environ["MODIN_ENGINE"] = "ray"`? Is that it? I've tried it and I have this error:
```
runfile('C:/Users/2101550/OneDrive - DAHER/Bureau/Python/Lib/Win32&64 Python/Test++.py', wdir='C:/Users/2101550/OneDrive - DAHER/Bureau/Python/Lib/Win32&64 Python')
  File "C:\Users\2101550\OneDrive - DAHER\Bureau\Python\Lib\Win32&64 Python\Test++.py", line 14
    export https_proxy
           ^
SyntaxError: invalid syntax
```
Apparently my issue comes from my VPN, but I can't get rid of it. And now when I want to use Dask, the same error pops up. I'm stuck with this issue and I don't know how to make Modin work properly.
So, what I propose is that you write the following at the very beginning of your `test++.py`:

```python
import os

os.environ['MODIN_ENGINE'] = 'ray'
os.environ['http_proxy'] = ''
os.environ['https_proxy'] = ''
```
I've tried the code but the same error persists.
In this case it could be a firewall (or some antivirus software) which is blocking the connection. I don't think we can fix it from our end...
Do both Ray and Dask use a proxy? Isn't there a library to speed up Python dataframes without this problem? Because pandas dataframes are very slow for big files.
I'm guessing this is the same problem that I'm facing (but do tell me if I'm wrong and I'll raise it as a separate issue).
I just installed modin as I thought it sounded very promising. I started by following the Getting Started guide. I made a few very minor changes to the sample code:
```
C:\WINDOWS\system32>python -m pip install -U "ray[default]"
ERROR: Could not find a version that satisfies the requirement ray[default] (from versions: none)
ERROR: No matching distribution found for ray[default]
```
- I changed `s3_path` to `'taxi.csv'` in both of the `read_csv` lines - it seemed a bit odd to use `urlretrieve` to download the CSV to a local folder and then get `read_csv` to read it from the URL rather than from the pre-downloaded file.
- I added the `http_proxy` commands suggested above.
Resulting test script:
```python
import modin.pandas as pd
import pandas
import os
import time

# This may take a few minutes to download
import urllib.request
# Commented out after first run
#s3_path = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"
#urllib.request.urlretrieve(s3_path, "taxi.csv")

# Must be done after urlretrieve or the urlretrieve won't work
os.environ['http_proxy'] = ''
os.environ['https_proxy'] = ''

start = time.time()
# Changed from example code to use pre-downloaded file instead of s3_path
pandas_df = pandas.read_csv('taxi.csv', parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

start = time.time()
# Changed from example code to use pre-downloaded file instead of s3_path
modin_df = pd.read_csv('taxi.csv', parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))
print("Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))
```
Pandas reads the CSV file in about 4 to 5 seconds (it varies from run to run). I then get this message on the prompt and then a delay of about 20 seconds:
```
UserWarning: Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    from distributed import Client

    client = Client()
```
I then get a whole sequence of messages saying `Time to read with pandas: 19.742 seconds` (note "with pandas"???). That quickly disappears off the top of the screen as there are then loads of traceback errors:
```
Task exception was never retrieved
future: <Task finished name='Task-20' coro=<_wrap_awaitable() done, defined at c:\applications\development\languages\python\lib\asyncio\tasks.py:643> exception=RuntimeError('\n    An attempt has been made to start a new process before the\n    current process has finished its bootstrapping phase.\n\n    This probably means that you are not using fork to start your\n    child processes and you have forgotten to use the proper idiom\n    in the main module:\n\n        if __name__ == \'__main__\':\n            freeze_support()\n            ...\n\n    The "freeze_support()" line can be omitted if the program\n    is not going to be frozen to produce an executable.')>
Traceback (most recent call last):
  File "c:\applications\development\languages\python\lib\asyncio\tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "c:\applications\development\languages\python\lib\site-packages\distributed\core.py", line 274, in _
    await self.start()
  File "c:\applications\development\languages\python\lib\site-packages\distributed\nanny.py", line 339, in start
    response = await self.instantiate()
  File "c:\applications\development\languages\python\lib\site-packages\distributed\nanny.py", line 422, in instantiate
    result = await self.process.start()
  File "c:\applications\development\languages\python\lib\site-packages\distributed\nanny.py", line 692, in start
    await self.process.start()
  File "c:\applications\development\languages\python\lib\site-packages\distributed\process.py", line 32, in _call_and_set_future
    res = func(*args, **kwargs)
  File "c:\applications\development\languages\python\lib\site-packages\distributed\process.py", line 186, in _start
    process.start()
  File "c:\applications\development\languages\python\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "c:\applications\development\languages\python\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "c:\applications\development\languages\python\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "c:\applications\development\languages\python\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "c:\applications\development\languages\python\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Task exception was never retrieved
```
That keeps on getting echoed until I kill the process with process explorer.
Not a very good introduction to Modin so far - given that I can't get past the "Getting Started" tutorial, I've had to go back to plain old pandas and give up on Modin.
This is on a Windows 10 PC, running Python 3.10.6 and Modin 0.15.3 (installed through conda-forge). The Windows 10 PC is behind a Zscaler proxy: all external traffic must go through that proxy (but I don't understand why that would affect Modin as it's presumably just using local 127.0.0.1-type access).
@abudden, your issue seems to be related to this topic: https://modin.readthedocs.io/en/stable/getting_started/troubleshooting.html#error-when-using-dask-engine-runtimeerror-if-name-main. You should put your code under an `if __name__ == '__main__':` guard.
@YarShev That definitely seemed to help, thank you. With the updated script (included below), I now get:
```
Time to read with pandas: 5.729 seconds
UserWarning: Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    from distributed import Client

    client = Client()
Time to read with Modin: 8.884 seconds
Modin is 0.64x faster than pandas at `read_csv`!
```
Still not very impressive :-) and it feels like the Getting Started guide could do with an update to clarify all of this.
I think the slow-down is partly down to the Dask initialisation time: if I add those two lines suggested above, the time reduces to about 3 seconds with Modin, so about twice as fast as pandas (of course this also adds a few seconds of extra delay to the initialisation time).
Updated script:
```python
import modin.pandas as pd
import pandas
import os
import time

# This may take a few minutes to download
import urllib.request
if not os.path.exists('taxi.csv'):
    s3_path = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"
    urllib.request.urlretrieve(s3_path, "taxi.csv")

os.environ['http_proxy'] = ''
os.environ['https_proxy'] = ''

if __name__ == "__main__":
    start = time.time()
    # Changed from example code to use pre-downloaded file instead of s3_path
    pandas_df = pandas.read_csv('taxi.csv', parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
    end = time.time()
    pandas_duration = end - start
    print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

    start = time.time()
    # Changed from example code to use pre-downloaded file instead of s3_path
    modin_df = pd.read_csv('taxi.csv', parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
    end = time.time()
    modin_duration = end - start
    print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))
    print("Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))
```
@abudden, thanks for the update. Yes, we should probably add a note to the docs regarding the engine initialization time.
@mvashishtha, can you take this on?
@abudden a note on the proxy issues - this is usually caused by the fact that, while one sets up a proxy for HTTP access, one does not make an exclusion for `127.0.0.1`, so when Ray (or Dask, it doesn't matter) tries to make its processes talk to each other via `http://127.0.0.1:<port>/some/api/endpoint`, everything falls apart.
One workaround is to disable proxies completely, like you did; another could be defining a `NO_PROXY` variable that allows un-proxied access to `localhost`/`127.0.0.1`. The real fix should be within the libraries (Ray and Dask in our case), so they would explicitly use un-proxied access for their localhost calls.
@vnlitvinov I can run Modin on Ray in an environment with a proxy. Can we close this issue?
I think we have to make sure these problems (with proxies and the need for an `if __name__ == "__main__"` guard) are highlighted somewhere around the "getting started" section of our docs. I see we already mention the Dask problem there, but not the proxies.
Stack trace

```
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>>
================ RESTART: D:\Python\aprogram\Callion\callion3.py ===============

Warning (from warnings module):
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\modin\error_message.py", line 81
    warnings.warn(
UserWarning: Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    from distributed import Client

    client = Client()
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\utils.py", line 656, in log_errors
    yield
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\scheduler.py", line 2205, in remove_worker
    address = self.coerce_address(address)
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\scheduler.py", line 4943, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
distributed.core - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\core.py", line 513, in handle_comm
    result = await result
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\scheduler.py", line 2205, in remove_worker
    address = self.coerce_address(address)
  File "C:\Users\adnan\AppData\Local\Programs\Python\Python38-32\lib\site-packages\distributed\scheduler.py", line 4943, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
tornado.application - ERROR - Exception in callback
```