h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

h2o.H2OFrame.as_data_frame() leads to OSError #16045

Open RoelVerbelen opened 8 months ago

RoelVerbelen commented 8 months ago

H2O version, Operating System and Environment

H2O 3.44.0.3, Python 3.11, Windows 10 (environment also has pyarrow 14.0.2 and polars 0.20.6)

Python Code

Directly taken from the documentation of h2o.H2OFrame.as_data_frame():

import h2o
h2o.init(nthreads = 3)
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
airlines.as_data_frame()

**OSError**

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
...\h2o-bug.py in line 4
      2 h2o.init(nthreads = 3)
      3 airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
----> 4 airlines.as_data_frame()

File ...\h2o\frame.py:1974, in H2OFrame.as_data_frame(self, use_pandas, header)
   1972 if (can_use_datatable()) or (can_use_polars() and can_use_pyarrow()): # can use multi-thread
   1973     with tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv") as exportFile:
-> 1974         h2o.export_file(self, exportFile.name, force=True)
   1975         if can_use_datatable(): # use datatable for multi-thread by default
   1976             return self.convert_with_datatable(exportFile.name)

File ...\h2o\h2o.py:1655, in export_file(frame, path, force, sep, compression, parts, header, quote_header, parallel, format, write_checksum)
   1648 assert_is_type(format, str)
   1649 assert_is_type(write_checksum, bool)
   1650 H2OJob(api("POST /3/Frames/%s/export" % (frame.frame_id), 
   1651            data={"path": path, "num_parts": parts, "force": force, 
   1652                  "compression": compression, "separator": ord(sep),
   1653                  "header": header, "quote_header": quote_header, "parallel": parallel, 
   1654                  "format": format, "write_checksum": write_checksum}
-> 1655            ),  "Export File").poll()

File ...\h2o\job.py:88, in H2OJob.poll(self, poll_updates)
     86 if self.status == "FAILED":
     87     if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
---> 88         raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
     89                                "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
     90     else:
     91         raise EnvironmentError("Job with key %s failed with an exception: %s" % (self.job_key, self.exception))

OSError: Job with key $03017f00000132d4ffffffff$_adf86809d544cbbb6869b32acd464457 failed with an exception: java.lang.RuntimeException: java.io.FileNotFoundException: [C:\Users\ROEL](file:///C:/Users/ROEL)~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv (The process cannot access the file because it is being used by another process)
stacktrace: 
java.lang.RuntimeException: java.io.FileNotFoundException: [C:\Users\ROEL](file:///C:/Users/ROEL)~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv (The process cannot access the file because it is being used by another process)
    at water.persist.PersistManager.create(PersistManager.java:793)
    at water.util.FrameUtils$ExportTaskDriver.exportCSVStream(FrameUtils.java:594)
    at water.util.FrameUtils$ExportTaskDriver.exportCSVStream(FrameUtils.java:587)
    at water.util.FrameUtils$ExportTaskDriver.compute2(FrameUtils.java:424)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1689)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.io.FileNotFoundException: [C:\Users\ROEL](file:///C:/Users/ROEL)~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv (The process cannot access the file because it is being used by another process)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:123)
    at water.persist.PersistManager.create(PersistManager.java:790)
    ... 9 more

Worked before using h2o 3.42.0.3

I looked at the changelog for changes made to as_data_frame() and tried downgrading to version 3.42.0.3, where the above code still works fine for me.

tomasfryda commented 8 months ago

Thank you for taking the time to pinpoint the issue. Unfortunately, I don't have a Windows machine, so I have just two untested hypotheses: (1) an unexpected character in the path, or (2) two processes trying to open the same file (which is supported on Unix-like systems but not on Windows).

If it's just (1), could you provide us with part of the h2o log? I'm interested in the log entry ExportFiles processing (SOME_PATH), e.g.:

01-31 10:35:21.528 127.0.0.1:54321       3076   5715826-38  INFO water.default: ExportFiles processing (/tmp/iris.csv)

If that is the only problem, you could work around it by adding something like the following to the top of your script or Jupyter notebook (just make sure the path exists).

import tempfile
tempfile.tempdir = "C:\\tmp\\"
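
A minimal sketch of that workaround which also creates the directory if it is missing. The helper name `use_temp_dir` is hypothetical; `C:\tmp` is just the example path from above.

```python
import os
import tempfile


def use_temp_dir(path):
    """Create `path` if needed and make it the tempfile module's default directory."""
    os.makedirs(path, exist_ok=True)  # "just make sure the path exists"
    tempfile.tempdir = path           # tempfile.gettempdir() now returns this path
    return tempfile.gettempdir()


# Example (Windows): use_temp_dir("C:\\tmp")
```

Files created afterwards by `tempfile.NamedTemporaryFile()` without an explicit `dir` argument land in that directory.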

If it's (2), we will need to fix how the temporary file is created. It should be a simple fix. I think something like the following would do. cc @wendycwong

--- a/h2o-py/h2o/frame.py
+++ b/h2o-py/h2o/frame.py
@@ -1970,12 +1970,16 @@ class H2OFrame(Keyed, H2ODisplay):
         if can_use_pandas() and use_pandas:
             import pandas
             if (can_use_datatable()) or (can_use_polars() and can_use_pyarrow()): # can use multi-thread
-                with tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv") as exportFile:
+                exportFile = tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv", delete=False)
+                try:
+                    exportFile.close()
                     h2o.export_file(self, exportFile.name, force=True)
                     if can_use_datatable(): # use datatable for multi-thread by default
                         return self.convert_with_datatable(exportFile.name)
                     elif can_use_polars() and can_use_pyarrow():  # polar/pyarrow if datatable is not available
                         return self.convert_with_polars(exportFile.name)
+                finally:
+                    os.unlink(exportFile.name)
             warnings.warn("converting H2O frame to pandas dataframe using single-thread.  For faster conversion using"
                           " multi-thread, install datatable (for Python 3.9 or lower), or polars and pyarrow "
                           "(for Python 3.10 or above).", H2ODependencyWarning)

You can patch your h2o library using that code, but it might get a little more involved. If it's just (2), I think we could manage to release the fix in the upcoming major release (likely within the next month). If the problem is in (1) as well, we would probably need your help in providing the line from the log.

RoelVerbelen commented 8 months ago

Hi @tomasfryda

Thanks for your response.

Here is that part of the logs:

01-31 10:18:43.312 127.0.0.1:54321       31396  8557915-20  INFO water.default: ExportFiles processing (C:\Users\ROEL~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv)
01-31 10:18:43.314 127.0.0.1:54321       31396  8557915-20  WARN water.default: File C:\Users\ROEL~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv exists, but will be overwritten!
01-31 10:18:43.325 127.0.0.1:54321       31396      FJ-1-7 ERROR water.default: 
java.lang.RuntimeException: java.io.FileNotFoundException: C:\Users\ROEL~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv (The process cannot access the file because it is being used by another process)

Sounds like it might be (2) rather than (1)?
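
Hypothesis (2) can be probed without H2O. The sketch below is a rough check under one assumption: a second open from the same Python process behaves like the H2O backend opening the path. The function name `can_reopen_while_open` is hypothetical. The Python documentation for `tempfile.NamedTemporaryFile` notes that reopening by name while the file is still open works on Unix but not on Windows.

```python
import tempfile


def can_reopen_while_open():
    """Return True if a NamedTemporaryFile can be opened again by name while still open."""
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        try:
            with open(tmp.name, "w"):
                return True
        except OSError:  # PermissionError is a subclass of OSError
            return False


print(can_reopen_while_open())
```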

tomasfryda commented 8 months ago

Thank you for the log entry. I think we'll need to fix both or at least make sure the problem is not in (1) as well.

The exception from Java has a combination of path separators (/ and \), but the log entry contains just \, so I think there is some wrong conversion of path separators in the Java backend. It's possible that issue (1) is only in the error-handling part, which could explain why the previous version worked, but we should look into it to make sure.

java.lang.RuntimeException: java.io.FileNotFoundException: [C:\Users\ROEL](file:///C:/Users/ROEL)~1.VER\AppData\Local\Temp\tmpmg5yuobe.h2oframe2Convert.csv

kalaiselvan263 commented 8 months ago

@tomasfryda Do you have a workaround for this issue?

tomasfryda commented 8 months ago

@kalaiselvan263 Not yet. I think the modification I suggested (https://github.com/h2oai/h2o-3/issues/16045#issuecomment-1918835885) would work, but I don't have a Windows machine to test it on.

You would need to find where the h2o package is installed and navigate to the file frame.py. On macOS, running the following in the python3 that has h2o installed gives me the path to the file: import sysconfig; print(sysconfig.get_paths()["purelib"]+"/h2o/frame.py"). I think it would work on Windows as well (you'd just need to change / to \).
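
A platform-neutral variant of that one-liner, as a sketch: `os.path.join` picks the separator, so nothing needs changing between macOS and Windows. The helper name `installed_file` is hypothetical, and the sketch assumes h2o lives in the interpreter's "purelib" directory, which may not hold for user-level or editable installs.

```python
import os
import sysconfig


def installed_file(*parts):
    """Join `parts` onto the site-packages ("purelib") directory of the running interpreter."""
    return os.path.join(sysconfig.get_paths()["purelib"], *parts)


frame_py = installed_file("h2o", "frame.py")
print(frame_py, os.path.exists(frame_py))
```

If the second printed value is `False`, the package is installed somewhere else; in that case `h2o.__file__` from an interpreter where `import h2o` works shows the package location.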

If that doesn't work, you can change exportFile to some predefined path that does not contain any special characters, e.g.:

import random
# randint needs integer bounds; recent Python versions reject the float 1e8
exportFile = "C:\\tmp\\h2o_tempfile_{}.csv".format(random.randint(0, 10**8))

It's not perfect: with this change there could be issues if multiple users do the same thing at the same time, but the probability of that is pretty low (about 1e-8). On Windows you'd most likely just end up with the same error ("The process cannot access the file because it is being used by another process"), so you'd just have to retry.
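
If name collisions are a concern, one alternative to a random integer is a UUID in the file name. This is a sketch of that swapped-in technique, not part of the suggested patch; `unique_export_path` and its parameters are hypothetical, and the directory (for example `C:\tmp`) must already exist.

```python
import os
import uuid


def unique_export_path(directory, prefix="h2o_tempfile_", suffix=".csv"):
    """Build a path in `directory` whose file name is very unlikely to collide."""
    return os.path.join(directory, "{}{}{}".format(prefix, uuid.uuid4().hex, suffix))


# Example (Windows): unique_export_path("C:\\tmp")
```

The function only builds the name. Nothing is created on disk, so no handle is left open for the backend to conflict with.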

MoonCapture commented 7 months ago

I found the key to the problem, you just need to uninstall datatable : )