@algomaschine If you specified the max_runtime_secs constraint, this is a false alarm that I fixed on 20th September in H2O-3. It's caused by running out of time before producing a model in the selection step, so feel free to ignore the exception.
I'm not sure if it was already released in Sparkling Water (3.38.0.1-1). If not, the fix should be out with the next release.
Dear @tomasfryda, thanks for your reply. However, there's another issue. This script worked fine with the same type of data for a long time, but now it gets stuck on some training files. Could you give me a hint as to what the root cause might be?
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 1 hour 40 mins
H2O_cluster_timezone: Europe/Berlin
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.1
H2O_cluster_version_age: 1 month and 2 days
H2O_cluster_name: H2O_from_python_Administrator_e9kh5n
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 22.20 Gb
H2O_cluster_total_cores: 64
H2O_cluster_allowed_cores: 64
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.7.9 final
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 1 hour 40 mins
H2O_cluster_timezone: Europe/Berlin
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.1
H2O_cluster_version_age: 1 month and 2 days
H2O_cluster_name: H2O_from_python_Administrator_e9kh5n
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 22.20 Gb
H2O_cluster_total_cores: 64
H2O_cluster_allowed_cores: 64
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.7.9 final
Parse progress: |████████████████████████████████████████████████████████████████ (done)| 100%
Traceback (most recent call last):
File ".\auto_model_trainer.py", line 116, in
Closing connection _sid_a9a4 at exit H2O session _sid_a9a4 closed.
Oh sorry, now I understand what "Error: Number of classes is equal to 1." means. It's because all my label values were zero and there were no ones for that particular data. Case solved!
Thanks @tomasfryda for covering this issue!
@tomasfryda sorry, could you clarify again please? Do you mean that the time I specified (max_runtime_secs=60*30) is not enough to produce a working model? Or what exactly do you mean by "running out of time before producing a model in the selection step"? Thanks.
@algomaschine In AutoML, we train models in several steps (https://github.com/h2oai/h2o-3/blob/rel-zygmund/h2o-automl/src/main/java/ai/h2o/automl/ModelingPlans.java); the reason for that is time allocation: for each step we set the maximum run time to the remaining time.
First we train models, then grids, and at the end we do some exploitation steps which pick the best GBM/XGBoost and retrain it with learning-rate annealing; here it ran out of time before it trained the model.
So you probably have tens of models trained by AutoML, but during one of the last steps it ran out of time. Running out of time is expected when you specify a time constraint, so it should not produce the exception; what wasn't anticipated is starting to train a new model and then running out of time before the model finishes its first iteration (=> no model produced by that step). The only bad side effect is the exception, so there's nothing to worry about; it's just annoying.
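Conceptually (this is just an illustrative sketch, not the actual H2O-3 code), the time allocation works roughly like this:

```python
import time

def run_automl_steps(steps, max_runtime_secs):
    """Illustrative only: every step gets whatever is left of the overall budget."""
    deadline = time.time() + max_runtime_secs
    models = []
    for step in steps:
        remaining = deadline - time.time()
        if remaining <= 0:
            break  # budget exhausted before this step could even start
        # a late step may run out of time before finishing its first iteration,
        # in which case it produces no model -- that corner case caused the exception
        model = step(max_runtime_secs=remaining)
        if model is not None:
            models.append(model)
    return models
```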
I mean, my worry was that maybe the model was unable to produce anything good, because in this case I'm trying to relate astrological data to some unrelated data series (yeah, it's kinda obscure, but that's the whole point of the experiment). And I get this exception for every single data series out of more than 50. Can this also simply be an indication that the distribution of 1s and 0s in the target column (what I'm trying to predict) is very disproportionate?
@algomaschine I don't think that could be the case, but to be sure, try using AutoML with the max_models constraint instead of max_runtime_secs. Without the time constraint there should be no error like that, and you can then compare the leaderboards (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#leaderboard) of those runs to make sure that the error doesn't influence the performance of the models.
An imbalanced dataset shouldn't produce any issues like this. But it might be worth looking into the functionality we have for balancing the data (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#optional-miscellaneous-parameters), e.g., setting balance_classes = True might help.
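A minimal sketch of that suggestion in Python (the file paths and the target column name are placeholders, not taken from the script above):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# placeholders: adjust paths and the target column name to your data
train = h2o.import_file("train.csv")
test = h2o.import_file("test.csv")
y = "target"
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# quick look at how imbalanced the target actually is
print(train[y].table())

# max_models instead of max_runtime_secs avoids the time-based cutoff;
# balance_classes over/under-samples to even out the classes
aml = H2OAutoML(max_models=20, balance_classes=True, seed=1)
aml.train(y=y, training_frame=train, leaderboard_frame=test)
print(aml.leaderboard)
```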
@tomasfryda thank you, I understand now. Just another quick question: from time to time I have this issue with the same datasets. It seems to fix itself, because the models do get generated eventually. Is it just a bug, or might something be wrong with my data?
Thanks, and just a quick question.
C:\Program Files\Python37\lib\site-packages\h2o\job.py:83: UserWarning: Test/Validation dataset column 'str_69' has levels not trained on: ["Pisces"] warnings.warn(w)
I faced this warning that the column was not trained on such a value. OK, fair enough, but why is it just a WARNING? In my understanding it should be a severe ERROR, because how would it apply the model if there was no such value in the training data? Or would it just replace this value with some numeric enum average?
@algomaschine Regarding the exception from GLM, are you using the newest version of h2o (3.38.0.2)? If not, I'd suggest upgrading to the newest version. If you are using the newest version, then it's a bug that we haven't fixed yet. In that case, would you be able to extract the whole stack trace from the H2O logs (searching for java.lang.ArrayIndexOutOfBoundsException in the logs should lead you to it) and post it here?
UserWarning: Test/Validation dataset column 'str_69' has levels not trained on: ["Pisces"]
It's just a warning because such models can still be useful. I think the handling of an unseen level uses the same logic as missing values. It's the user's responsibility to decide how severe the problem is: you can have a variable with thousands of levels of which only a few are useful, and in such a case a warning is appropriate. If you have five levels and one is missing from the training data, then it can be an issue if the variable contains a lot of information.
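If you want to spot such unseen levels yourself before scoring, a small sketch (assuming H2OFrames named train and test, and the column from the warning) could be:

```python
# compare the categorical levels of a column between the train and test frames
train_levels = set(train["str_69"].levels()[0])
test_levels = set(test["str_69"].levels()[0])

unseen = test_levels - train_levels
if unseen:
    print("Levels in test but not seen during training:", unseen)  # e.g. {'Pisces'}
```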
So I've got over 70 years of data (you can see the frequency table above), but for this column (str_69) the value Pisces only occurs from 2022/5/3 to 2022/8/18 (and that's it! Don't ask me why, it's astrology :) In my logic it's better to remove this column altogether, because I don't think we have enough samples to even use it for training compared to the other values. Would you agree? Also, I would normally assume H2O pre-processing would have gotten rid of that column if I had included more data to train, or is that not quite true and it follows some other logic? Can I read more about the pre-processing logic anywhere?
I would train a model with all the data and look at the variable importance. If that categorical variable is important, I would keep it. Another thing you can do is use a fold_column and construct it so that all the levels are distributed evenly across the folds; that way you would not encounter this kind of error.
We don't do much preprocessing, and the handling of missing values is algorithm specific (e.g., GBM).
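A rough sketch of both suggestions (names like aml, train, y and str_69 are placeholders reused from the earlier snippets; the 5-fold split is arbitrary):

```python
# 1) check how much the categorical column actually contributes
best = aml.leader  # note: stacked ensembles don't expose varimp; pick a GBM/XGBoost from the leaderboard instead
print(best.varimp(use_pandas=True))

# 2) build a fold column so that every level of str_69 is spread across all folds
df = train.as_data_frame()
df["fold"] = df.groupby("str_69").cumcount() % 5  # round-robin within each level
train_folds = h2o.H2OFrame(df)
train_folds[y] = train_folds[y].asfactor()
train_folds["fold"] = train_folds["fold"].asfactor()

aml2 = H2OAutoML(max_models=20, seed=1)
aml2.train(y=y, training_frame=train_folds, fold_column="fold")
```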
h2o==3.38.0.1 / Python 3.7.9 / Windows Server 2019 Standard / Firewall OFF completely, no software controlling ports
Gents, here's another issue that comes up from time to time, usually after I have been generating models for many hours (the h2o server is restarted every time I generate a new model). It's solved only by restarting the whole machine.
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
Starting server from C:\Program Files\Python37\lib\site-packages\h2o\backend\bin\h2o.jar
Ice root: C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpv7_703by
JVM stdout: C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpv7_703by\h2o_Administrator_started_from_python.out
JVM stderr: C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpv7_703by\h2o_Administrator_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Europe/Berlin
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.1
H2O_cluster_version_age: 1 month and 27 days
H2O_cluster_name: H2O_from_python_Administrator_xbepc0
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 26.67 Gb
H2O_cluster_total_cores: 64
H2O_cluster_allowed_cores: 64
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.7.9 final
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 10 secs
H2O_cluster_timezone: Europe/Berlin
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.38.0.1
H2O_cluster_version_age: 1 month and 27 days
H2O_cluster_name: H2O_from_python_Administrator_xbepc0
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 26.63 Gb
H2O_cluster_total_cores: 64
H2O_cluster_allowed_cores: 64
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.7.9 final
Parse progress: |████████████████████████████████████████████████████████████████ (done)| 100%
AutoML progress: |▋ | 1%
17:58:19.162: AutoML: XGBoost is not available; skipping it.
Failed polling AutoML progress log: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /99/AutoML/train_SELL_target_Close_th_9.1per_2022-11-15.csv-seed-1668531493981-seed-1668531493981@@target_Close_th_9_1per?verbosity=warn (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000019581C59A08>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_a5a0dc8641446840b41a21d91bf94ebe (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000195FCBF4FC8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')), will retry after 3s.
Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_a5a0dc8641446840b41a21d91bf94ebe , will retry after 3s.
[the "Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing ... will retry after 3s." message repeats many more times]
Failed polling AutoML progress log: No AutoML instance with id train_SELL_target_Close_th_9.1per_2022-11-15.csv-seed-1668531493981-seed-1668531493981@@target_Close_th_9_1per.
ERROR TRAINING .\train-test\train_SELL_target_Close_th_9.1per_2022-11-15.csv
Traceback (most recent call last):
  File ".\auto_model_trainer.py", line 165, in train_by_data
    aml.train(y = y, training_frame = train, leaderboard_frame = test)
  File "C:\Program Files\Python37\lib\site-packages\h2o\automl\_estimator.py", line 679, in train
    self._job.poll(poll_updates=poll_updates)
  File "C:\Program Files\Python37\lib\site-packages\h2o\job.py", line 71, in poll
    pb.execute(self._refresh_job_status, progress_monitor_fn=ft.partial(poll_updates, self))
  File "C:\Program Files\Python37\lib\site-packages\h2o\utils\progressbar.py", line 187, in execute
    res = progress_fn()  # may raise StopIteration
  File "C:\Program Files\Python37\lib\site-packages\h2o\job.py", line 138, in _refresh_job_status
    jobs = self._query_job_status_safe()
  File "C:\Program Files\Python37\lib\site-packages\h2o\job.py", line 134, in _query_job_status_safe
    raise last_err
  File "C:\Program Files\Python37\lib\site-packages\h2o\job.py", line 116, in _query_job_status_safe
    result = h2o.api("GET /3/Jobs/%s" % self.job_key)
  File "C:\Program Files\Python37\lib\site-packages\h2o\h2o.py", line 124, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "C:\Program Files\Python37\lib\site-packages\h2o\backend\connection.py", line 498, in request
    return self._process_response(resp, save_to)
  File "C:\Program Files\Python37\lib\site-packages\h2o\backend\connection.py", line 852, in _process_response
    raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_a5a0dc8641446840b41a21d91bf94ebe
Closing connection _sid_984b at exit H2O session _sid_984b closed. Closing connection _sid_96f2 at exit H2O session _sid_96f2 closed.
@algomaschine Logs from the h2o backend would be useful to determine the cause of the issue. My guess would be that the backend got killed or is unresponsive due to, e.g., an out-of-memory issue.
Location of the logs is printed out during initialization like this:
JVM stdout: C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpv7_703by\h2o_Administrator_started_from_python.out
JVM stderr: C:\Users\ADMINI~1\AppData\Local\Temp\2\tmpv7_703by\h2o_Administrator_started_from_python.err
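Two things that might help from the Python side (the memory value below is just an example, and grabbing the logs only works while the backend is still reachable):

```python
import h2o

# give the backend an explicit, larger heap when starting it
h2o.init(max_mem_size="48G")

# ... after (or during) a run, bundle all backend logs into a zip for inspection
h2o.download_all_logs(dirname="h2o_logs", filename="h2o_logs.zip")
```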
Gents, I sometimes face this issue when generating a model. Simple Python script.
03:56:42.592: GBM_lr_annealing_selection_AutoML_14_20221021_25801 [GBM lr_annealing] failed: water.exceptions.H2OIllegalArgumentException: Can only convert jobs producing a single Model or ModelContainer.