oceanmodeling / ondemand-storm-workflow


Incomplete post-processing step on Hercules #43

Closed by FariborzDaneshvar-NOAA 6 months ago

FariborzDaneshvar-NOAA commented 6 months ago

@SorooshMani-NOAA The post-processing step of the workflow on Hercules is failing at the sensitivity plot section with a timeout error (see below). Would increasing the time limit resolve this issue, or do you have any other recommendations? Thanks!

[2024-02-13 03:52:28,447] surrogate       INFO    : saving sensitivities to "/work2/noaa/nos-surge/shared/nhc_hurricanes/sandy_2012_1940d4cc-ea2e-4752-820d-0bb2fdf37b75/setup/ensemble.dir/analyze/linear_k1_p1_n0.025/sensitivities.nc"
/opt/conda/envs/prep/lib/python3.9/site-packages/ensembleperturbation/plotting/nodes.py:220: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  countries = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
Traceback (most recent call last):
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 1447, in connect
    super().connect()
  File "/opt/conda/envs/prep/lib/python3.9/http/client.py", line 946, in connect
    self.sock = self._create_connection(
  File "/opt/conda/envs/prep/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/opt/conda/envs/prep/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/prep/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/prep/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/scripts/analyze_ensemble.py", line 409, in <module>
    main(parser.parse_args())
  File "/scripts/analyze_ensemble.py", line 49, in main
    analyze(tracks_dir, ensemble_dir / 'analyze')
  File "/scripts/analyze_ensemble.py", line 56, in analyze
    _analyze(tracks_dir, analyze_dir, mann_coef)
  File "/scripts/analyze_ensemble.py", line 320, in _analyze
    plot_sensitivities(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/ensembleperturbation/plotting/surrogate.py", line 298, in plot_sensitivities
    plot_node_map(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/ensembleperturbation/plotting/nodes.py", line 235, in plot_node_map
    storm.data['longitude'], storm.data['latitude'], 'k--', label=storm.name,
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py", line 210, in name
    storms = nhc_storms(year=self.year)
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/storms.py", line 64, in nhc_storms
    storms = pandas.read_csv(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 618, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1618, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1878, in _make_engine
    self.handles = get_handle(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/common.py", line 728, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/common.py", line 384, in _get_filepath_or_buffer
    with urlopen(req_info) as req:
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/pandas/io/common.py", line 289, in urlopen
    return urllib.request.urlopen(*args, **kwargs)
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/conda/envs/prep/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
ERROR conda.cli.main_run:execute(124): `conda run python -m analyze_ensemble --ensemble-dir /work2/noaa/nos-surge/shared/nhc_hurricanes/sandy_2012_1940d4cc-ea2e-4752-820d-0bb2fdf37b75/setup/ensemble.dir/ --tracks-dir /work2/noaa/nos-surge/shared/nhc_hurricanes/sandy_2012_1940d4cc-ea2e-4752-820d-0bb2fdf37b75/setup/ensemble.dir//track_files` failed. (See above for error)
SorooshMani-NOAA commented 6 months ago

@FariborzDaneshvar-NOAA this is related to not having internet access again! This happens in these parts of the plotting code:

The storm is passed to this code as the path to the track file created here: https://github.com/oceanmodeling/ondemand-storm-workflow/blob/269f744ebf05a831f15ed7851c57caf75ecc2dbe/singularity/prep/files/analyze_ensemble.py#L131

The strange thing is that when I run this manually using the Singularity container on Hercules, I don't get either of the two connection timeouts. The errors above show one timeout for fetching the naturalearth_lowres dataset and another for fetching the name of the storm.

The Natural Earth data is cached in the container when I try to use it, and the storm name is correctly returned from the track file, so I don't know why we actually hit the network request path in the error at all!
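
For documentation, here is a minimal offline check one could run inside the container (a sketch; it assumes stormevents exposes the parsed ATCF fields through track.data with a 'name' column, which the traceback above only confirms for 'longitude' and 'latitude'):

# Minimal offline check: if the parsed ATCF data already carries a
# non-empty storm name, the VortexTrack.name property should not need
# to fall back to the online nhc_storms() lookup that times out above.
from stormevents.nhc.track import VortexTrack

track = VortexTrack.from_file('./original.22')
print(track.data['name'].unique())  # hoping for ['SANDY'] rather than ['']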

From my console on a Hercules compute node (no internet access), inside the Singularity container, I get:

>>> import stormevents
>>> st = stormevents.nhc.track.VortexTrack.from_file('./original.22')
>>> st.name
'SANDY'

And for the shape dataset:

>>> import geopandas as gpd
>>> gpd.datasets.get_path('naturalearth_lowres')
'/opt/conda/envs/prep/lib/python3.9/site-packages/geopandas/datasets/naturalearth_lowres/naturalearth_lowres.shp'
>>> gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
         pop_est  ...                                           geometry
0       889953.0  ...  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1     58005463.0  ...  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2       603253.0  ...  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3     37589262.0  ...  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4    328239523.0  ...  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
..           ...  ...                                                ...
172    6944975.0  ...  POLYGON ((18.82982 45.90887, 18.82984 45.90888...
173     622137.0  ...  POLYGON ((20.07070 42.58863, 19.80161 42.50009...
174    1794248.0  ...  POLYGON ((20.59025 41.85541, 20.52295 42.21787...
175    1394973.0  ...  POLYGON ((-61.68000 10.76000, -61.10500 10.890...
176   11062113.0  ...  POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...

[177 rows x 6 columns]

In both cases I removed the warning messages!

SorooshMani-NOAA commented 6 months ago

Just for the sake of documentation:

I'll look into the no-name track file issue to resolve this ticket.

SorooshMani-NOAA commented 6 months ago

Official tracks don't have a name field, so we can just put the name in with our prep script. For the storms you already ran, you can add the storm name to the track files manually. Please list all the completed runs you have that need combining, so that I can fix their track files; then I'll fix the setup script to take care of this automatically.
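
For reference, a minimal sketch of what that manual edit could look like (a hypothetical helper, not the workflow's actual prep code; it assumes the standard ATCF column order, where STORMNAME is the 28th comma-separated field):

# Hypothetical helper for patching a storm name into an ATCF deck
# (e.g. fort.22 / original.22); assumes STORMNAME is field index 27.
from pathlib import Path

def add_storm_name(track_file, name):
    lines = Path(track_file).read_text().splitlines()
    patched = []
    for line in lines:
        fields = [field.strip() for field in line.split(',')]
        while len(fields) < 28:   # pad short records out to STORMNAME
            fields.append('')
        fields[27] = name.upper()
        patched.append(', '.join(fields))
    Path(track_file).write_text('\n'.join(patched) + '\n')

add_storm_name('./original.22', 'SANDY')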

FariborzDaneshvar-NOAA commented 6 months ago

Great, thanks! You might want to test it with Michael, which has smaller files than Sandy. Here are the paths to some completed runs:

SorooshMani-NOAA commented 6 months ago

I added a fix on the bugfix/minor branch. Since we work off of that branch, it's OK to close this if the fix works.

SorooshMani-NOAA commented 6 months ago

@FariborzDaneshvar-NOAA if this is resolved, please close this ticket. Thanks!

FariborzDaneshvar-NOAA commented 6 months ago

@SorooshMani-NOAA For some cases it was still getting stuck in the middle of the process, or hit a memory error (even on the compute node)! However, a new run with the updated workflow went through and, in most cases, completed the processes that did not complete with the first approach; there are still some cases whose post-processing failed while calculating sensitivities, percentiles, or probabilities, though.

SorooshMani-NOAA commented 6 months ago

Thank you @FariborzDaneshvar-NOAA. As we discussed, let's keep this open until we figure out the failed cases. Hopefully exclusive nodes will resolve this. Then we'll create another ticket to write optimized combine and analyze scripts from scratch to address the post-processing bottleneck.

FariborzDaneshvar-NOAA commented 6 months ago

@SorooshMani-NOAA The previously incomplete analyses for high lead times of Sandy and Dorian also completed with Dask and the exclusive flag! Thanks for your help! Feel free to close this issue.
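
For documentation, a minimal sketch of the kind of Dask setup that bounds per-worker memory on an exclusive node (the worker counts and limits are illustrative, not the workflow's actual configuration):

# Illustrative only: run the analysis against a local Dask cluster with
# an explicit per-worker memory limit, so workers spill to disk instead
# of being killed by out-of-memory errors.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,            # illustrative; size to the exclusive node
    threads_per_worker=2,
    memory_limit='16GB',    # per-worker cap; tune to available RAM
)
client = Client(cluster)
print(client.dashboard_link)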

SorooshMani-NOAA commented 6 months ago

OK, thanks for trying different things to make it work. We have to start addressing the post-processing bottlenecks sometime soon. I'll close this for now. Thanks again!