pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Trollflow2 stops processing if plugin gets stuck (messages go unprocessed, no error message in log) #100

Closed gerritholl closed 3 years ago

gerritholl commented 3 years ago

Describe the bug

I configured trollflow2 to call the plugin trollflow2.plugins.covers. When this plugin gets stuck, which in this case happens due to the pyresample bug https://github.com/pytroll/pyresample/issues/329, trollflow2 does not recover but stops processing permanently. The cause is not apparent from the log messages, and it took substantial detective work with pyrasite to identify the root cause in the first place. For example, in our test production chain running the latest trollflow2 master (more specifically the branch corresponding to #97), trollflow2 stopped processing incoming messages on 2021-02-07 at 12:58 UTC. There are no warning or error messages that might explain this. At 12:55:03, two messages were received and processed. At 12:58:05, two messages were received but not processed, nor were any subsequent messages until I restarted trollflow2 the following day. Since the restart, messages have been processed again.

[DEBUG: 2021-02-07 12:55:03 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:55:03.001058 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:07:03", "end_time": "2021-02-07T11:16:03", "processing_time": "2021-02-07T11:39:48", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "nparktis1km", "collection": [{"start_time": "2021-02-07T11:07:03", "end_time": "2021-02-07T11:10:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207110703Z_20210207111003Z_N_O_20210207113948Z", "uid": "AVHR_xxx_1B_M02_20210207110703Z_20210207111003Z_N_O_20210207113948Z"}, {"start_time": "2021-02-07T11:10:03", "end_time": "2021-02-07T11:13:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111003Z_20210207111303Z_N_O_20210207114013Z", "uid": "AVHR_xxx_1B_M02_20210207111003Z_20210207111303Z_N_O_20210207114013Z"}, {"start_time": "2021-02-07T11:13:03", "end_time": "2021-02-07T11:16:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z", "uid": "AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z"}]}
[DEBUG: 2021-02-07 12:55:03 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:55:03.002003 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:13:03", "end_time": "2021-02-07T11:16:03", "processing_time": "2021-02-07T11:40:37", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "npgroenland2200m", "collection": [{"start_time": "2021-02-07T11:13:03", "end_time": "2021-02-07T11:16:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z", "uid": "AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z"}]}
[INFO: 2021-02-07 12:55:03 : trollflow2.launcher] Distributed processing not configured, using default scheduler
[INFO: 2021-02-07 12:55:03 : trollflow2.plugins] Generating scene
(...)
[DEBUG: 2021-02-07 12:55:48 : trollflow2.launcher] Cleaning up
[INFO: 2021-02-07 12:55:49 : trollflow2.launcher] All files produced nominally in 0:00:46.061789.
[INFO: 2021-02-07 12:55:49 : trollflow2.launcher] Distributed processing not configured, using default scheduler
[INFO: 2021-02-07 12:55:49 : trollflow2.plugins] Generating scene
(...)
[DEBUG: 2021-02-07 12:56:25 : trollflow2.launcher] Cleaning up
[INFO: 2021-02-07 12:56:25 : trollflow2.launcher] All files produced nominally in 0:00:36.357745.
[DEBUG: 2021-02-07 12:58:05 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:58:05.614928 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:16:03", "end_time": "2021-02-07T11:19:03", "processing_time": "2021-02-07T12:55:34", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "nparktis1km", "collection": [{"start_time": "2021-02-07T11:16:03", "end_time": "2021-02-07T11:19:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z", "uid": "AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z"}]}
[DEBUG: 2021-02-07 12:58:05 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:58:05.615918 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:16:03", "end_time": "2021-02-07T11:19:03", "processing_time": "2021-02-07T12:55:34", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "npgroenland2200m", "collection": [{"start_time": "2021-02-07T11:16:03", "end_time": "2021-02-07T11:19:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z", "uid": "AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z"}]}
[DEBUG: 2021-02-07 12:58:43 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:58:43.487091 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:19:03", "end_time": "2021-02-07T11:22:03", "processing_time": "2021-02-07T12:56:47", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "nparktis1km", "collection": [{"start_time": "2021-02-07T11:19:03", "end_time": "2021-02-07T11:22:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z", "uid": "AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z"}]}
[DEBUG: 2021-02-07 12:58:43 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T12:58:43.489299 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:19:03", "end_time": "2021-02-07T11:22:03", "processing_time": "2021-02-07T12:56:47", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "npgroenland2200m", "collection": [{"start_time": "2021-02-07T11:19:03", "end_time": "2021-02-07T11:22:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z", "uid": "AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z"}]}
[DEBUG: 2021-02-07 13:01:03 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-07T13:01:03.001442 v1.01 application/json {"path": "", "start_time": "2021-02-07T11:13:03", "end_time": "2021-02-07T11:22:03", "processing_time": "2021-02-07T11:40:37", "sensor": "avhrr/3", "platform_name": "Metop-A", "collection_area_id": "euro_ma", "collection": [{"start_time": "2021-02-07T11:13:03", "end_time": "2021-02-07T11:16:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z", "uid": "AVHR_xxx_1B_M02_20210207111303Z_20210207111603Z_N_O_20210207114037Z"}, {"start_time": "2021-02-07T11:16:03", "end_time": "2021-02-07T11:19:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z", "uid": "AVHR_xxx_1B_M02_20210207111603Z_20210207111903Z_N_O_20210207125534Z"}, {"start_time": "2021-02-07T11:19:03", "end_time": "2021-02-07T11:22:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z", "uid": "AVHR_xxx_1B_M02_20210207111903Z_20210207112203Z_N_O_20210207125647Z"}]}

There was no traceback reported around 12:58. The most recent traceback in trollflow2.log was reported at 10:00:24 and related to channel 3B (KeyError: "Could not load DataID(name='3b', ...) from any provided files"). Since this was nearly three hours before processing of incoming messages stopped, and I have seen this message several times per day on 6 and 7 February, I believe it is unrelated.

I am running all processing via supervisord. The supervisord logfile does not contain any additional information or anything unusual around the indicated time. At the time, supervisorctl showed all processes as running with the same uptime.

I had started trollflow2 with:

/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py -n localhost /opt/pytroll/pytroll_inst/config/trollflow2.yaml

No other Pytroll packages in the chain reported any tracebacks on 2021-02-07.

Apart from the tracebacks reported above, no messages were reported with loglevel ERROR. There were some messages reported with loglevel WARNING, but I believe those were also unrelated as they were logged every time:

[WARNING: 2021-02-07 12:50:55 : satpy.scene] The following datasets were not created and may require resampling to be generated: DataID(name='night_fog'), DataID(name='day_microphysics'), DataID(name='cloudtop'), DataID(name='night_microphysics'), DataID(name='overview'), DataID(name='ir108_3d'), DataID(name='ir_cloud_day')
[WARNING: 2021-02-07 12:50:56 : pyspectral.rsr_reader] Inconsistent instrument/satellite input - instrument set to avhrr/3
[WARNING: 2021-02-07 12:50:56 : pyspectral.rsr_reader] Inconsistent instrument/satellite input - instrument set to avhrr/3
[WARNING: 2021-02-07 12:50:57 : pyspectral.rsr_reader] Inconsistent instrument/satellite input - instrument set to avhrr/3
[WARNING: 2021-02-07 12:55:03 : satpy.readers.yaml_reader] No filenames found for reader: avhrr_l1b_aapp
[WARNING: 2021-02-07 12:55:16 : satpy.scene] The following datasets were not created and may require resampling to be generated: DataID(name='night_fog'), DataID(name='ir108_3d'), DataID(name='cloudtop'), DataID(name='night_microphysics')
[WARNING: 2021-02-07 12:55:49 : satpy.readers.yaml_reader] No filenames found for reader: avhrr_l1b_aapp
[WARNING: 2021-02-07 12:55:55 : satpy.scene] The following datasets were not created and may require resampling to be generated: DataID(name='night_fog'), DataID(name='ir108_3d'), DataID(name='cloudtop'), DataID(name='night_microphysics')

To Reproduce

The plugin covers gets stuck due to https://github.com/pytroll/pyresample/issues/329. This MCVE triggers the same problem, although that is pyresample's fault rather than trollflow2's:

from datetime import datetime

from satpy.utils import debug_on
from trollflow2.plugins import get_scene_coverage

debug_on()
get_scene_coverage("Metop-C", datetime(2021, 2, 9, 14, 22, 3),
                   datetime(2021, 2, 9, 16, 4, 3), "avhrr/3", "npgroenland2200m")

The area npgroenland2200m is defined as:

npgroenland2200m:
  description: npgroenland2200m
  projection:
    proj: stere
    lon_0: -30.0
    lat_0: 72.0
    lat_ts: 60.0
    ellps: WGS84
  shape:
    height: 1280
    width: 1024
  area_extent:
    lower_left_xy: [-1139231.5127295423, -1463534.2335492456]
    upper_right_xy: [1225459.1694661214, 1487863.414857883]

Due to https://github.com/pytroll/pyresample/issues/329 this never completes.

Expected behavior

I expect that a bug triggered or caused by a plugin, such as this bug in pyresample, should not cause trollflow2 to stop processing permanently. There should be a timeout option, so that the user can configure that if a plugin has not completed after (for example) 5 minutes, trollflow2 gives up on that job and moves on to processing the next message.
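A timeout of this kind could be sketched as a wrapper that runs the plugin in a child process and terminates it when it overruns. This is only an illustration, not trollflow2 code: `run_with_timeout` and the 300-second default are made up here, and it assumes a Unix host (fork start method) and a plugin that is a plain callable:

```python
import multiprocessing

# Use the fork start method explicitly (Unix-only sketch); spawn/forkserver
# would additionally require the target to be picklable by import.
_ctx = multiprocessing.get_context("fork")


def run_with_timeout(func, args=(), kwargs=None, timeout=300):
    """Run func in a child process; kill it if it exceeds timeout seconds.

    Sketch only: a real implementation would also need to return the
    plugin's result (e.g. via a Queue) and propagate its exceptions.
    """
    proc = _ctx.Process(target=func, args=args, kwargs=kwargs or {})
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # SIGTERM; escalate to proc.kill() if ignored
        proc.join()
        raise TimeoutError(f"{func!r} did not finish within {timeout} s")


# hypothetical use around a stuck plugin:
# run_with_timeout(covers, (job,), timeout=300)
```

Because the plugin runs in a separate process, even a hang inside a C extension (which no Python-level exception can interrupt) can be abandoned this way.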

Actual results

In the operational version: see above.

The MCVE produces the following output, then nothing more, just 100% CPU usage.

[DEBUG: 2021-02-10 11:03:57 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-10 11:03:57 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-10 11:03:57 : pyorbital.tlefile] Fetch TLE from the internet.
[DEBUG: 2021-02-10 11:03:58 : trollsched.boundary] Instrument = avhrr/3
^CTraceback (most recent call last):
  File "../debug-trollflow2.py", line 4, in <module>
    get_scene_coverage("Metop-C", datetime(2021, 2, 9, 14, 22, 3),
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/trollflow2/plugins/__init__.py", line 415, in get_scene_coverage
    return 100 * overpass.area_coverage(area_def)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/trollsched/satpass.py", line 242, in area_coverage
    inter = self.boundary.contour_poly.intersection(area_boundary)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/pyresample/spherical.py", line 495, in intersection
    return self._bool_oper(other, -1)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/pyresample/spherical.py", line 476, in _bool_oper
    inter, edge2 = edge1.get_next_intersection(narcs2, inter)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/pyresample/spherical.py", line 305, in get_next_intersection
    inter = self.intersection(arc)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/pyresample/spherical.py", line 291, in intersection
    cd_ = c__.hdistance(d__)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/pyresample/spherical.py", line 86, in hdistance
    return 2 * np.arcsin((np.sin((point.lat - self.lat) / 2.0) ** 2.0 +
KeyboardInterrupt

In the operational version, I have also used pyrasite-shell to connect to the stuck instance. This reveals that trollflow2 is indeed stuck in the intersection calculation and that the ultimate cause is therefore https://github.com/pytroll/pyresample/issues/329:

>>> traceback.print_stack(sys._current_frames()[139653520578368])
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py", line 80, in <module>
    main()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py", line 75, in main
    run(prod_list, topics=topics, test_message=test_message,
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/launcher.py", line 152, in run
    proc.start()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/popen_fork.py", line 75, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/launcher.py", line 268, in process
    cwrk.pop('fun')(job, **cwrk)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/plugins/__init__.py", line 403, in covers
    cov = get_scene_coverage(platform_name, start_time, end_time,
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/plugins/__init__.py", line 425, in get_scene_coverage
    return 100 * overpass.area_coverage(area_def)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollsched/satpass.py", line 242, in area_coverage
    inter = self.boundary.contour_poly.intersection(area_boundary)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 494, in intersection
    return self._bool_oper(other, -1)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 475, in _bool_oper
    inter, edge2 = edge1.get_next_intersection(narcs2, inter)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 305, in get_next_intersection
    inter = self.intersection(arc)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 284, in intersection
    for i in self.intersections(other_arc):
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 257, in intersections
    eb_ = other_arc.start.cross2cart(other_arc.end).normalize()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 58, in cross2cart
    res = CCoordinate(np.array([-ad * c + be * f,
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 112, in __init__
    self.cart = np.array(cart)

Environment Info:

Additional context

I wonder if there could be a general timeout configured in message/job processing here:

https://github.com/pytroll/trollflow2/blob/6f9815ad141e14dfaa8fb7b2a53ec83b97c86e6f/trollflow2/launcher.py#L265-L270

although of course the pyresample bug needs to be fixed in any case, in theory any plugin could get stuck for any reason or for no reason at all, and it would be nice if trollflow2 could recover from that.
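On Unix, a per-job timeout around that spot could also be implemented with SIGALRM, which would have interrupted this particular hang (the MCVE above is interruptible, as the KeyboardInterrupt traceback shows). A minimal sketch, assuming the plugin call runs in the main thread of the worker process; `time_limit` and the example 300 s are hypothetical, not trollflow2 API:

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`.

    Unix-only and main-thread-only (SIGALRM); cannot interrupt code stuck
    inside a C extension that never returns to the interpreter.
    """
    def _handler(signum, frame):
        raise TimeoutError(f"timed out after {seconds} s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)


# hypothetical use inside the per-plugin loop in launcher.py:
# with time_limit(300):
#     cwrk.pop('fun')(job, **cwrk)
```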

gerritholl commented 3 years ago

Something peculiar happened just before trollflow2 gave up (which has just happened again). I set our timeout for EUMETSAT GDS data very high (2 hours), because sometimes we receive files that are already more than 90 minutes old when they arrive, and gatherer would otherwise conclude an immediate timeout, shipping out each granule on its own. With this long timeout, trollflow2 received the message below covering 14:22 - 16:04. This message comes from gatherer and describes a collection of files covering 14:22 - 14:25, 14:25 - 14:28, and 16:01 - 16:04, i.e. including a granule from the next orbit. Trollflow2 concludes that the region nparktis1km is covered by 62% and spantarktis3km by 49%, then reads the TLE several more times (I am not sure why it does this), and then nothing more happens. It does not proceed to generate any products; it just stops. From then on, it logs the messages it receives, but nothing else. Example from 2021-02-09:

[DEBUG: 2021-02-09 16:34:03 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-09T16:34:03.001578 v1.01 application/json {"path": "", "start_time": "2021-02-09T14:22:03", "end_time": "2021-02-09T16:04:03", "processing_time": "2021-02-09T16:02:11", "sensor": "avhrr/3", "platform_name": "Metop-C", "collection_area_id": "npgroenland2200m", "collection": [{"start_time": "2021-02-09T14:22:03", "end_time": "2021-02-09T14:25:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209142203Z_20210209142503Z_N_O_20210209160211Z", "uid": "AVHR_xxx_1B_M03_20210209142203Z_20210209142503Z_N_O_20210209160211Z"}, {"start_time": "2021-02-09T14:25:03", "end_time": "2021-02-09T14:28:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209142503Z_20210209142803Z_N_O_20210209160325Z", "uid": "AVHR_xxx_1B_M03_20210209142503Z_20210209142803Z_N_O_20210209160325Z"}, {"start_time": "2021-02-09T16:01:03", "end_time": "2021-02-09T16:04:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209160103Z_20210209160403Z_N_O_20210209162744Z", "uid": "AVHR_xxx_1B_M03_20210209160103Z_20210209160403Z_N_O_20210209162744Z"}]}
[INFO: 2021-02-09 16:34:03 : trollflow2.launcher] Distributed processing not configured, using default scheduler
[INFO: 2021-02-09 16:34:03 : trollflow2.plugins] Generating scene
[DEBUG: 2021-02-09 16:34:03 : satpy.readers.yaml_reader] Reading ('/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/satpy/etc/readers/avhrr_l1b_aapp.yaml',)
[WARNING: 2021-02-09 16:34:03 : satpy.readers.yaml_reader] No filenames found for reader: avhrr_l1b_aapp
[DEBUG: 2021-02-09 16:34:03 : satpy.readers.yaml_reader] Reading ('/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/satpy/etc/readers/avhrr_l1b_eps.yaml',)
[DEBUG: 2021-02-09 16:34:03 : satpy.readers.yaml_reader] Assigning to avhrr_l1b_eps: ['/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209160103Z_20210209160403Z_N_O_20210209162744Z', '/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209142503Z_20210209142803Z_N_O_20210209160325Z', '/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M03_20210209142203Z_20210209142503Z_N_O_20210209160211Z']
[DEBUG: 2021-02-09 16:34:03 : satpy.composites.config_loader] Looking for composites config file avhrr-3.yaml
[DEBUG: 2021-02-09 16:34:03 : satpy.composites.config_loader] Looking for composites config file visir.yaml
[DEBUG: 2021-02-09 16:34:03 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-09 16:34:03 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-09 16:34:03 : pyorbital.tlefile] Reading TLE from /data/pytroll/IN/TLE/weather202102080510.tle
[DEBUG: 2021-02-09 16:34:03 : trollsched.boundary] Instrument = avhrr/3
[INFO: 2021-02-09 16:34:04 : trollflow2.plugins] Area coverage 10.05 % below threshold 15.00 %
[INFO: 2021-02-09 16:34:04 : trollflow2.plugins] Removing area eurol from the worklist
[DEBUG: 2021-02-09 16:34:04 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-09 16:34:04 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-09 16:34:04 : pyorbital.tlefile] Reading TLE from /data/pytroll/IN/TLE/weather202102080510.tle
[DEBUG: 2021-02-09 16:34:04 : trollsched.boundary] Instrument = avhrr/3
[INFO: 2021-02-09 16:34:11 : trollflow2.plugins] Area coverage 0.00 % below threshold 15.00 %
[INFO: 2021-02-09 16:34:11 : trollflow2.plugins] Removing area nqceur1km from the worklist
[DEBUG: 2021-02-09 16:34:11 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-09 16:34:11 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-09 16:34:11 : pyorbital.tlefile] Reading TLE from /data/pytroll/IN/TLE/weather202102080510.tle
[DEBUG: 2021-02-09 16:34:12 : trollsched.boundary] Instrument = avhrr/3
[DEBUG: 2021-02-09 16:34:19 : trollflow2.plugins] Area coverage 62.48% above threshold 15.00% - Carry on with nparktis1km
[DEBUG: 2021-02-09 16:34:19 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-09 16:34:19 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-09 16:34:19 : pyorbital.tlefile] Reading TLE from /data/pytroll/IN/TLE/weather202102080510.tle
[DEBUG: 2021-02-09 16:34:19 : trollsched.boundary] Instrument = avhrr/3
[DEBUG: 2021-02-09 16:34:22 : trollflow2.plugins] Area coverage 48.66% above threshold 15.00% - Carry on with spantarktis3km
[DEBUG: 2021-02-09 16:34:22 : trollsched.satpass] kwargs: {'instrument': 'avhrr/3'}
[DEBUG: 2021-02-09 16:34:22 : trollsched.satpass] instrument: avhrr/3
[DEBUG: 2021-02-09 16:34:22 : pyorbital.tlefile] Reading TLE from /data/pytroll/IN/TLE/weather202102080510.tle
[DEBUG: 2021-02-09 16:34:22 : trollsched.boundary] Instrument = avhrr/3
[DEBUG: 2021-02-09 16:55:23 : Listener] New message received: pytroll://aapp/avhrr file pytroll@oflks333.dwd.de 2021-02-09T16:55:23.783803 v1.01 application/json {"path": "", "start_time": "2021-02-09T16:39:38", "platform_name": "NOAA-19", "uri": "file://localhost/data/pytroll/TMP/AAPP-processed/noaa19_20210209_1639_61862/hrpt_noaa19_20210209_1639_61862.l1b", "uid": "hrpt_noaa19_20210209_1639_61862.l1b", "sensor": "avhrr/3", "orig_platform_name": "NOAA_19", "filename": "hrpt_noaa19_20210209_1639_61862.l1b", "orbit_number": 61862, "format": "AAPP", "type": "Binary", "data_processing_level": "1B", "end_time": "2021-02-09T16:55:07.354000", "station": "unknown", "env": "test"}
[DEBUG: 2021-02-09 16:59:54 : Listener] New message received: pytroll://aapp/avhrr file pytroll@oflks333.dwd.de 2021-02-09T16:59:54.243837 v1.01 application/json {"path": "", "start_time": "2021-02-09T16:42:51", "end_time": "2021-02-09T16:51:22", "processing_time": "2021-02-09T16:42:51", "sensor": "avhrr/3", "platform_name": "Metop-B", "collection_area_id": "euro_ma", "filename": "hrpt_metop01_20210209_1642_43573.l1b", "uri": "file://localhost/data/pytroll/TMP/AAPP-processed/metop01_20210209_1642_43573/hrpt_metop01_20210209_1642_43573.l1b", "uid": "hrpt_metop01_20210209_1642_43573.l1b", "orbit_number": 43573, "format": "AAPP", "type": "Binary", "data_processing_level": "1B", "station": "unknown", "env": "test"}
[DEBUG: 2021-02-09 17:00:09 : Listener] New message received: pytroll://collection/metop/avhrr/eumetcast collection pytroll@oflks333.dwd.de 2021-02-09T17:00:09.597869 v1.01 application/json {"path": "", "start_time": "2021-02-09T15:52:03", "end_time": "2021-02-09T16:13:03", "processing_time": "2021-02-09T16:10:41", "sensor": "avhrr/3", "platform_name": "Metop-B", "collection_area_id": "spantarktis3km", "collection": [{"start_time": "2021-02-09T15:52:03", "end_time": "2021-02-09T15:55:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209155203Z_20210209155503Z_N_O_20210209161041Z", "uid": "AVHR_xxx_1B_M01_20210209155203Z_20210209155503Z_N_O_20210209161041Z"}, {"start_time": "2021-02-09T15:55:03", "end_time": "2021-02-09T15:58:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209155503Z_20210209155803Z_N_O_20210209161112Z", "uid": "AVHR_xxx_1B_M01_20210209155503Z_20210209155803Z_N_O_20210209161112Z"}, {"start_time": "2021-02-09T15:58:03", "end_time": "2021-02-09T16:01:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209155803Z_20210209160103Z_N_O_20210209165239Z", "uid": "AVHR_xxx_1B_M01_20210209155803Z_20210209160103Z_N_O_20210209165239Z"}, {"start_time": "2021-02-09T16:01:03", "end_time": "2021-02-09T16:04:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209160103Z_20210209160403Z_N_O_20210209165346Z", "uid": "AVHR_xxx_1B_M01_20210209160103Z_20210209160403Z_N_O_20210209165346Z"}, {"start_time": "2021-02-09T16:04:03", "end_time": "2021-02-09T16:07:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209160403Z_20210209160703Z_N_O_20210209165458Z", "uid": "AVHR_xxx_1B_M01_20210209160403Z_20210209160703Z_N_O_20210209165458Z"}, {"start_time": "2021-02-09T16:07:03", "end_time": "2021-02-09T16:10:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209160703Z_20210209161003Z_N_O_20210209165611Z", "uid": "AVHR_xxx_1B_M01_20210209160703Z_20210209161003Z_N_O_20210209165611Z"}, {"start_time": "2021-02-09T16:10:03", 
"end_time": "2021-02-09T16:13:03", "uri": "/data/pytroll/IN/EMC_AVR/AVHR_xxx_1B_M01_20210209161003Z_20210209161303Z_N_O_20210209165723Z", "uid": "AVHR_xxx_1B_M01_20210209161003Z_20210209161303Z_N_O_20210209165723Z"}]}
[DEBUG: 2021-02-09 17:05:53 : Listener] New message received: pytroll://aapp/avhrr file pytroll@oflks333.dwd.de 2021-02-09T17:05:53.865844 v1.01 application/json {"path": "", "start_time": "2021-02-09T16:42:51", "end_time": "2021-02-09T16:51:22", "processing_time": "2021-02-09T16:42:51", "sensor": "avhrr/3", "platform_name": "Metop-B", "collection_area_id": "nparktis1km", "filename": "hrpt_metop01_20210209_1642_43573.l1b", "uri": "file://localhost/data/pytroll/TMP/AAPP-processed/metop01_20210209_1642_43573/hrpt_metop01_20210209_1642_43573.l1b", "uid": "hrpt_metop01_20210209_1642_43573.l1b", "orbit_number": 43573, "format": "AAPP", "type": "Binary", "data_processing_level": "1B", "station": "unknown", "env": "test"}

This is actually different from 2021-02-07, when the last file processed before trollflow2 stopped was produced successfully, and I do not see a comparably long period spanning granules from different orbits there, so I am not sure there is a causal link.

gerritholl commented 3 years ago

Meanwhile, satpy_launcher is still running:

$ pt 21069
satpy_launcher.(21069,pytroll)-+-satpy_launcher.(1071)
                               |-{satpy_launcher.}(21086)
                               |-{satpy_launcher.}(21087)
                               `-{satpy_launcher.}(21088)
gerritholl commented 3 years ago

satpy_launcher.(1071) is using 100% CPU.

gerritholl commented 3 years ago

I used pyrasite-shell and found a traceback indicating it is stuck in trollflow2.plugins.get_scene_coverage(), which calls trollsched.satpass.area_coverage, which calls pyresample, ultimately getting stuck around self.cart = np.array(cart), using 100% CPU and apparently never finishing. Memory consumption remains small, though. Using pyrasite-shell:

$ pyrasite-shell 1071
Pyrasite Shell 2.0
Connected to '/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/python3.8 /opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py -n localhost /opt/pytroll/pytroll_inst/config/trollflow2.yaml'                                                                                               
Python 3.8.6 | packaged by conda-forge | (default, Dec 26 2020, 05:05:16)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(DistantInteractiveConsole)

>>> import sys
>>> sys._current_frames()
{139652793759488: <frame at 0x7f034b2c9040, file '<console>', line 1, code <module>>, 139653520578368: <frame at 0x7f034b232ac0, file '/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py', line 112, code __init__>}                                                
>>> traceback.print_stack(list(frms.values())[1])
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py", line 80, in <module>
    main()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/bin/satpy_launcher.py", line 75, in main
    run(prod_list, topics=topics, test_message=test_message,
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/launcher.py", line 152, in run
    proc.start()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/popen_fork.py", line 75, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/launcher.py", line 268, in process
    cwrk.pop('fun')(job, **cwrk)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/plugins/__init__.py", line 403, in covers
    cov = get_scene_coverage(platform_name, start_time, end_time,
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollflow2/plugins/__init__.py", line 425, in get_scene_coverage
    return 100 * overpass.area_coverage(area_def)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/trollsched/satpass.py", line 242, in area_coverage
    inter = self.boundary.contour_poly.intersection(area_boundary)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 494, in intersection
    return self._bool_oper(other, -1)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 475, in _bool_oper
    inter, edge2 = edge1.get_next_intersection(narcs2, inter)
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 326, in get_next_intersection
    return None, None
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 298, in intersection
    return None
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 264, in intersections
    return (SCoordinate(lon, lat),
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 62, in cross2cart
    return res
  File "/opt/pytroll/pytroll_inst/miniconda3/envs/pytroll-py38/lib/python3.8/site-packages/pyresample/spherical.py", line 112, in __init__
    self.cart = np.array(cart)
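For future debugging of hangs like this, a stdlib-only alternative to attaching with pyrasite would be to register a faulthandler dump at launcher startup; a sketch, assuming a Unix host (SIGUSR1), not something trollflow2 currently does:

```python
import faulthandler
import signal

# Registering once at startup means an operator can later dump the Python
# stacks of all threads of a stuck process to stderr from the shell with
#   kill -USR1 <pid-of-satpy_launcher.py>
# without attaching a debugger. Unix-only, since it relies on SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

faulthandler writes directly from the signal handler, so it works even while the interpreter is busy in a tight loop, though like SIGALRM it cannot report from inside a C call that never checks for signals.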
gerritholl commented 3 years ago

The cause is in pyresample: https://github.com/pytroll/pyresample/issues/329. Although the bug is in pyresample, I wonder if trollflow2 should have a timeout somewhere, so that it can "give up and move on" when something that should succeed in a fraction of a second takes so long that something must be wrong.

adybbroe commented 3 years ago

Great detective work @gerritholl Thanks for this.