openego / eGon-data


Coordination of server use #377

Open · IlkaCu opened this issue 3 years ago

IlkaCu commented 3 years ago

This issue is meant to coordinate the use of the egondata user/instance on our server in FL. We already agreed on starting a clean run of the dev branch every Friday. This will (most likely) make some debugging necessary on Mondays. To avoid conflicts while debugging, please comment in this issue before you start and briefly note which datasets or parts of the workflow you will be working on.

ClaraBuettner commented 3 years ago

The run started on 6 August is not finished yet; the task `industry.temporal.insert-osm-ind-load` is still running. Two tasks failed:

Both failures were caused by subst_ids that were present in the mv_grid table but, due to the new osmTGmod run, missing from the eTraGo buses. When I enforced a re-run of the MV grid dataset, the tasks finished successfully. The migration of osmTGmod to datasets solves this problem; since it will be merged into dev soon, I will not look for another intermediate solution.
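For illustration, a minimal sketch of the kind of mismatch check involved. All table and column names here are assumptions based on the comment, not the actual schema:

```python
# Hedged sketch: find subst_ids present on the MV grid side but without a
# matching eTraGo bus -- the inconsistency described above. Table/column
# names and the connection URL are placeholders, not the real schema.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/egon-data")

orphaned = pd.read_sql(
    """
    SELECT mv.subst_id
    FROM grid.egon_mv_grid_district AS mv
    LEFT JOIN grid.egon_etrago_bus AS bus
        ON mv.subst_id = bus.bus_id
    WHERE bus.bus_id IS NULL
    """,
    engine,
)
print(f"{len(orphaned)} MV grid districts without a matching eTraGo bus")
```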

nesnoj commented 3 years ago

The new branch for the Friday runs that @gnn was talking about does not exist yet, right?

nesnoj commented 3 years ago

@nailend and I would like to have the branch `features/#256-hh-load-area-profile-generator` tested prior to merging it into dev. @gnn, could you please merge it into the Friday branch before you start? Thx!

ClaraBuettner commented 3 years ago

> The new branch for the Friday runs that @gnn was talking about does not exist yet, right?

I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend

nesnoj commented 3 years ago

> I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend

Thank you, I didn't copy the name during the web conference and the docs have not been updated yet. I merged my branch into `continuous-integration/run-everything-over-the-weekend`. Ready for takeoff!

nesnoj commented 3 years ago

Apparently, there has been no run on Friday?!

nesnoj commented 3 years ago

> Apparently, there has been no run on Friday?!

May I start it today? @gnn

AmeliaNadal commented 3 years ago

That would be great, yes!

IlkaCu commented 3 years ago

gnn told me that he started a clean run on Friday, but I haven't checked the results yet.

nesnoj commented 3 years ago

> gnn told me that he started a clean run on Friday, but I haven't checked the results yet.

Ah, I just noticed that he didn't use the image we used before but created a new one. But I don't know which HTTP port it's listening on... :( @gnn?

nesnoj commented 3 years ago

Got it, it's port 9001 (do you know how to reconfigure the tunnel, @AmeliaNadal?).

Apparently, it crashed quite early at tasks `osmtgmod.import-osm-data` and `electricity_demand.temporal.insert-cts-load` :disappointed:.

It's very likely that the first one was caused by insufficient disk space, as there are only 140 GB free (after cleaning up temp files) and that might not be sufficient for the temp tables created by osmTGmod. So I propose to delete my old setup we used before and re-run the new one. Shall I do so? Any objections, @IlkaCu, @AmeliaNadal?

AmeliaNadal commented 3 years ago

I could access the results (thanks for asking, @nesnoj!) and my tasks haven't run. So I have no objection to you re-running the workflow ;)

nesnoj commented 3 years ago

Done.

Update: `osmtgmod.import-osm-data` has run successfully :D

nesnoj commented 3 years ago

I'm done on the server and happy; go ahead, @IlkaCu.

nesnoj commented 3 years ago

@IlkaCu and I decided to restart the weekend run tonight. I merged dev into `continuous-integration/run-everything-over-the-weekend` and I'm now done with all my stuff ... please go ahead, @IlkaCu.

IlkaCu commented 3 years ago

I merged one bug fix into `continuous-integration/run-everything-over-the-weekend`.

IlkaCu commented 3 years ago

I merged another bug fix: ee038e40d7a0ce7b7b11cb50b51d8757f65337d9. @nesnoj, I hope this works now.

nesnoj commented 3 years ago

> I merged another bug fix: ee038e4. @nesnoj, I hope this works now.

Yepp, looks good :+1: Run started :runner:

IlkaCu commented 3 years ago

Great, thank you.

IlkaCu commented 3 years ago

If I see it right, the server run in normal mode was successful. :partying_face: This means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole continuous-integration branch into dev (I guess gnn would prefer this option)?

nesnoj commented 3 years ago

> If I see it right, the server run in normal mode was successful. :partying_face:

Awesome!

> This means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole continuous-integration branch into dev (I guess gnn would prefer this option)?

Generally I'm fine with both options, but I guess there might be some additional checks necessary (at least in #260) before it can be merged into dev. I reckon there will be some more commits in the branches, so separate merging via PRs seems cleaner to me.

nesnoj commented 3 years ago

A task of mine failed due to some column name adjustments in 5b7d9f2dc22f989292de7e979c421b2fae3914a3. I had to clear some stuff; the affected tasks are re-running now...

gnn commented 3 years ago

I see that I missed an open question last week. Sorry for that.

> This means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole continuous-integration branch into dev (I guess gnn would prefer this option)?

Since the CI branch might contain changes which are working but not yet meant to be merged into dev, I'm in favour of merging tested feature branches into dev individually. This also makes it easier to figure out where a change came from, which is important when trying to fix bugs that are discovered later on. Hence my :+1: to @nesnoj's comment. :)

For anybody running into the issue of having to resolve the same conflicts multiple times because of this, have a look at git's `rerere.enabled` option, which makes git automatically reuse known conflict resolutions. You can switch on that option via `git config --global rerere.enabled true` for all your repositories, or via `git config --local rerere.enabled true` inside a repository if you only want to switch it on for that particular repository.

nesnoj commented 3 years ago

> For anybody running into the issue of having to resolve the same conflicts multiple times because of this, have a look at git's `rerere.enabled` option, which makes git automatically reuse known conflict resolutions. You can switch on that option via `git config --global rerere.enabled true` for all your repositories, or via `git config --local rerere.enabled true` inside a repository if you only want to switch it on for that particular repository.

That's exactly what has been most annoying when keeping track of two branches. Thx for the hint! :pray:

BTW @IlkaCu: some of "your" tasks failed in the current run. Also, we get a `No space left on device` in task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid` for some reason, but there are 300 GB free :monocle_face:

ClaraBuettner commented 3 years ago

> Also, we get a `No space left on device` in task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid` for some reason, but there are 300 GB free :monocle_face:

This seems to be caused by tasks running in parallel: `electricity_demand.temporal.insert-cts-load` was running at the same time and might have caused this problem. I cleared the task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid`, and it was successful when no other task was running. I don't really know how to solve this properly. Maybe #269 could help, because each task currently uses 8 threads.
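As a hedged illustration of one way to avoid the concurrency (not the pipeline's actual code): resource-heavy tasks could share an Airflow pool with a single slot so the scheduler never runs them at the same time. DAG, task, and pool names below are made up:

```python
# Hedged sketch: serialise heavy tasks through a one-slot Airflow pool so
# they never run in parallel. Create the pool first, e.g. via the CLI:
#   airflow pools set heavy_resources 1 "serialise heavy tasks"
# All names here are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pv_rooftop_per_mv_grid():
    ...  # the disk/memory-heavy workload

with DAG(
    dag_id="example-pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    heavy = PythonOperator(
        task_id="pv-rooftop-per-mv-grid",
        python_callable=pv_rooftop_per_mv_grid,
        pool="heavy_resources",  # every task assigned here runs alone
    )
```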

nesnoj commented 3 years ago

I'd like to test another feature on the server, is that OK with you? (Please comment via :+1: or :-1:.)

nesnoj commented 3 years ago

Done, restarted. @AmeliaNadal, I forgot the name of the task you were talking about this morning (the one not executed). Is it `industrial_gas_demand.insert-industrial-gas-demand`? If so, it's now back in the pipeline, as I merged another PR which included the current dev...

AmeliaNadal commented 3 years ago

Yes, it was this one! Nice!

nesnoj commented 3 years ago

Hmm, the task `import-zensus-misc` has been running for 46 h now, that's odd... Yesterday I started a DE run on a fresh new RLI server (yay!) on the same branch (weekend) and it took 3 h to complete.

I'm curious whether we'll encounter the same problem with the new run tonight...

ClaraBuettner commented 3 years ago

Two tasks failed in the latest run over the weekend:

`industrial_sites.download-import-industrial-sites`: `FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'` @IlkaCu: Can you help?

`power_plants.pv_rooftop.pv-rooftop-per-mv-grid`: `sqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not resize shared memory segment "/PostgreSQL.241700255" to 8388608 bytes: No space left on device`. This error already occurred in the previous run. I was hoping that reducing the number of cores per task would solve it, but it did not.
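One plausible (hedged) explanation for this error: PostgreSQL's parallel query workers exchange data through shared memory, and in a containerised database with a small `/dev/shm` that segment can fill up even when the data disk has plenty of space. Besides enlarging the container's shm size, a session-level workaround could look like this sketch (connection URL is a placeholder):

```python
# Hedged workaround sketch: disable parallel workers for the session that
# runs the heavy query, so PostgreSQL allocates no shared-memory segments
# for worker exchange. The connection URL is a placeholder.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/egon-data")

with engine.connect() as connection:
    # SET is per-session, so the heavy query must run on this same connection.
    connection.execute(text("SET max_parallel_workers_per_gather = 0;"))
    connection.execute(text("SELECT 1;"))  # the failing query would go here
```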

`osm_buildings_streets.extract-buildings` is still running (started 2021-09-10T22:37:37+00:00).

nesnoj commented 3 years ago

> `osm_buildings_streets.extract-buildings` is still running (started 2021-09-10T22:37:37+00:00).

Yes, I'll take care of it...

IlkaCu commented 3 years ago

> `industrial_sites.download-import-industrial-sites`: `FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'` @IlkaCu: Can you help?

I'll take care of this.

IlkaCu commented 3 years ago

> `industrial_sites.download-import-industrial-sites`: `FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'` @IlkaCu: Can you help?
>
> I'll take care of this.

This error was caused by a missing dependency. The task and the ones depending on it are now running on the server.

CarlosEpia commented 3 years ago

I already fixed a bug in the `renewable_feedin` script and will try to clear that task right now.

ClaraBuettner commented 3 years ago

Two tasks (power_plants and chp_plants) failed in the latest run because of the renaming from `subst_id` to `bus_id`. I already fixed this and cleared the tasks, but `power_plants.allocate-conventional-non-chp-power-plants` failed again. It looks like there is an empty GeoDataFrame:

[2021-10-04 08:52:00,854] {logging_mixin.py:120} INFO - Running <TaskInstance: egon-data-processing-pipeline.power_plants.allocate-conventional-non-chp-power-plants 2021-10-01T21:43:54+00:00 [running]> on host at32
[2021-10-04 08:52:01,475] {taskinstance.py:1150} ERROR - Cannot transform naive geometries.  Please set a crs on the object first.
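For reference, a minimal GeoPandas reproduction of that error and the usual fix, assuming the frame simply never had (or lost) its CRS, which is typical for empty GeoDataFrames:

```python
# Minimal sketch of the error above: a "naive" GeoDataFrame (crs=None)
# cannot be reprojected until a CRS is declared on it.
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(geometry=[Point(10.0, 52.0)])  # no CRS attached
# gdf.to_crs(epsg=3035)  # raises: "Cannot transform naive geometries. ..."

gdf = gdf.set_crs(epsg=4326)  # declare the source CRS first ...
gdf = gdf.to_crs(epsg=3035)   # ... then reprojection works
```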
gnn commented 3 years ago

I just realized that I forgot to start the weekend run on Friday. If everybody is OK with that, I'll start it on Monday morning. Saying nothing counts as OK. If I don't get any vetoes by 11:00, I'll give it a go.

nailend commented 3 years ago

Have you started the run yet? I just merged #430 into `continuous-integration/run-everything-over-the-weekend`. It would be nice if you could run it over the weekend. @gnn

gnn commented 3 years ago

Just a (late) heads-up: the code on the CI branch was buggy and didn't want to start. I only managed to fix it on Monday evening and started the run afterwards. I hope that doesn't collide with any other server activity.

IlkaCu commented 3 years ago

It seems that the code on the CI branch is buggy again and the DAG wasn't executed during the weekend. The last commit on the CI branch was 8a96d0bc0641212d8df840de2aa3c53e6c3bb540 by @nesnoj. @nesnoj, @gnn: any ideas how to solve it? Who takes care of it?

fwitte commented 3 years ago

It might be the case that there is no module `substation` in `egon.data.processing`, although it is imported in the pipeline:

https://github.com/openego/eGon-data/blob/8a96d0bc0641212d8df840de2aa3c53e6c3bb540/src/egon/data/airflow/dags/pipeline.py#L71

Searching the file structure, `substation` is only found in the `gas_areas` module of `egon.data.processing`, but neither as a function nor as a class, only within variable names.
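A quick, hedged way to catch such broken imports before a weekend run is wasted would be to import the DAG module directly (module path taken from the links above; assumes an installed egon.data):

```python
# Hedged sketch: importing the pipeline module directly surfaces broken
# imports (like the missing `substation` module) without starting Airflow.
import importlib

try:
    importlib.import_module("egon.data.airflow.dags.pipeline")
except ImportError as error:
    raise SystemExit(f"Pipeline DAG will not load: {error}")
print("Pipeline DAG imports cleanly.")
```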

fwitte commented 3 years ago

Same issue with this import:

https://github.com/openego/eGon-data/blob/8a96d0bc0641212d8df840de2aa3c53e6c3bb540/src/egon/data/airflow/dags/pipeline.py#L61

IlkaCu commented 3 years ago

> Same issue with this import:
>
> https://github.com/openego/eGon-data/blob/8a96d0bc0641212d8df840de2aa3c53e6c3bb540/src/egon/data/airflow/dags/pipeline.py#L61

Thanks for finding that; this is my fault. I will fix it. Edit: I added two commits, d5e0fb7754fa7f489519de60629ff21a7acdf5d2 and f028e94cc7b9229f7c7225e915382f06757af8d8.

nesnoj commented 3 years ago

So did you restart the DAG?

IlkaCu commented 3 years ago

> So did you restart the DAG?

Yes, a couple of minutes ago.

IlkaCu commented 3 years ago

We should maybe think about a new start time for the weekend run. What do you think about Fridays at noon? We would then have the chance to fix bugs on Friday afternoon instead of on Monday or during the weekend.

gnn commented 3 years ago

Wasn't noon, but I started the run at ~14:30 today. There weren't any errors, though, so nothing to fix.

I also got basic checks working again. After you push something onto the CI branch, you can have a look under the GitHub Actions tab and check whether the checks pass or fail for the state you pushed. Might take some time though. You can also check that locally (with the CI branch checked out) by running `tox -e py38-nocov-linux` or `tox -e py37-nocov-linux`, depending on whether Python 3.8 or 3.7 is installed on your system. After the weekend, i.e. once I know that the checks don't interfere with anything, they'll go live on dev, after which you can monitor the status of these checks on each individual PR, and no PRs with failing checks will be allowed to be merged.

ClaraBuettner commented 3 years ago

Three tasks failed in the latest run.

`re_potential_areas.download-datasets`:

...
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 502: Bad Gateway

I have never seen this problem before; after clearing the task, it was successful. So it was probably a temporary problem with the connection.
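For transient failures like this, a retry with backoff around the download would help. A hedged sketch (function name and arguments are illustrative, not the actual dataset code):

```python
# Hedged sketch: absorb transient HTTP 502 errors like the one above by
# retrying the download a few times with a growing pause between attempts.
import time
import urllib.request
from urllib.error import HTTPError

def download_with_retries(url, target, attempts=3, backoff=10):
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, target)
            return
        except HTTPError as error:
            if attempt == attempts:
                raise  # give up after the last attempt
            print(f"Attempt {attempt} failed ({error}), retrying ...")
            time.sleep(backoff * attempt)
```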

`chp.extension-HH`:

...
  File "/home/egondata/git-repos/friday-evening-weekend-run/environment/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  sorry, too many clients already

I think this is not related to the code. @gnn, do you have an idea how to solve or prevent this?

`heat_demand_timeseries.HTS.demand-profile-generator`:

...
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_this_station[
[2021-11-28 06:26:27,032] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGKILL

Looks like there was not enough memory, but the server has about 300 GB of RAM, so I'm not sure how this could have happened. I know that this task needs a lot of RAM, but it was running before. The electricity household profiles were created in parallel; maybe both tasks needed a lot of memory at once and caused this problem. I will try to clear the task again.

nesnoj commented 3 years ago

> The electricity household profiles were created in parallel; maybe both tasks needed a lot of memory at once and caused this problem.

Our HH peak load task should use about 30 to 40 GB of RAM.

gnn commented 3 years ago

> Three tasks failed in the latest run.
>
> `re_potential_areas.download-datasets`:
>
> [..]
>
> `chp.extension-HH`:
>
> ...
>   File "/home/egondata/git-repos/friday-evening-weekend-run/environment/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
>     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
> sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  sorry, too many clients already
>
> I think this is not related to the code. @gnn, do you have an idea how to solve or prevent this?

I've seen this with @AmeliaNadal's non-dockered database previously. The problem occurs when a task tries to connect to the database and the database already has too many open connections. (I also can't exclude the possibility of SQLAlchemy's connection pool being fully used, but I consider that option less likely.) It's hard to solve this once the problem occurs. To prevent it from happening, one should make sure that all SQLAlchemy Sessions are properly closed after they are used. Raw connections should seldom be used and should obviously also be closed immediately after use.
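A minimal sketch of what "properly closed" could look like in practice, using SQLAlchemy's context managers (connection URL is a placeholder; `Session` as a context manager requires SQLAlchemy 1.4+):

```python
# Hedged sketch: context managers guarantee that sessions and raw
# connections are returned to the pool even when a task raises mid-way.
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/egon-data")
Session = sessionmaker(bind=engine)

with Session() as session:            # closed automatically on exit
    session.execute(text("SELECT 1;"))
    session.commit()

with engine.connect() as connection:  # raw connections likewise
    connection.execute(text("SELECT 1;"))
```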

IlkaCu commented 2 years ago

Unfortunately, three tasks in the current CI branch run have failed so far: