IlkaCu opened this issue 3 years ago
The run started on the 6th of August is not finished yet. The task `industry.temporal.insert-osm-ind-load` is still running.
Two tasks failed:

- `heat_etrago.supply`: This fails because some subst_ids of the MV grids are not in the etrago bus table. I assume this happens because the MV grids were already versioned and therefore skipped, while osmTGmod was running again. So even if some IDs changed in osmTGmod, the subst_ids of the MV grids were not updated. I will check this, but will wait until `industry.temporal.insert-osm-ind-load` is finished because it depends on the MV grids.
- `power_plants.wind_farms.insert`: This is the same problem described in #354. Since I cannot reproduce this issue in other instances, I will try to debug it in the clean-run instance.

Both problems were caused by subst_ids which were in the mv_grid table but, due to the new run of osmTGmod, not part of the etrago buses. When I enforced a re-run of the mv-grid dataset, the tasks finished successfully. The migration of osmTGmod to datasets solves this problem. Since that will be merged into dev soon, I will not look for another intermediate solution.
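As a side note, a quick way to spot such orphaned IDs is an anti-join between the two tables. The following is only a hedged sketch: the table and column names (`grid.egon_mv_grid_district`, `grid.egon_etrago_bus`, `subst_id`, `bus_id`) and the connection string are assumptions and may not match the actual schema.

```python
# Hypothetical diagnostic, not part of the pipeline: list subst_ids that are
# present in the MV grid district table but missing from the eTraGo bus table.
# Table/column names and the connection string are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/egon-data")

query = """
    SELECT g.subst_id
    FROM grid.egon_mv_grid_district AS g
    LEFT JOIN grid.egon_etrago_bus AS b
        ON g.subst_id = b.bus_id
    WHERE b.bus_id IS NULL
"""
missing = pd.read_sql(query, engine)
print(f"{len(missing)} MV grid subst_ids without a matching eTraGo bus")
```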
The new branch for the Friday run @gnn was talking about does not exist yet, right?
@nailend and I would like to have the branch `features/#256-hh-load-area-profile-generator` tested prior to merging it into dev. @gnn, could you please merge it into the Friday branch before you start? Thx!
> The new branch for the Friday run @gnn was talking about does not exist yet, right?
I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend
> I think he was talking about this branch: https://github.com/openego/eGon-data/tree/continuous-integration/run-everything-over-the-weekend
Thank you, I didn't copy the name during the webco and the docs have not been updated yet.
I merged my branch into `continuous-integration/run-everything-over-the-weekend`. Ready for takeoff!
Apparently, there has been no run on Friday?!
> Apparently, there has been no run on Friday?!
May I start it today? @gnn
I would find that great, yes!
gnn told me that he started a clean-run on Friday. But I didn't check the results yet.
> gnn told me that he started a clean-run on Friday. But I didn't check the results yet.
Ah, I'm just seeing he didn't use the image we used before but created a new one. But I don't know which HTTP port it's listening on.. :( @gnn?
Got it, it's port 9001 (do you know how to reconfigure the tunnel, @AmeliaNadal?).
Apparently, it crashed quite early at the tasks `osmtgmod.import-osm-data` and `electricity_demand.temporal.insert-cts-load` :disappointed:.
It's very likely that the first one is caused by insufficient disk space, as there are only 140G free (after cleaning up temp files), and that might not be sufficient for the temp tables created by osmTGmod. So I propose to delete my old setup we used before and re-run the new one. Shall I do so? Any objections, @IlkaCu @AmeliaNadal?
I could access the results (thanks for asking @nesnoj!) and my tasks haven't run. So I have no objection to you re-running the workflow ;)
Done.
Update: `osmtgmod.import-osm-data` has been run successfully :D
I'm done on the server and happy, go ahead @IlkaCu
@IlkaCu and I decided to restart the weekend run tonight. I merged `dev` into `continuous-integration/run-everything-over-the-weekend` and I'm now done with all my stuff ... please go ahead @IlkaCu
I merged one bug fix into `continuous-integration/run-everything-over-the-weekend`.
I merged another bug fix: ee038e40d7a0ce7b7b11cb50b51d8757f65337d9 @nesnoj: I hope this works now.
Yepp, looks good :+1: Run started :runner:
Great, thank you.
If I see it right, the server run in normal mode has been successful. :partying_face:
Which means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole `continuous-integration` branch into dev (I guess gnn would like this option)?
> If I see it right, the server run in normal mode has been successful. :partying_face:
Awesome!
> Which means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole `continuous-integration` branch into dev (I guess gnn would like this option)?
Generally I'm fine with both options, but I guess there might be some additional checks necessary (at least in #260) before it can get merged into dev. I reckon there will be some more commits in the branches, so separate merging via PRs seems cleaner to me.
A task of mine failed due to some column name adjustments in 5b7d9f2dc22f989292de7e979c421b2fae3914a3. I had to clear some stuff; they're re-running now..
I see that I missed an open question last week. Sorry for that.
> Which means we are now able to merge the different features and bug fixes into dev via PRs. Or could it be an option to merge the whole `continuous-integration` branch into dev (I guess gnn would like this option)?
Since the CI branch might contain changes which are working but not yet meant to be merged into dev, I'm in favour of merging tested feature branches into dev individually. This also makes it easier to figure out where a change came from, which is important when trying to fix bugs which are discovered later on. Hence my :+1: to @nesnoj's comment. :)
For anybody running into the issue of having to resolve the same conflicts multiple times because of this, have a look at git's `rerere.enabled` option, which makes git automatically reuse known conflict resolutions. You can switch on that option via `git config --global rerere.enabled true` for all your repositories, or via `git config --local rerere.enabled true` inside a repository if you only want to switch it on for that particular repository.
That's exactly what has been most annoying when keeping track of two branches. Thx for the hint! :pray:
BTW @IlkaCu: some of "your" tasks failed in the current run. Also, we get a `No space left on device` in task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid` for some reason, but there are 300 GB free :monocle_face:
> Also, we get a `No space left on device` in task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid` for some reason, but there are 300 GB free :monocle_face:
This seems to be caused by tasks running in parallel: `electricity_demand.temporal.insert-cts-load` was running at the same time and might have caused this problem. I cleared the task `power_plants.pv_rooftop.pv-rooftop-per-mv-grid`, and it was successful when no other task was running.
I don't really know how to solve this properly. Maybe #269 could help, because each task currently uses 8 threads.
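One plausible explanation (an assumption, not confirmed here) is that the `No space left on device` refers to the database's dynamic shared memory (`/dev/shm`) being exhausted by parallel query workers when several heavy queries run at once. A minimal sketch of a per-session mitigation, with a placeholder connection string and query:

```python
# Hedged sketch: lower PostgreSQL's per-session query parallelism before
# running a heavy query, assuming the DiskFull error comes from parallel
# workers filling /dev/shm. Connection string and query are placeholders,
# not the actual pipeline code.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/egon-data")

with engine.begin() as connection:
    # This setting only affects the current session, not the whole cluster.
    connection.execute(text("SET max_parallel_workers_per_gather = 2"))
    connection.execute(text("SELECT 1"))  # placeholder for the heavy query
```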
I'd like to test another feature on the server, is that ok with you? (please comment via :+1: or :-1:)
Done, restarted.
@AmeliaNadal I forgot the name of the task you were talking about this morning (the one not executed). Is it `industrial_gas_demand.insert-industrial-gas-demand`? If so, it's now back in the pipeline as I merged another PR which included the current `dev` ...
Yes, it was this one! Nice!
Hmm, the task `import-zensus-misc` has been running for 46h now, that's odd...
Yesterday I started a DE run on a fresh new RLI server (yay!) on the same branch (weekend) and it took 3h to complete. I'm curious whether we encounter the same problem with the new run tonight..
Two tasks failed in the latest run over the weekend:

- `industrial_sites.download-import-industrial-sites`:
  FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'
  @IlkaCu: Can you help?
- `power_plants.pv_rooftop.pv-rooftop-per-mv-grid`:
  sqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not resize shared memory segment "/PostgreSQL.241700255" to 8388608 bytes: No space left on device
  This error already occurred in the run before. I was hoping that reducing the number of cores per task would solve it, but it did not.

`osm_buildings_streets.extract-buildings` is still running (started 2021-09-10T22:37:37+00:00).
> `osm_buildings_streets.extract-buildings` is still running (started 2021-09-10T22:37:37+00:00).
Yes, I'll take care..
> `industrial_sites.download-import-industrial-sites`
> FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'
> @IlkaCu: Can you help?
I'll take care of this.
> FileNotFoundError: [Errno 2] No such file or directory: 'data_bundle_egon_data/industrial_sites/MA_Schmidt_Industriestandorte_georef.csv'
This error was caused by a missing dependency. The task and the ones depending on it are now running on the server.
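A hedged illustration of what a "missing dependency" typically means in an Airflow pipeline like this one: the task importing the industrial sites has to be wired after the task that downloads the data bundle containing the CSV. The task names and operators below are made up for the example (Airflow 2-style imports assumed), not the actual eGon-data code.

```python
# Hypothetical sketch: ensure the data bundle download runs before the
# industrial sites import, so the CSV exists when it is read.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def noop():
    """Placeholder callable standing in for the real download/import logic."""


with DAG("example-pipeline", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    download_data_bundle = PythonOperator(
        task_id="download-data-bundle", python_callable=noop
    )
    import_industrial_sites = PythonOperator(
        task_id="download-import-industrial-sites", python_callable=noop
    )

    # The CSV lives in the data bundle, so the import must wait for the download.
    download_data_bundle >> import_industrial_sites
```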
I already fixed a bug in the `renewable_feedin` script and will try to clear that task right now.
Two tasks (`power_plants` and `chp_plants`) failed in the latest run because of the renaming from `subst_id` to `bus_id`. I already fixed this and cleared the tasks.
But `power_plants.allocate-conventional-non-chp-power-plants` failed again. It looks like there is an empty geodataframe:
[2021-10-04 08:52:00,854] {logging_mixin.py:120} INFO - Running <TaskInstance: egon-data-processing-pipeline.power_plants.allocate-conventional-non-chp-power-plants 2021-10-01T21:43:54+00:00 [running]> on host at32
[2021-10-04 08:52:01,475] {taskinstance.py:1150} ERROR - Cannot transform naive geometries. Please set a crs on the object first.
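For context, the error above is what GeoPandas raises when `to_crs` is called on a GeoDataFrame whose CRS was never set, which easily happens when a selection comes back empty. A minimal sketch of the kind of guard that avoids it; the column names, coordinates, and EPSG codes are illustrative, not the actual pipeline code.

```python
# Hedged sketch: guard against reprojecting a GeoDataFrame that is empty or
# has no CRS set. Names and EPSG codes are examples only.
import geopandas as gpd
from shapely.geometry import Point

# A GeoDataFrame built without a CRS, as can happen after filtering/joining.
gdf = gpd.GeoDataFrame({"bus_id": [1]}, geometry=[Point(10.0, 52.0)])

if gdf.empty:
    # An empty selection is a common reason the CRS never got set at all.
    print("Nothing to allocate in this step, skipping.")
else:
    if gdf.crs is None:
        # Declare the CRS the coordinates are actually stored in ...
        gdf = gdf.set_crs(epsg=4326)
    # ... before reprojecting to the CRS used for the allocation.
    gdf = gdf.to_crs(epsg=3035)
```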
I just realized that I forgot to start the weekend run on Friday. If everybody is OK with that, I'll start it on Monday morning. Saying nothing also counts as OK. If I don't get any vetoes until 11:00, I'll give it a go.
Did you start the run yet? I just merged #430 into `continuous-integration/run-everything-over-the-weekend`. Would be nice if you could run it over the weekend. @gnn
Just a (late) heads up: the code on the CI branch was buggy and didn't want to start. I only managed to fix it on Monday evening and started the run afterwards. Hope that doesn't collide with any other server activity.
It seems that the code on the CI branch is buggy again and the DAG wasn't executed during the weekend. The last commit on the CI branch was 8a96d0bc0641212d8df840de2aa3c53e6c3bb540 by @nesnoj. @nesnoj, @gnn: any ideas how to solve it? Who takes care of it?
It might be the case that there is no module `substation` in the `egon.data.processing` package, although it is imported in the pipeline:
Searching within the file structure, `substation` is only found in the `gas_areas` module of `egon.data.processing`, but neither as a function nor as a class, only within variable names.
Same issue with this import:
Thanks for finding that - this is my fault. I will fix it. Edit: I added two commits d5e0fb7754fa7f489519de60629ff21a7acdf5d2 and f028e94cc7b9229f7c7225e915382f06757af8d8
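For future reference, a quick way to check whether such imports resolve without starting Airflow. This is only a sketch; the module paths below are just the ones mentioned above, not a complete list of the pipeline's imports.

```python
# Hedged check: report which of the listed modules can actually be imported.
from importlib.util import find_spec

for module in (
    "egon.data.processing.substation",   # reported as missing above
    "egon.data.processing.gas_areas",    # mentions "substation" only in variable names
):
    try:
        found = find_spec(module) is not None
    except ModuleNotFoundError:
        # Raised if a parent package is not installed at all.
        found = False
    print(f"{module}: {'found' if found else 'MISSING'}")
```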
So did you restart the DAG?
> So did you restart the DAG?
Yes, a couple of minutes ago.
We should maybe think of a new start date for the weekend run. What do you think about Fridays at noon? We would then have the chance to fix bugs on Friday afternoon instead of on Monday or during the weekend.
Wasn't noon, but I started the run at ~14:30 today. But there weren't any errors, so nothing to fix. I also got basic checks working again, so after you push something onto the CI branch you can have a look under the GitHub Actions tab and check whether the checks pass or fail with the state you pushed. It might take some time though. You can also check that locally (with the CI branch checked out) by running `tox -e py38-nocov-linux` or `tox -e py37-nocov-linux`, depending on whether Python 3.8 or 3.7 is installed on your system. After the weekend, i.e. once I know that the checks don't interfere with anything, they'll go live on "dev", after which you can monitor the status of these checks on each individual PR and no PRs with failing checks will be allowed to be merged.
Three tasks failed in the latest run:
`re_potential_areas.download-datasets`:
...
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 502: Bad Gateway
I have never seen this problem before; after clearing the task it was successful, so it was probably a temporary problem with the connection.
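If this keeps happening, a simple retry around the download would paper over transient gateway errors like the 502 above. This is only a hedged sketch; the helper name, URL, and retry parameters are made up and not part of the pipeline.

```python
# Hypothetical retry wrapper around a flaky download.
import time
import urllib.error
import urllib.request


def download_with_retries(url, target, attempts=3, wait_seconds=30):
    """Retry a download a few times before giving up on transient HTTP errors."""
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, target)
            return
        except urllib.error.HTTPError as error:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({error}), retrying in {wait_seconds}s")
            time.sleep(wait_seconds)


# Example call (placeholder URL):
# download_with_retries("https://example.org/dataset.zip", "dataset.zip")
```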
`chp.extension-HH`:
...
File "/home/egondata/git-repos/friday-evening-weekend-run/environment/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: sorry, too many clients already
I think this is not related to the code. @gnn, do you have an idea how to solve or prevent this?
`heat_demand_timeseries.HTS.demand-profile-generator`:
...
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
selected_this_station[
[2021-11-28 06:26:27,032] {local_task_job.py:102} INFO - Task exited with return code Negsignal.SIGKILL
Looks like there was not enough memory, but the server has about 300 GB of RAM, so I'm not sure how this could have happened. I know that this task needs a lot of RAM, but it was running before. The electricity household profiles were created in parallel; maybe both tasks needed a lot of memory and caused this problem. I will try to clear the task again.
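As an aside, the pandas caveat quoted in the log (the link about returning a view versus a copy) is separate from the SIGKILL and points at chained assignment. A minimal, hedged illustration with made-up column names of the pattern that triggers the warning and the safer alternatives:

```python
# Hypothetical example of the pandas SettingWithCopyWarning pattern.
import pandas as pd

df = pd.DataFrame({"station": ["A", "A", "B"], "demand": [1.0, 2.0, 3.0]})

# Chained assignment: the selection may be a copy, so the write can be lost
# and pandas emits the warning referenced in the log.
selected = df[df["station"] == "A"]
selected["demand"] = 0.0

# Safer: work on an explicit copy, or write back through .loc on df itself.
selected = df[df["station"] == "A"].copy()
selected["demand"] = 0.0
df.loc[df["station"] == "A", "demand"] = 0.0
```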
> The electricity household profiles were created in parallel; maybe both tasks needed a lot of memory and caused this problem.
Our HH peak load task should use about 30..40 GB of RAM.
> `chp.extension-HH`: ... sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: sorry, too many clients already
>
> I think this is not related to the code. @gnn, do you have an idea how to solve or prevent this?
I've seen this with @AmeliaNadal's non-dockered database previously. The problem occurs when a task tries to connect to the database and the database already has too many open connections. (I also can't exclude the possibility of SQLAlchemy's connection pool being fully used, but I consider that option less likely.) It's hard to solve this once the problem occurs. In order to prevent it from happening, one should make sure that all SQLAlchemy `Session`s are properly closed after they are used. Raw connections should seldom be used and should obviously also be closed immediately after use.
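A minimal sketch of that pattern, assuming SQLAlchemy 1.4+ and a placeholder engine URL: using context managers so sessions and raw connections are always closed, even if a task raises.

```python
# Hedged sketch: sessions and connections are closed automatically at the end
# of each `with` block instead of lingering and exhausting the server's limit.
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/egon-data")

# ORM session: closed when the block ends.
with Session(engine) as session:
    session.execute(text("SELECT 1"))  # placeholder for the task's queries
    session.commit()

# Raw connection: also closed when the block ends.
with engine.connect() as connection:
    connection.execute(text("SELECT 1"))
```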
Unfortunately, three tasks in the current CI branch run have failed so far:
This issue is meant to coordinate the use of the
egondata
user/instance on our server in FL. We already agreed on starting a clean-run of the dev branch on every Friday. This will (most likely) make some debugging necessary on Mondays. To avoid conflicts while debugging, please comment in this issue before you start debugging and shortly note on which datasets/ parts of the workflow you will be working on.