Open AlexanderdeMeij opened 2 years ago
Hi Alexander, It is difficult to tell from this information. Do you know where this happens? During write or read? Which file? (You can start by looking at where the code stops in the log.) I could try with your metdata, but I doubt that it is the problem. If the code starts to run normally, it means it is properly compiled. Still, there might be a bug in some library (NetCDF/HDF5 does special things in a parallel environment). Do you have the possibility to try other versions of the compiler and/or the NetCDF library?
("Input/output error" is the error message from the NetCDF library and "STOP-ALL ERROR: Error in netcdf routine" is the error message from the emep model)
Hi, It happens randomly when the model reads in the WRF meteo file. One time it stops at day x and hour y. The next time I relaunch the model, it stops somewhere else. First I created the WRF meteo myself, with different resolutions and physics options. That didn't help. Then Massimo Vieno was kind enough to provide me with his WRF meteo, but unfortunately I encountered the same problem. I attach the log file for your convenience. job2317_0_out.txt
Ah, this shows another error:
ERROR in WORDSPLIT : Problem at meteo_source:C:/eos/jeodpp/data/projects/IAM-SUL/meijaal/METEO_WRF_MV_2015/wrfout_d01_YYYY-MM-DD_00:00:00
Some compilers do not like the colon ":" in names. Could you try to simply give the meteo file a name without them?
But here the error happens right from the start. Didn't you say that it happens suddenly at random?
Edit: actually, the error is "too many words". That is something else. This error comes from the colons in the meteo name. It is a weakness in the code, that it fails for those names, because the sites/sondes use the colons as separator for some internal name handling. It should still work as intended though (I think!). So this is not what is causing the main error.
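To test the colon suggestion without renaming the original files, one option is to create colon-free symbolic links in a separate directory and point the meteo path there. The following is only a minimal sketch: the function names and the choice of "_" as the replacement character are assumptions, and the meteo file name template in the model configuration would have to be changed to match.

```python
import os

def colon_free_name(filename):
    """Replace ':' with '_' in a file name, leaving the rest unchanged."""
    return filename.replace(":", "_")

def link_without_colons(src_dir, dst_dir):
    """Create colon-free symlinks in dst_dir for every file in src_dir
    whose name contains ':'. Returns the list of link paths."""
    os.makedirs(dst_dir, exist_ok=True)
    links = []
    for name in sorted(os.listdir(src_dir)):
        if ":" not in name:
            continue  # nothing to rename
        link = os.path.join(dst_dir, colon_free_name(name))
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(os.path.join(src_dir, name)), link)
        links.append(link)
    return links
```

With this sketch, a file such as wrfout_d01_2015-01-17_00:00:00 would be reachable as wrfout_d01_2015-01-17_00_00_00 in the link directory.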
It happens randomly. The log file I sent is an example of one of the many tests I did; this one is based on the meteo files provided by Massimo. With my meteo files it stopped, for example, on January 17th, and the next time, after a relaunch, somewhere in June, then another time in October. I will try changing the colons in the WRF output file names anyway.
Actually I do not think that the colons will make a difference, otherwise it would not work at all.
As I wrote in my email, I have seen this kind of behaviour when there is not enough disk quota for the full run outputs. Can you confirm that you have enough disk quota for the full run?
Another cause of this kind of erratic behaviour is a linking problem in the runtime environment. That is why I asked for the ldd output.
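On the disk-space question, a quick first check is to compare the free space on the output filesystem with a rough estimate of the full run's output size. This is a sketch only: `has_room` is a hypothetical helper, and `shutil.disk_usage` reports free space for the whole filesystem, not a per-user quota, so a quota limit still has to be checked with the site's own tools.

```python
import shutil

def has_room(path, needed_bytes):
    """True if the filesystem holding `path` reports at least
    `needed_bytes` of free space (this ignores per-user quotas)."""
    return shutil.disk_usage(path).free >= needed_bytes

# Example: is there room for an assumed 50 GB of output in the current directory?
print(has_room(".", 50 * 1024**3))
```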
As suggested by you I've tried the command ldd name_executable. This is what I get:
linux-vdso.so.1 (0x00007ffd407ce000)
libnetcdff.so.6 => not found
libnetcdf.so.13 => not found
libmpi_f90.so.1 => not found
libmpi_f77.so.1 => not found
libmpi.so.1 => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
libhwloc.so.5 => not found
libgfortran.so.3 => not found
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c7d326000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1c7d30b000)
libquadmath.so.0 => /lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f1c7d2c1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1c7d29c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c7d0aa000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1c7d498000)
Some of the libraries are not found at runtime, which puzzles me.
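To make this check repeatable in each environment where the executable is launched, the unresolved entries can be pulled out of the ldd output automatically. This is a minimal sketch; the function names are hypothetical, and it relies only on the "=> not found" marker that appears in the output above.

```python
import re
import subprocess

def missing_libraries(ldd_output):
    """Return the sonames that ldd reports as '=> not found'."""
    return re.findall(r"(\S+)\s*=>\s*not found", ldd_output)

def ldd_missing(executable):
    """Run ldd on an executable and return its unresolved libraries."""
    result = subprocess.run(["ldd", executable], capture_output=True, text=True)
    return missing_libraries(result.stdout)

sample = """\
linux-vdso.so.1 (0x00007ffd407ce000)
libnetcdff.so.6 => not found
libnetcdf.so.13 => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
"""
print(missing_libraries(sample))
```

Running `ldd_missing` on the executable in the same shell or container that launches the model shows what the loader sees there. An empty list would mean every shared library resolves in that environment.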
I am running the model (rv4_34) on a docker system here at the JRC (Ispra), see issue #76
Did you build your Docker image in two stages? If so, you might not have copied all the runtime dependencies from the first-stage image (build) to the second-stage image (runtime). Can I have a look at your Dockerfile?
Hi Alvaro, I post here the Dockerfile used.
It's very old, and we have not updated it until now because it was working well.
We compiled the new EMEP executable inside a container started from this image. I thought this was safe because it would link against the libraries present in the image.
FROM debian:jessie
LABEL project="IAM-SUL" \
author="Dario Rodriguez" \
image_name="" \
version="1.0" \
released="2018-12-12" \
software_versions="OpenMPI 3.0 NCDF 4 Fortran 95" \
description="EMEP air quality model, for training and validation of the SHERPA simplified model"
ENV DEBIAN_FRONTEND=noninteractive
ENV TERM xterm
ENV DISPLAY :1.0
ENV LC_ALL C.UTF-8
RUN apt-get update && apt-get -yq install gcc gfortran g++\
build-essential \
tar \
bzip2 \
m4 \
zlib1g-dev \
libopenmpi-dev \
curl \
wget
RUN apt-get install -y apt-utils
RUN apt-get install -y libnetcdf-dev
COPY packages/hdf5-1.10.3.tar.bz2 hdf5-1.10.3.tar.bz2
COPY packages/netcdf-c-4.6.2.tar.gz netcdf-c-4.6.2.tar.gz
#COPY packages/netcdf-4.3.3.1.tar.gz netcdf-4.3.3.1.tar.gz
#COPY packages/netcdf-4.3.2.tar.gz netcdf-4.3.2.tar.gz
#COPY packages/netcdf-cxx4-4.2.1.tar.gz netcdf-cxx4-4.2.1.tar.gz
COPY packages/netcdf-fortran-4.4.4.tar.gz netcdf-fortran-4.4.4.tar.gz
#Build HDF5
RUN tar xjvf hdf5-1.10.3.tar.bz2 && \
cd hdf5-1.10.3 && \
CC=mpicc ./configure --enable-parallel --prefix=/usr/local && \
make -j4 && \
make install && \
cd .. && \
rm -rf /hdf5-1.10.3 /hdf5-1.10.3.tar.bz2
RUN apt-get install -y libcurl3 libcurl4-gnutls-dev
#Build netcdf
RUN tar xzvf netcdf-c-4.6.2.tar.gz && \
cd netcdf-c-4.6.2 && \
./configure --prefix=/usr \
CC=mpicc \
LDFLAGS=-L/usr/local/lib \
CFLAGS=-I/usr/local/include && \
make -j4 && \
make install && \
cd .. && \
rm -rf netcdf-c-4.6.2 netcdf-c-4.6.2.tar.gz
#RUN tar xzvf netcdf-cxx4-4.2.1.tar.gz && \
# cd netcdf-cxx4-4.2.1 && \
# ./configure --prefix=/usr/local \
# CC=mpicc \
# LDFLAGS=-L/usr/local/lib \
# CFLAGS=-I/usr/local/include && \
# make check && make -j4 && \
# make install && \
# cd .. && \
#rm -rf netcdf-cxx4-4.2.1 netcdf-cxx4-4.2.1.tar.gz
ENV LD_LIBRARY_PATH /usr/local/lib
RUN tar xzvf netcdf-fortran-4.4.4.tar.gz && \
cd netcdf-fortran-4.4.4 && \
./configure --prefix=/usr/local CC=/usr/bin/mpicc FC=/usr/bin/gfortran LDFLAGS=-L/usr/local/lib CFLAGS=-I/usr/local/include && \
make && make install && \
cd .. && \
rm -rf netcdf-fortran-4.4.4 netcdf-fortran-4.4.4.tar.gz
##install apt utils and sudo
RUN apt-get install -y sudo
RUN apt-get install -y openmpi-bin openmpi-common openssh-client openssh-server libopenmpi1.6 libopenmpi1.6-dbg
ENV MY_HOME=/home/iamsulproc
RUN export uid=35727 gid=41068 \
&& mkdir -p ${MY_HOME} \
&& echo "iamsulproc:x:${uid}:${gid}:iamsulproc,,:${MY_HOME}:/bin/bash" >> /etc/passwd \
&& echo "iamsulproc:x:${uid}:" >> /etc/group \
&& chown ${uid}:${gid} -R ${MY_HOME} \
&& echo "iamsulproc ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/iamsulproc \
&& chmod 0440 /etc/sudoers.d/iamsulproc
RUN apt-get install -y --no-install-recommends \
vim \
python-crypto \
python-dateutil \
python-dev \
python-lxml \
python-numpy \
python-openssl \
python-pip \
python-psycopg2 \
python-scipy \
python-urllib3 \
python-colorama \
python-distlib \
python-html5lib \
python-pkg-resources \
python-requests \
python-scipy \
python-setuptools \
python-six \
python-wheel \
python-pip-whl \
swig
RUN apt-get install -y unzip
USER iamsulproc
ENV HOME /home/iamsulproc
CMD /bin/bash
Which procedure do you suggest?
So this is the base image for both compiling and running the EMEP model? Or are you using Docker only to compile the model?
The last time, for the build that shows this unstable behaviour, we compiled the model inside a container started from this image, with the EMEP code in an externally mounted folder.
So the libraries should have been taken from the image itself.
If I understand your answer correctly, you are compiling the model inside a Docker container and running it outside the container.
The Makefile included with the source code is set up for dynamic linking.
Therefore, the emepctm binary only includes references to the NetCDF and MPI libraries.
These references are resolved at compilation time (inside the container).
However, some libraries were not found at runtime (outside the container), as shown in your ldd output:
libnetcdff.so.6 => not found
libnetcdf.so.13 => not found
libmpi_f90.so.1 => not found
libmpi_f77.so.1 => not found
libhwloc.so.5 => not found
libgfortran.so.3 => not found
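One way to narrow this down is to check, in the environment where the model is launched, whether files with these sonames exist in the directories the Dockerfile installs into (/usr/local/lib for HDF5 and netcdf-fortran, /usr for netcdf-c). The sketch below is an illustration only: the helper name is hypothetical, and the Debian multiarch directory in the search list is an assumption.

```python
import os

def locate_sonames(sonames, search_dirs):
    """Map each soname to the first directory in search_dirs that
    contains a file with that name, or None if no directory does."""
    return {
        soname: next(
            (d for d in search_dirs if os.path.exists(os.path.join(d, soname))),
            None,
        )
        for soname in sonames
    }

# LD_LIBRARY_PATH entries first, then the install prefixes from the Dockerfile
# and (as an assumption) Debian's multiarch library directory.
search_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
search_dirs += ["/usr/local/lib", "/usr/lib", "/usr/lib/x86_64-linux-gnu"]

wanted = ["libnetcdff.so.6", "libnetcdf.so.13", "libmpi.so.1", "libgfortran.so.3"]
for soname, where in locate_sonames(wanted, search_dirs).items():
    print(soname, "->", where or "not present in the searched directories")
```

If a library is present on disk but ldd still reports it as not found, the loader search path (for example LD_LIBRARY_PATH) in that shell is the likely difference. If it is absent, the executable is probably being inspected in a different environment from the one it was built in.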
Sorry, I explained myself badly.
We compiled the model inside a container and we also run it there.
Only the code physically sits outside the container, in a mounted folder. So at compile time everything is inside the container.
But what you point out in the ldd result puzzles me. I really don't understand it.
OK, now it is much clearer. Can I see the Dockerfile for the model compilation and runtime? Or, more specifically, are you using a multi-stage build/runtime image?
The attached file is an example multi stage build for the EMEP MSC-W model from 2 years ago. The image for the build stage has the compiler and all the packages needed to compile the model and required libraries. The runtime image only has the model executable and the library shared objects required to run the model.
Dear all, I am contacting you about a problem I am experiencing with the EMEP model using WRF meteorology. I previously contacted you via email, but I agree that it is better to contact you via GitHub.
I am running the model (rv4_34) on a docker system here at the JRC (Ispra), see issue #76
With the IFS meteo, the EMEP model works fine, but when I used WRF meteo it gives me the following error: "Input/output error STOP-ALL ERROR: Error in netcdf routine".
This happens on a random day. When I relaunch the model, it stops in, let's say, February. When I relaunch it again, it stops in June (and the problem in February no longer appears), with the same error message. I've created new WRF meteo with different physics and a different resolution to reduce the file size. Currently I am trying to run the model with WRF meteo files of ~2.0 GB.
I have no idea what the problem is. I am also in contact with Massimo Vieno, and he told me to add this line in NetCDF_mod.f90
Still I get the same error message.
As suggested by you I've tried the command ldd name_executable. This is what I get:
Also, this netCDF 4.6.2 has been built with the following features:
Do we miss something when we compile the EMEP code?
Best regards, Alexander de Meij