metno / emep-ctm

Open Source EMEP/MSC-W model
GNU General Public License v3.0

STOP-ALL ERROR: Error in netcdf routine #108

Open AlexanderdeMeij opened 2 years ago

AlexanderdeMeij commented 2 years ago

Hello, I am contacting you about a problem I am experiencing with the EMEP model when using WRF meteorology. I previously contacted you by email, but I agree that it is better to use GitHub.

I am running the model (rv4_34) on a Docker system here at the JRC (Ispra), see issue #76.

With the IFS meteo, the EMEP model works fine, but when I use WRF meteo it gives me the following error: "Input/output error STOP-ALL ERROR: Error in netcdf routine".

This happens on a random day. When I relaunch the model it stops in, say, February. When I relaunch it again it stops in June (and the problem in February does not reappear), with the same error message. I have created new WRF meteo (different physics) at a different resolution to reduce the file size. Currently I am trying to run the model with WRF meteo files of about 2.0 GB.

I have no idea what the problem is. I am also in contact with Massimo Vieno, who told me to add this line in NetCDF_mod.f90:

     !send all data to me=0
     !outCDFtag=outCDFtag+1
     ! Alex:
     outCDFtag=mod(outCDFtag+1,100000) !!! MV 04/21 Peter fix for MPI_SEND tag error
     ! Alex
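
For illustration, the wrap-around in the changed line can be sketched in Python. This is not EMEP code: `next_tag` is a hypothetical helper, and the modulus 100000 is simply the value from the snippet above. The idea is that a tag which grows with every output call can eventually exceed the largest tag the MPI library accepts (the MPI standard only guarantees an upper bound of at least 32767; implementations may allow more), while a wrapped counter stays bounded.

```python
def next_tag(tag: int, modulus: int = 100000) -> int:
    """Return the next message tag, wrapping to 0 when `modulus` is reached.

    Mirrors the Fortran expression mod(outCDFtag+1, 100000) for
    non-negative tags. The helper name is hypothetical.
    """
    return (tag + 1) % modulus


if __name__ == "__main__":
    tag = 99998
    for _ in range(3):
        tag = next_tag(tag)
        print(tag)  # the counter passes 99999 and restarts at 0
```

Whether 100000 is below the tag limit of a given MPI installation is an assumption that would need checking against that installation.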

I still get the same error message.

As suggested by you I've tried the command ldd name_executable. This is what I get:

    linux-vdso.so.1 (0x00007ffd407ce000)
    libnetcdff.so.6 => not found
    libnetcdf.so.13 => not found
    libmpi_f90.so.1 => not found
    libmpi_f77.so.1 => not found
    libmpi.so.1 => not found
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
    libhwloc.so.5 => not found
    libgfortran.so.3 => not found
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c7d326000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1c7d30b000)
    libquadmath.so.0 => /lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f1c7d2c1000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1c7d29c000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c7d0aa000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1c7d498000)
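
The "not found" entries can also be pulled out of ldd output programmatically. This is a minimal Python sketch: `missing_libraries` and `SAMPLE` are hypothetical names, and the sample is only an excerpt of the listing above.

```python
def missing_libraries(ldd_output: str) -> list[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Excerpt of the ldd listing from this thread.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffd407ce000)
    libnetcdff.so.6 => not found
    libnetcdf.so.13 => not found
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
    libgfortran.so.3 => not found
"""

if __name__ == "__main__":
    for name in missing_libraries(SAMPLE):
        print(name)
```

Running ldd in the same environment (same container, same LD_LIBRARY_PATH) in which the model is launched is what makes the result meaningful.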

Also, this netCDF 4.6.2 has been built with the following features:

  --cc            -> mpicc
  --cflags        -> -I/usr/include 
  --libs          -> -L/usr/lib -lnetcdf

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> no
  --cxx4          -> 

  --has-fortran   -> yes
  --fc            -> /usr/bin/gfortran
  --fflags        -> -I/usr/local/include
  --flibs         -> -L/usr/local/lib -lnetcdff -L/usr/local/lib -lnetcdf -lnetcdf
  --has-f90       -> no
  --has-f03       -> yes

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> no
  --has-cdf5      -> yes
  --has-parallel4 -> yes
  --has-parallel  -> yes

  --prefix        -> /usr
  --includedir    -> /usr/include
  --libdir        -> /usr/lib
  --version       -> netCDF 4.6.2
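
In this output the C library flags point at /usr/lib while the Fortran flags point at /usr/local/lib. A small Python sketch can extract those directories from such a listing. The names `parse_nc_config`, `library_dirs` and `SAMPLE` are hypothetical, the sample is an excerpt of the output above, and whether the two prefixes have anything to do with the error is not established here.

```python
def parse_nc_config(text: str) -> dict[str, str]:
    """Parse lines of the form '--key -> value' into a dictionary."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("->")
        if sep and key.startswith("--"):
            settings[key.strip()[2:]] = value.strip()
    return settings


def library_dirs(flags: str) -> list[str]:
    """Return the distinct -L directories in a linker flag string, in order."""
    dirs = []
    for token in flags.split():
        if token.startswith("-L") and token[2:] not in dirs:
            dirs.append(token[2:])
    return dirs


# Excerpt of the configuration listing from this thread.
SAMPLE = """\
  --libs          -> -L/usr/lib -lnetcdf
  --flibs         -> -L/usr/local/lib -lnetcdff -L/usr/local/lib -lnetcdf -lnetcdf
  --has-parallel  -> yes
  --prefix        -> /usr
"""

if __name__ == "__main__":
    config = parse_nc_config(SAMPLE)
    print("C library dirs:      ", library_dirs(config["libs"]))
    print("Fortran library dirs:", library_dirs(config["flibs"]))
```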

Are we missing something when we compile the EMEP code?

Best regards, Alexander de Meij

gitpeterwind commented 2 years ago

Hi Alexander, it is difficult to tell from this information. Do you know where this happens: during a write or a read, and on which file? (You can start by looking at where the code stops in the log.) I could try with your meteorological data, but I doubt that they are the problem. If the code starts to run normally, it means it is properly compiled. There might still be a bug in some library (NetCDF/HDF5 does special things in a parallel environment). Are you able to try other versions of the compiler and/or the NetCDF library?

gitpeterwind commented 2 years ago

("Input/output error" is the error message from the NetCDF library and "STOP-ALL ERROR: Error in netcdf routine" is the error message from the emep model)
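
The wording "Input/output error" is also the operating system's usual text for the `EIO` error code. Assuming the netCDF library passes positive (system) status codes to the C `strerror` function, as its documentation describes, the message may originate below netCDF, in the file system or storage layer. The sketch below only shows the OS message and does not call netCDF; `system_error_text` is a hypothetical helper name.

```python
import errno
import os


def system_error_text(code: int) -> str:
    """Return the operating system's message for an errno value."""
    return os.strerror(code)


if __name__ == "__main__":
    # Print the numeric value of EIO and the platform's message for it.
    print(errno.EIO, system_error_text(errno.EIO))
```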

AlexanderdeMeij commented 2 years ago

Hi, it happens randomly when the model reads the WRF meteo file. One time it stops at day x and hour y; the next time I relaunch the model it stops somewhere else. First I created the WRF meteo myself, with different resolutions and physics options. That didn't help. Then Massimo Vieno was kind enough to provide me with his WRF meteo, but unfortunately I encountered the same problem. I attach the log file for your convenience: job2317_0_out.txt

gitpeterwind commented 2 years ago

Ah, this shows another error:

ERROR in WORDSPLIT : Problem at meteo_source:C:/eos/jeodpp/data/projects/IAM-SUL/meijaal/METEO_WRF_MV_2015/wrfout_d01_YYYY-MM-DD_00:00:00

Some compilers do not like the colon ":" in names. Could you try simply giving the meteo file a name without colons? But here the error happens right from the start. Didn't you say that it happens suddenly, at random?

Edit: actually, the error is "too many words". That is something else. This error comes from the colons in the meteo name. It is a weakness in the code, that it fails for those names, because the sites/sondes use the colons as separator for some internal name handling. It should still work as intended though (I think!). So this is not what is causing the main error.
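
To see why names with colons can produce "too many words", here is a small illustration. It is not the EMEP WORDSPLIT routine, only a plain split on ":" applied to the string from the log; `split_words` and `ENTRY` are hypothetical names.

```python
def split_words(text: str, separator: str = ":") -> list[str]:
    """Split `text` on `separator`, as a simple word splitter would."""
    return text.split(separator)


# The string reported in the WORDSPLIT error message in this thread.
ENTRY = ("meteo_source:C:/eos/jeodpp/data/projects/IAM-SUL/meijaal/"
         "METEO_WRF_MV_2015/wrfout_d01_YYYY-MM-DD_00:00:00")

if __name__ == "__main__":
    words = split_words(ENTRY)
    print(len(words), "words")
    for word in words:
        print(repr(word))
```

The string contains four colons and therefore splits into five pieces: the `C:` prefix and the `00:00:00` time stamp each add separators that a splitter expecting a single `key:value` pair would not anticipate.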

AlexanderdeMeij commented 2 years ago

It happens randomly. The log file I sent is an example of one of the many tests I did; this one is based on the meteo files provided by Massimo. With my meteo files it stopped, for example, on January 17th; the next time, after a relaunch, somewhere in June; then another time in October. I will try changing the colons in the wrfout file names anyway.

gitpeterwind commented 2 years ago

Actually I do not think that the colons will make a difference, otherwise it would not work at all.

avaldebe commented 2 years ago

As I wrote in my email, I have seen this kind of behaviour when there is not enough disk quota for the full run outputs. Can you confirm that you have enough disk quota for the full run?
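
A quick way to check free space from Python is sketched below. Note the assumptions: `has_free_space` is a hypothetical helper, the 50 GiB figure is arbitrary and not an EMEP requirement, and `shutil.disk_usage` reports the free space of the whole file system, not a per-user quota, which needs the site's own quota tool.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the file system holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 50 GiB is an illustrative threshold only.
    print(has_free_space(".", 50 * GIB))
```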

Another cause of this kind of erratic behaviour is a linking problem in the runtime environment. That is why I asked for the ldd output.

> As suggested by you I've tried the command ldd name_executable. This is what I get:
>
>     linux-vdso.so.1 (0x00007ffd407ce000)
>     libnetcdff.so.6 => not found
>     libnetcdf.so.13 => not found
>     libmpi_f90.so.1 => not found
>     libmpi_f77.so.1 => not found
>     libmpi.so.1 => not found
>     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
>     libhwloc.so.5 => not found
>     libgfortran.so.3 => not found
>     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c7d326000)
>     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1c7d30b000)
>     libquadmath.so.0 => /lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f1c7d2c1000)
>     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1c7d29c000)
>     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c7d0aa000)
>     /lib64/ld-linux-x86-64.so.2 (0x00007f1c7d498000)

Some of the libraries are not found at runtime, which puzzles me.

> I am running the model (rv4_34) on a Docker system here at the JRC (Ispra), see issue #76

Did you build your Docker image in two stages? If so, you might not have copied all the runtime dependencies from the first-stage image (build) to the second-stage image (runtime). Can I have a look at your Dockerfile?

lucamarletta commented 2 years ago

Hi Alvaro, here is the Dockerfile we used.

It is very old, and we have not updated it until now because it was working well.

We compiled the new EMEP executable inside a container started from this image, and I thought this was safe because it would link against the libraries present in the image.

FROM debian:jessie 
LABEL project="IAM-SUL" \
      author="Dario Rodriguez" \
      image_name="" \
      version="1.0" \
      released="2018-12-12" \
      software_versions="OpenMPI 3.0 NCDF 4 Fortran 95" \
      description="EMEP air quality model, for training and validation of the SHERPA simplified model"

ENV DEBIAN_FRONTEND=noninteractive
ENV TERM xterm
ENV DISPLAY :1.0
ENV LC_ALL C.UTF-8

RUN apt-get update && apt-get -yq install gcc gfortran g++ \
                      build-essential \
                      tar \
                      bzip2 \
                      m4 \
                      zlib1g-dev \
                      libopenmpi-dev \
                      curl \
                      wget

RUN apt-get install -y apt-utils
RUN apt-get install -y libnetcdf-dev

COPY packages/hdf5-1.10.3.tar.bz2 hdf5-1.10.3.tar.bz2
COPY packages/netcdf-c-4.6.2.tar.gz netcdf-c-4.6.2.tar.gz
#COPY packages/netcdf-4.3.3.1.tar.gz netcdf-4.3.3.1.tar.gz
#COPY packages/netcdf-4.3.2.tar.gz netcdf-4.3.2.tar.gz
#COPY packages/netcdf-cxx4-4.2.1.tar.gz netcdf-cxx4-4.2.1.tar.gz
COPY packages/netcdf-fortran-4.4.4.tar.gz netcdf-fortran-4.4.4.tar.gz

#Build HDF5
RUN tar xjvf hdf5-1.10.3.tar.bz2 && \
    cd hdf5-1.10.3 && \
    CC=mpicc ./configure --enable-parallel --prefix=/usr/local && \
    make -j4 && \
    make install && \
    cd .. && \
    rm -rf /hdf5-1.10.3 /hdf5-1.10.3.tar.bz2 

RUN apt-get install -y libcurl3 libcurl4-gnutls-dev 
#Build netcdf
RUN tar xzvf netcdf-c-4.6.2.tar.gz && \
    cd netcdf-c-4.6.2 && \
    ./configure --prefix=/usr \
                CC=mpicc \
                LDFLAGS=-L/usr/local/lib \
                CFLAGS=-I/usr/local/include && \
    make -j4 && \
    make install && \
    cd .. && \
    rm -rf netcdf-c-4.6.2 netcdf-c-4.6.2.tar.gz

#RUN tar xzvf netcdf-cxx4-4.2.1.tar.gz && \
#    cd netcdf-cxx4-4.2.1 && \
#    ./configure --prefix=/usr/local \ 
#                CC=mpicc \
#                LDFLAGS=-L/usr/local/lib \
#                CFLAGS=-I/usr/local/include && \
#    make check && make -j4 && \
#    make install && \
#    cd .. && \
#rm -rf netcdf-cxx4-4.2.1 netcdf-cxx4-4.2.1.tar.gz
ENV LD_LIBRARY_PATH /usr/local/lib
RUN tar xzvf netcdf-fortran-4.4.4.tar.gz && \
    cd netcdf-fortran-4.4.4 && \
    ./configure --prefix=/usr/local CC=/usr/bin/mpicc FC=/usr/bin/gfortran LDFLAGS=-L/usr/local/lib CFLAGS=-I/usr/local/include && \
    make && make install && \
    cd .. && \
    rm -rf netcdf-fortran-4.4.4 netcdf-fortran-4.4.4.tar.gz

##install apt utils and sudo
RUN apt-get install -y sudo 
RUN apt-get install -y openmpi-bin openmpi-common openssh-client openssh-server libopenmpi1.6 libopenmpi1.6-dbg 
ENV MY_HOME=/home/iamsulproc
RUN export uid=35727 gid=41068 \
    && mkdir -p ${MY_HOME} \
    && echo "iamsulproc:x:${uid}:${gid}:iamsulproc,,:${MY_HOME}:/bin/bash" >> /etc/passwd \
    && echo "iamsulproc:x:${uid}:" >> /etc/group \
    && chown ${uid}:${gid} -R ${MY_HOME} \
    && echo "iamsulproc ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/iamsulproc \
    && chmod 0440 /etc/sudoers.d/iamsulproc
RUN apt-get install -y --no-install-recommends \
    vim \
    python-crypto \
    python-dateutil \
    python-dev \
    python-lxml \
    python-numpy \
    python-openssl \
    python-pip \
    python-psycopg2 \
    python-scipy \
    python-urllib3 \
    python-colorama \
    python-distlib \
    python-html5lib \
    python-pkg-resources \
    python-requests \
    python-scipy \
    python-setuptools \
    python-six \
    python-wheel \
    python-pip-whl \
    swig 
RUN apt-get install -y unzip
USER iamsulproc
ENV HOME /home/iamsulproc
CMD /bin/bash

Which procedure do you suggest?

avaldebe commented 2 years ago

> We compiled the new EMEP executable inside a container started from this image, and I thought this was safe because it would link against the libraries present in the image.

Is this the base image for both compiling and running the EMEP model? Or are you using Docker only to compile the model?

lucamarletta commented 2 years ago

The last build, the one that shows this unstable behaviour, was compiled inside a container started from this image, with the EMEP code in a mounted external folder.

So the libraries should have been taken from the image itself.

avaldebe commented 2 years ago

> The last build, the one that shows this unstable behaviour, was compiled inside a container started from this image, with the EMEP code in a mounted external folder.
>
> So the libraries should have been taken from the image itself.

If I understand your answer correctly, you are compiling the model inside a Docker container and running it outside the container.

The Makefile included with the source code is set up for dynamic linking. Therefore, the emepctm binary only includes references to the NetCDF and MPI libraries. These references were resolved at compilation time (inside the container). However, some libraries were not found at runtime (outside the container), as shown in your ldd output:

  libnetcdff.so.6 => not found
  libnetcdf.so.13 => not found
  libmpi_f90.so.1 => not found
  libmpi_f77.so.1 => not found
  libhwloc.so.5 => not found
  libgfortran.so.3 => not found
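
One way to check where such libraries live in a given environment is to look for the file names in a list of directories. The Python sketch below is a simplification: `locate_libraries` is a hypothetical helper, the directories are assumptions (LD_LIBRARY_PATH plus two common prefixes), and the real dynamic loader also consults its cache and any run path stored in the binary.

```python
import os
from typing import Optional


def locate_libraries(names: list[str], search_dirs: list[str]) -> dict[str, Optional[str]]:
    """Map each library file name to the first directory containing it, or None."""
    found: dict[str, Optional[str]] = {}
    for name in names:
        found[name] = next(
            (d for d in search_dirs if os.path.isfile(os.path.join(d, name))),
            None,
        )
    return found


if __name__ == "__main__":
    wanted = ["libnetcdff.so.6", "libnetcdf.so.13", "libgfortran.so.3"]
    # Assumed search directories, not the loader's full search order.
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    dirs += ["/usr/local/lib", "/usr/lib"]
    for name, directory in locate_libraries(wanted, dirs).items():
        print(name, "->", directory or "not found in the listed directories")
```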

lucamarletta commented 2 years ago

Sorry, I explained myself badly.

We compiled the model inside a container and ran it there as well.

The code is physically stored outside the container, in a mounted folder. So it is inside the container at compile time.

But what you point out in the ldd result puzzles me. I really don't understand it.

avaldebe commented 2 years ago

> We compiled the model inside a container and ran it there as well.
>
> The code is physically stored outside the container, in a mounted folder. So it is inside the container at compile time.

OK, now it is much clearer. Can I see the Dockerfile for the model compilation and runtime? Or, more specifically, are you using a multi-stage build/runtime image?

The attached file is an example multi-stage build for the EMEP MSC-W model from 2 years ago. The image for the build stage has the compiler and all the packages needed to compile the model and the required libraries. The runtime image only has the model executable and the library shared objects required to run the model.

Dockerfile.txt
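
For a runtime image of the kind described above, the shared objects to carry over from the build stage can be listed from ldd output taken in the build stage. This is a hedged Python sketch, not a description of the attached Dockerfile: `resolved_library_paths` and `SAMPLE` are hypothetical names, and the sample lines are taken from the listing earlier in this thread.

```python
def resolved_library_paths(ldd_output: str) -> list[str]:
    """Return the absolute paths that ldd resolved, in order of appearance."""
    paths = []
    for line in ldd_output.splitlines():
        _, sep, target = line.strip().partition(" => ")
        if not sep:
            continue  # lines without '=>', e.g. linux-vdso or the loader itself
        path = target.split(" (")[0].strip()
        if path.startswith("/") and path not in paths:
            paths.append(path)
    return paths


# Excerpt of the ldd listing from this thread.
SAMPLE = """\
    linux-vdso.so.1 (0x00007ffd407ce000)
    libnetcdff.so.6 => not found
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1c7d475000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1c7d326000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1c7d498000)
"""

if __name__ == "__main__":
    for path in resolved_library_paths(SAMPLE):
        print(path)
```

Entries reported as "not found" are skipped, so an empty or short list is itself a hint that the build environment cannot resolve the libraries either.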