openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Crashing under Gentoo 6.10.6 #16502

Open mauricev opened 2 weeks ago

mauricev commented 2 weeks ago

I have the most recent zfs-9999 build of ZFS (I assume this is a recent GitHub snapshot) installed on Gentoo x64 with gentoo-sources kernel 6.10.6. I have one mirrored zpool. I am seeing occasional kernel panics.

Screenshot 2024-08-30 at 2 16 48 PM

Screenshot 2024-09-03 at 7 10 35 PM

The pool passes scrub with no errors.

I'm not sure of any other way to document panics. I have another, similar, slightly older system running 6.10.5, and it has yet to crash.
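One possible way to capture panic text besides screenshots is the kernel's pstore filesystem, which can keep the last kernel messages across a reboot. A minimal sketch, assuming a pstore backend (for example efi-pstore or ramoops) is configured on this machine, which the thread does not confirm; the function name `save_pstore` and the destination path are hypothetical:

```shell
# Hypothetical helper: copy saved kernel crash records out of pstore.
# Usage: save_pstore DEST_DIR [PSTORE_DIR]
# PSTORE_DIR defaults to the usual mount point, /sys/fs/pstore.
save_pstore() {
    dest=$1
    src=${2:-/sys/fs/pstore}
    mkdir -p "$dest" || return 1
    found=0
    # Only the dmesg-* records hold the kernel log from the crash.
    for f in "$src"/dmesg-*; do
        [ -e "$f" ] || continue
        cp "$f" "$dest"/ && found=$((found + 1))
    done
    echo "saved $found pstore record(s) to $dest"
}
```

After a reboot following a panic, something like `save_pstore /root/panic-logs` (run as root) would copy any records found. If it reports 0 records, no backend stored anything.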

robn commented 2 weeks ago

@mauricev can you please update this with the information requested in the issue template? We need that info to begin triage.

mauricev commented 2 weeks ago

Distribution Name | Gentoo
Distribution Version | emerged in the last week
Kernel Version | 6.10.6-gentoo-x86_64
Architecture | x64
OpenZFS Version | zfs-2.2.99-529_g23a489a41 zfs-kmod-2.2.99-687_gb3b749161
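The two OpenZFS version strings reported here differ after the 2.2.99 prefix (529_g23a489a41 for userland, 687_gb3b749161 for the kernel module). A minimal sketch of a check for that, assuming the two-line output format of `zfs version`; the function name `zfs_versions_match` is hypothetical, and nothing in this thread establishes whether the mismatch is related to the panics:

```shell
# Hypothetical helper: compare userland and kernel-module version strings.
# Reads `zfs version`-style output on stdin: one line starting with
# "zfs-kmod-" and one line starting with "zfs-".
zfs_versions_match() {
    user="" kmod=""
    while read -r line; do
        case "$line" in
            # The kmod pattern must come first, since "zfs-*" also matches it.
            zfs-kmod-*) kmod=${line#zfs-kmod-} ;;
            zfs-*)      user=${line#zfs-} ;;
        esac
    done
    printf 'userland=%s kmod=%s\n' "$user" "$kmod"
    # Succeed only if both strings were found and are identical.
    [ -n "$user" ] && [ "$user" = "$kmod" ]
}
```

Typical use would be `zfs version | zfs_versions_match || echo "userland and kmod differ"`.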

robn commented 2 weeks ago

Are you doing something in particular at the time this happens? I see docker and overlay filesystems are in play; does this coincide with a particular docker activity? You say "occasional", so it would be nice to line this up with a particular action.

Did anything change on this system recently? You mention another system on 6.10.5, and a recent upgrade. Was this system running that older kernel before? Did this happen then? If you're able to downgrade your kernel, I'd be interested to see if anything changes there.

Are you able to try with a release OpenZFS (say 2.2.5)?

mauricev commented 2 weeks ago

This is a new installation replacing an older system. The new system has a new, third docker container, and when I build the image for it, this crash happens about 50% of the time. However, it happened again today, and that time it was unrelated to building a docker image. There is another nearly identical server running 6.10.5, but without docker. It has not crashed so far. Could a low-memory condition trigger ZFS to crash this way? I think I would have to revert the kernel version to install ZFS 2.2.5.
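On the low-memory question, one way to gather evidence is to log available memory and the ARC size while the build runs. A minimal sketch, assuming the standard Linux `/proc/meminfo` and the OpenZFS kstat file `/proc/spl/kstat/zfs/arcstats`; the function name `mem_snapshot` is hypothetical, and this only records data, it does not show whether low memory is the trigger:

```shell
# Hypothetical helper: print one line with available memory and ARC size.
# Usage: mem_snapshot [MEMINFO_FILE] [ARCSTATS_FILE]
# Both arguments default to the live Linux paths.
mem_snapshot() {
    meminfo=${1:-/proc/meminfo}
    arcstats=${2:-/proc/spl/kstat/zfs/arcstats}
    # /proc/meminfo lines look like "MemAvailable:  1234567 kB".
    avail_kb=$(awk '$1 == "MemAvailable:" { print $2 }' "$meminfo")
    # arcstats rows are "name type data"; the "size" row is the ARC size in bytes.
    arc_bytes=$(awk '$1 == "size" { print $3 }' "$arcstats")
    printf 'mem_available_kb=%s arc_size_bytes=%s\n' "$avail_kb" "$arc_bytes"
}
```

Running `while sleep 5; do mem_snapshot; done` in an ssh session from another machine during the docker build would leave the last readings visible on that machine even if the server panics.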

snajpa commented 2 weeks ago

Can you please share the build command/Dockerfile? 50% luck is good enough, I'll give it a try (need to make sure 6.10 runs well, we'd like to move onto it at vpsFree)

mauricev commented 2 weeks ago

Dockerfile

# Use Python 3.12-slim as the base image
FROM python:3.12-slim

# Set the working directory in the container
WORKDIR /app

# Install dependencies for OpenCV, libgthread, and CA certificates
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    cron \
    procps \
    curl \
    ca-certificates \
    && apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy the current directory contents into the container at /app
COPY . /app

# Upgrade pip and setuptools to ensure compatibility with SSL/TLS
RUN pip install --upgrade pip setuptools

# Install Python dependencies using pip and the requirements.txt file
RUN pip install --no-cache-dir --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-cache -r requirements.txt

# Create the uploadedImages directory if it doesn't exist
RUN mkdir -p /app/static/uploadedImages

# Expose the port that your Flask app will run on internally
EXPOSE 5000

# Define environment variable for Flask
ENV FLASK_APP=main.py

# Copy the start script into the container
COPY start.sh /app/start.sh

# Set the entry point to the start script
ENTRYPOINT ["/app/start.sh"]

snajpa commented 2 weeks ago

I had no luck, left that running in a loop overnight, no crash :( Can you describe how the pool and datasets are set up? What properties are set, etc.?
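For anyone else trying to reproduce, running the build in a loop could be sketched as follows. This is not the actual command used above; the function name `repro_loop` and the image tag `zfs-repro` are hypothetical:

```shell
# Hypothetical helper: run a command up to COUNT times, stopping at the
# first non-zero exit status.
# Usage: repro_loop COUNT COMMAND [ARGS...]
repro_loop() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i: $*"
        "$@" || { echo "failed on iteration $i"; return 1; }
        i=$((i + 1))
    done
    echo "completed $count iterations without failure"
}
```

For example, `repro_loop 100 docker build --no-cache -t zfs-repro .` forces every layer to be rebuilt each time. A kernel panic will not show up as a failed iteration; the loop simply stops, so the last "iteration N" line printed is the useful part.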

mauricev commented 2 weeks ago

  pool: spool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:38 with 0 errors on Fri Aug 30 12:14:01 2024
config:

    NAME        STATE     READ WRITE CKSUM
    spool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        vdb     ONLINE       0     0     0
        vdc     ONLINE       0     0     0

errors: No known data errors

NAME               USED  AVAIL  REFER  MOUNTPOINT
spool             7.13G  7.89G    96K  /spool
spool/docker      1.91G  7.89G  1.91G  /spool/docker
spool/kevin       93.4M  7.89G  93.4M  /var/www/einsteinmedneuroscience/kevin
spool/mysql-else   382M  7.89G   374M  /spool/mysql-else
spool/mysql-wp    1.97G  7.89G  1.61G  /spool/mysql-wp
spool/odes         269M  7.89G   269M  /spool/odes
spool/wp          2.50G  7.89G  2.49G  /var/www/localhost/htdocs/neuroscience

NAME   PROPERTY                       VALUE                          SOURCE
spool  size                           15.5G                          -
spool  capacity                       45%                            -
spool  altroot                        -                              default
spool  health                         ONLINE                         -
spool  guid                           7238931578768070567            -
spool  version                        -                              default
spool  bootfs                         -                              default
spool  delegation                     on                             default
spool  autoreplace                    off                            default
spool  cachefile                      -                              default
spool  failmode                       wait                           default
spool  listsnapshots                  off                            default
spool  autoexpand                     off                            default
spool  dedupratio                     1.00x                          -
spool  free                           8.37G                          -
spool  allocated                      7.13G                          -
spool  readonly                       off                            -
spool  ashift                         12                             local
spool  comment                        -                              default
spool  expandsize                     -                              -
spool  freeing                        0                              -
spool  fragmentation                  16%                            -
spool  leaked                         0                              -
spool  multihost                      off                            default
spool  checkpoint                     -                              -
spool  load_guid                      7771301092460470127            -
spool  autotrim                       off                            default
spool  compatibility                  off                            default
spool  bcloneused                     0                              -
spool  bclonesaved                    0                              -
spool  bcloneratio                    1.00x                          -
spool  feature@async_destroy          enabled                        local
spool  feature@empty_bpobj            active                         local
spool  feature@lz4_compress           active                         local
spool  feature@multi_vdev_crash_dump  enabled                        local
spool  feature@spacemap_histogram     active                         local
spool  feature@enabled_txg            active                         local
spool  feature@hole_birth             active                         local
spool  feature@extensible_dataset     active                         local
spool  feature@embedded_data          active                         local
spool  feature@bookmarks              enabled                        local
spool  feature@filesystem_limits      enabled                        local
spool  feature@large_blocks           enabled                        local
spool  feature@large_dnode            enabled                        local
spool  feature@sha512                 enabled                        local
spool  feature@skein                  enabled                        local
spool  feature@edonr                  enabled                        local
spool  feature@userobj_accounting     active                         local
spool  feature@encryption             enabled                        local
spool  feature@project_quota          active                         local
spool  feature@device_removal         enabled                        local
spool  feature@obsolete_counts        enabled                        local
spool  feature@zpool_checkpoint       enabled                        local
spool  feature@spacemap_v2            active                         local
spool  feature@allocation_classes     enabled                        local
spool  feature@resilver_defer         enabled                        local
spool  feature@bookmark_v2            enabled                        local
spool  feature@redaction_bookmarks    enabled                        local
spool  feature@redacted_datasets      enabled                        local
spool  feature@bookmark_written       enabled                        local
spool  feature@log_spacemap           active                         local
spool  feature@livelist               enabled                        local
spool  feature@device_rebuild         enabled                        local
spool  feature@zstd_compress          enabled                        local
spool  feature@draid                  enabled                        local
spool  feature@zilsaxattr             active                         local
spool  feature@head_errlog            active                         local
spool  feature@blake3                 enabled                        local
spool  feature@block_cloning          enabled                        local
spool  feature@vdev_zaps_v2           active                         local
spool  feature@redaction_list_spill   enabled                        local
spool  feature@raidz_expansion        enabled                        local

I ran the docker build twice and it crashed again once. When I run the same command on the staging server, it never crashes, but that server has only btrfs disks.
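The output above appears to come from `zpool status`, `zfs list`, and `zpool get all`. Since the question was also about dataset properties, a sketch that gathers all of this in one pass might look like the following; the function name `collect_pool_info` is hypothetical, and the extra `zfs get -r -s local all` call (properties set locally on each dataset) is a suggested addition, not something that was run in this thread:

```shell
# Hypothetical helper: print pool and dataset state for one pool.
# Usage: collect_pool_info POOL
collect_pool_info() {
    pool=$1
    echo "### zpool status $pool"
    zpool status "$pool"
    echo "### zfs list -r $pool"
    zfs list -r "$pool"
    echo "### zpool get all $pool"
    zpool get all "$pool"
    # Only properties explicitly set on each dataset, not inherited defaults.
    echo "### zfs get -r -s local all $pool"
    zfs get -r -s local all "$pool"
}
```

For the pool in this report, that would be `collect_pool_info spool > spool-info.txt`.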

snajpa commented 2 weeks ago

Maybe it's something in requirements.txt? start.sh was also missing; I worked around both by touching empty files. Perhaps something's happening with the Python stuff. Could you please supply those too?

mauricev commented 2 weeks ago

start.sh

#!/bin/bash

# Function to handle SIGTERM
trap 'kill -TERM $PID' TERM INT

# Start cron in the background
/usr/sbin/cron &

# Start gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 main:app &

# Capture PID of gunicorn
PID=$!

# Wait for gunicorn process
wait $PID

requirements.txt

numpy
pillow
opencv-python-headless
pandas
ultralytics
flask
gunicorn
torch==2.4.0

Come to think of it, the crash always occurs during the processing of requirements.

snajpa commented 2 weeks ago

I left it looping for 8 hours, nothing :(

snajpa commented 2 weeks ago

Uh, I enabled block cloning in the hope of reproducing this, thinking it could be related... only to see endless txg syncs with no data written whatsoever. OK, got it: I'm keeping that feature off and recreating my dev pool now :-D
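For comparison between systems, the `zpool get` output earlier in the thread shows `feature@block_cloning` as enabled on the reporter's pool with `bcloneused` at 0. A small sketch for checking the block cloning state on any pool, assuming the Linux module parameter file `/sys/module/zfs/parameters/zfs_bclone_enabled` is present on the OpenZFS version in use; the function name `bclone_state` is hypothetical:

```shell
# Hypothetical helper: report the pool feature state and the module switch.
# Usage: bclone_state POOL [PARAM_FILE]
bclone_state() {
    pool=$1
    param=${2:-/sys/module/zfs/parameters/zfs_bclone_enabled}
    # -H drops the header, -o value prints only the value column.
    feature=$(zpool get -H -o value feature@block_cloning "$pool")
    if [ -r "$param" ]; then
        enabled=$(cat "$param")
    else
        enabled=unknown
    fi
    printf 'pool=%s feature=%s zfs_bclone_enabled=%s\n' "$pool" "$feature" "$enabled"
}
```

Running `bclone_state spool` on the affected server and the same call on a test pool would show whether the two setups actually differ in this respect.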