mauricev opened this issue 2 weeks ago
@mauricev can you please update this with the information requested in the issue template. We need that info to begin triage.
Type | Version/Name
--- | ---
Distribution Name | Gentoo
Distribution Version | emerged in the last week
Kernel Version | 6.10.6-gentoo-x86_64
Architecture | x64
OpenZFS Version | zfs-2.2.99-529_g23a489a41 zfs-kmod-2.2.99-687_gb3b749161
Are you doing something in particular at the time this happens? I see docker and overlay filesystems are in play; does this coincide with a particular docker activity? You say "occasional", so it would be nice to line this up with a particular action.
Did anything change on this system recently? You mention another system on 6.10.5, and a recent upgrade. Was this system running that older kernel before? Did this happen then? If you're able to downgrade your kernel, I'd be interested to see if anything changes there.
Are you able to try with a release OpenZFS (say 2.2.5)?
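On Gentoo that would be something along these lines (a sketch; the exact atoms and accepted keywords may need adjusting for your setup):

```sh
# sketch: mask the live (-9999) ebuilds so portage picks the 2.2.5 release instead
echo "=sys-fs/zfs-9999"      >> /etc/portage/package.mask
echo "=sys-fs/zfs-kmod-9999" >> /etc/portage/package.mask
emerge --ask "=sys-fs/zfs-2.2.5" "=sys-fs/zfs-kmod-2.2.5"
```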
This is a new installation replacing an older system. The new system adds a new, third Docker container, and when I build the image for it, this crash happens about 50% of the time. However, it just happened again today, unrelated to building a Docker image. There is another nearly identical server running 6.10.5, but without Docker; it has not crashed so far. Could a low-memory condition trigger ZFS to crash this way? I think I would have to revert the kernel version to install ZFS 2.2.5.
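If low memory is a suspicion, one way to check is to watch ARC size against free memory while the build runs; a rough sketch using the standard kstat path:

```sh
# sketch: sample ARC size and free memory every few seconds during the docker build
while sleep 5; do
  date
  awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
  free -m | head -2
done
```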
Can you please share the build command/Dockerfile? A 50% hit rate is good enough; I'll give it a try (I need to make sure 6.10 runs well anyway, as we'd like to move to it at vpsFree).
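For reference, "giving it a try" here just means rebuilding with the layer cache disabled in a loop, something like this (the image tag is arbitrary):

```sh
# sketch: rebuild the image repeatedly without layer caching so the pip step runs every time
while true; do
  docker build --no-cache -t zfs-panic-repro . || break
done
```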
# Use Python 3.12-slim as the base image
FROM python:3.12-slim
# Set the working directory in the container
WORKDIR /app
# Install dependencies for OpenCV, libgthread, and CA certificates
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx \
libglib2.0-0 \
cron \
procps \
curl \
ca-certificates \
&& apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Copy the current directory contents into the container at /app
COPY . /app
# Upgrade pip and setuptools to ensure compatibility with SSL/TLS
RUN pip install --upgrade pip setuptools
# Install Python dependencies using pip and the requirements.txt file
RUN pip install --no-cache-dir --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-cache -r requirements.txt
# Create the uploadedImages directory if it doesn't exist
RUN mkdir -p /app/static/uploadedImages
# Expose the port that your Flask app will run on internally
EXPOSE 5000
# Define environment variable for Flask
ENV FLASK_APP=main.py
# Copy the start script into the container
COPY start.sh /app/start.sh
# Set the entry point to the start script
ENTRYPOINT ["/app/start.sh"]
I had no luck; I left that running in a loop overnight, no crash :( Can you describe how the pool and datasets are set up? What properties are set, etc.?
pool: spool
state: ONLINE
scan: scrub repaired 0B in 00:00:38 with 0 errors on Fri Aug 30 12:14:01 2024
config:
NAME        STATE     READ WRITE CKSUM
spool       ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    vdb     ONLINE       0     0     0
    vdc     ONLINE       0     0     0
errors: No known data errors
NAME USED AVAIL REFER MOUNTPOINT
spool 7.13G 7.89G 96K /spool
spool/docker 1.91G 7.89G 1.91G /spool/docker
spool/kevin 93.4M 7.89G 93.4M /var/www/einsteinmedneuroscience/kevin
spool/mysql-else 382M 7.89G 374M /spool/mysql-else
spool/mysql-wp 1.97G 7.89G 1.61G /spool/mysql-wp
spool/odes 269M 7.89G 269M /spool/odes
spool/wp 2.50G 7.89G 2.49G /var/www/localhost/htdocs/neuroscience
NAME PROPERTY VALUE SOURCE
spool size 15.5G -
spool capacity 45% -
spool altroot - default
spool health ONLINE -
spool guid 7238931578768070567 -
spool version - default
spool bootfs - default
spool delegation on default
spool autoreplace off default
spool cachefile - default
spool failmode wait default
spool listsnapshots off default
spool autoexpand off default
spool dedupratio 1.00x -
spool free 8.37G -
spool allocated 7.13G -
spool readonly off -
spool ashift 12 local
spool comment - default
spool expandsize - -
spool freeing 0 -
spool fragmentation 16% -
spool leaked 0 -
spool multihost off default
spool checkpoint - -
spool load_guid 7771301092460470127 -
spool autotrim off default
spool compatibility off default
spool bcloneused 0 -
spool bclonesaved 0 -
spool bcloneratio 1.00x -
spool feature@async_destroy enabled local
spool feature@empty_bpobj active local
spool feature@lz4_compress active local
spool feature@multi_vdev_crash_dump enabled local
spool feature@spacemap_histogram active local
spool feature@enabled_txg active local
spool feature@hole_birth active local
spool feature@extensible_dataset active local
spool feature@embedded_data active local
spool feature@bookmarks enabled local
spool feature@filesystem_limits enabled local
spool feature@large_blocks enabled local
spool feature@large_dnode enabled local
spool feature@sha512 enabled local
spool feature@skein enabled local
spool feature@edonr enabled local
spool feature@userobj_accounting active local
spool feature@encryption enabled local
spool feature@project_quota active local
spool feature@device_removal enabled local
spool feature@obsolete_counts enabled local
spool feature@zpool_checkpoint enabled local
spool feature@spacemap_v2 active local
spool feature@allocation_classes enabled local
spool feature@resilver_defer enabled local
spool feature@bookmark_v2 enabled local
spool feature@redaction_bookmarks enabled local
spool feature@redacted_datasets enabled local
spool feature@bookmark_written enabled local
spool feature@log_spacemap active local
spool feature@livelist enabled local
spool feature@device_rebuild enabled local
spool feature@zstd_compress enabled local
spool feature@draid enabled local
spool feature@zilsaxattr active local
spool feature@head_errlog active local
spool feature@blake3 enabled local
spool feature@block_cloning enabled local
spool feature@vdev_zaps_v2 active local
spool feature@redaction_list_spill enabled local
spool feature@raidz_expansion enabled local
I ran the docker build twice and it did crash again once. When I run the same command on the staging server, it never crashes, but that server has only btrfs disks.
Maybe it's something in requirements.txt? There was also start.sh missing; I solved both by touching an empty file. Perhaps something's happening with the Python side of things. Could you please supply those too?
start.sh
#!/bin/bash
# Function to handle SIGTERM
trap 'kill -TERM $PID' TERM INT
# Start cron in the background
/usr/sbin/cron &
# Start gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 main:app &
# Capture PID of gunicorn
PID=$!
# Wait for gunicorn process
wait $PID
requirements.txt
numpy
pillow
opencv-python-headless
pandas
ultralytics
flask
gunicorn
torch==2.4.0
Come to think of it, the crash always occurs during the processing of requirements.txt.
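If that's really the trigger, Docker may not even be needed; looping just the pip step against a path on the same pool might be enough to reproduce it (a sketch; the /spool/piptest path is made up, and requirements.txt is the file above):

```sh
# sketch: hammer the ZFS pool with only the pip-install step, no docker involved
while true; do
  rm -rf /spool/piptest && mkdir -p /spool/piptest
  python3 -m venv /spool/piptest/venv
  /spool/piptest/venv/bin/pip install --no-cache-dir -r requirements.txt || break
done
```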
I left it looping for 8 hours, nothing :(
Uh, I enabled block cloning in the hope of reproducing this, thinking it could be related... only to see endless txg syncs with no data written whatsoever. OK, got it; I'm keeping that feature off and recreating my dev pool now :-D
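For what it's worth, whether cloning is actually in play on the reporter's pool is quick to check (bcloneused above is already 0; the module parameter name is assumed from 2.2.x):

```sh
# sketch: confirm block cloning is unused on the pool and disabled in the module
zpool get feature@block_cloning,bcloneused spool
cat /sys/module/zfs/parameters/zfs_bclone_enabled
```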
I have the most recent zfs-9999 installation (I assume this is a recent GitHub snapshot) of ZFS installed on Gentoo x64 with a gentoo-sources 6.10.6 kernel. I have one mirrored zpool. I am seeing occasional kernel panics.
The pool passes scrub with no errors.
I'm not sure of any other way to document the panics. I have another, slightly older, similar system running 6.10.5 and it has yet to crash.
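If it helps, the panic text can often be recovered from the previous boot's kernel log, assuming persistent journald logging (otherwise a serial console or netconsole is the usual fallback):

```sh
# sketch: pull kernel messages from the boot that panicked
journalctl -k -b -1 | tail -200
```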