samuelncui / yatm

Yet Another Tape Manager - Full Functional Tape Management System with Web User Interface
BSD 2-Clause "Simplified" License
179 stars 5 forks source link

Web page + service crash upon Loading tape for job with multiple TBs #11

Open caiolopesdasilva opened 8 months ago

caiolopesdasilva commented 8 months ago

Hi Everyone, I'm trying to use LTFS and YATM into an ULTRIUM-HH8 tape reader for the first time and in my tests everything has been going well.

However, I have this issue where I add directories with multiple TBs of data to be loaded into an LTO-8 tape, upon selecting a tape and adding the barcode. my browser starts to consume all of my RAM and CPU, the page crashes and in the logs, YATM is getting stuck in mounting the tape.

I have to keep restarting the service and trying to delete the job multiple times until I am able to get access to the web gui again. in some of my smaller tests of 50GB single files being sent to tape it works ok.

This is running in an actual Dell PowerEdge 7415 server running CentOS 7.9

1x AMD EPYC 7401P 128GB RAM 256 O.S SSD

So I find it difficult that this would happen due to hardware limitations.

Does anyone have a clue of what can it be? When I issue LTFS commands and send data to the tape via CLI everything works fine.

Any help would be appreciated

Here are some logs:

time="2024-03-04T09:58:45+01:00" level=info msg="walk source all finished, get 251890 sources"
time="2024-03-04T10:06:00+01:00" level=info msg="+ mt -f /dev/nst0 load"
time="2024-03-04T10:06:00+01:00" level=info msg="+ sleep 5"
time="2024-03-04T10:06:05+01:00" level=info msg="+ sleep 3"
time="2024-03-04T10:06:08+01:00" level=info msg="+++ dirname ./scripts/mkfs"
time="2024-03-04T10:06:08+01:00" level=info msg="++ cd ./scripts"
time="2024-03-04T10:06:08+01:00" level=info msg="++ pwd"
time="2024-03-04T10:06:08+01:00" level=info msg="+ CURDIR=/opt/yatm/scripts"
time="2024-03-04T10:06:08+01:00" level=info msg="++ /opt/yatm/scripts/get_device"
time="2024-03-04T10:06:08+01:00" level=info msg="+ SG_DEVICE=/dev/sg8"
time="2024-03-04T10:06:08+01:00" level=info msg="+ /opt/ibm/ltfssde/bin/mkltfs -f -d /dev/sg8 -s 001130 -n GA-1130"
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS9015W Setting the locale to 'en_US.UTF-8'. If this is wrong, please set the LANG environment variable before starting mkltfs."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15000I Starting mkltfs, LTFS version 2.4.6.1 (10511), log level 2."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15041I Launched by \"/opt/ibm/ltfssde/bin/mkltfs -f -d /dev/sg8 -s 001130 -n GA-1130\"."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15042I This binary is built for Linux (x86_64)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15043I GCC version is 4.8.5 20150623 (Red Hat 4.8.5-44)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17087I Kernel version: Linux version 3.10.0-1160.102.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Oct 17 15:42:21 UTC 2023 i386."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17089I Distribution: NAME=\"CentOS Linux\"."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15003I Formatting device '/dev/sg8'."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15004I LTFS volume blocksize: 524288."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15005I Index partition placement policy: None."
time="2024-03-04T10:06:08+01:00" level=info
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS11337I Update index-dirty flag (1) - 001130 (0x0xcf0220)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17085I Plugin: Loading \"sg\" tape backend."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30209I Opening a device through sg-ibmtape driver (/dev/sg8)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30250I Opened the SCSI tape device 3.0.1.0 (/dev/sg8)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30207I Vendor ID is QUANTUM ."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30208I Product ID is ULTRIUM-HH8     ."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30214I Firmware revision is K4K5."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30215I Drive serial is 10WT060567."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30285I The reserved buffer size of /dev/sg8 is 524288."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS30294I Setting up timeout values from RSOC."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17160I Maximum device block size is 1048576."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS11330I Loading cartridge."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS11332I Load successful."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17157I Changing the drive setting to write-anywhere mode."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15049I Checking the medium (load)."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15010I Creating data partition b on SCSI partition 1."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS15011I Creating index partition a on SCSI partition 0."
time="2024-03-04T10:06:08+01:00" level=info msg="LTFS17165I Resetting the medium's capacity proportion."
time="2024-03-04T10:06:16+01:00" level=info msg="LTFS11097I Partitioning the medium."
time="2024-03-04T10:06:16+01:00" level=info msg="LTFS30263I MODESELECT returns Mode Parameters Rounded (-20101) /dev/sg8."
time="2024-03-04T10:06:37+01:00" level=info msg="LTFS11100I Writing label to partition b."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS11278I Writing index to partition b."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS30205I READ_ATTR (0x8c) returns -20501."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS30263I READ_ATTR returns Invalid Field in CDB (-20501) /dev/sg8."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS30233I Cannot read attribute (-20501)."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS11336I The attribute does not exist. Ignore the expected error."
time="2024-03-04T10:06:44+01:00" level=info msg="LTFS17235I Writing index of 001130 to b (Reason: Format, 0 files) 10WT060567."
time="2024-03-04T10:06:51+01:00" level=info msg="LTFS17236I Wrote index of 001130 (Gen = 1, Part = b, Pos = 5, 10WT060567)."
time="2024-03-04T10:06:51+01:00" level=info msg="LTFS11337I Update index-dirty flag (0) - 001130 (0x0xcf0220)."
time="2024-03-04T10:06:51+01:00" level=info msg="LTFS11100I Writing label to partition a."
time="2024-03-04T10:06:55+01:00" level=info msg="LTFS11278I Writing index to partition a."
time="2024-03-04T10:06:55+01:00" level=info msg="LTFS30205I READ_ATTR (0x8c) returns -20501."
time="2024-03-04T10:06:55+01:00" level=info msg="LTFS30263I READ_ATTR returns Invalid Field in CDB (-20501) /dev/sg8."
time="2024-03-04T10:06:55+01:00" level=info msg="LTFS30233I Cannot read attribute (-20501)."
time="2024-03-04T10:06:55+01:00" level=info msg="LTFS11336I The attribute does not exist. Ignore the expected error."
time="2024-03-04T10:07:00+01:00" level=info msg="LTFS17235I Writing index of 001130 to a (Reason: Format, 0 files) 10WT060567."
time="2024-03-04T10:07:05+01:00" level=info msg="LTFS17236I Wrote index of 001130 (Gen = 1, Part = a, Pos = 5, 10WT060567)."
time="2024-03-04T10:07:05+01:00" level=info msg="LTFS15013I Volume UUID is: ad1ad853-edb4-4e2b-bd74-75172b02ca23."
time="2024-03-04T10:07:05+01:00" level=info
time="2024-03-04T10:07:05+01:00" level=info msg="LTFS15019I Volume capacity is 11712 GB."
time="2024-03-04T10:07:21+01:00" level=info msg="LTFS30252I Logical block protection is disabled."
time="2024-03-04T10:07:21+01:00" level=info msg="LTFS15024I Medium formatted successfully."
time="2024-03-04T10:07:21+01:00" level=info msg="+ sleep 3"
time="2024-03-04T10:07:24+01:00" level=info msg="+++ dirname ./scripts/mount"
time="2024-03-04T10:07:24+01:00" level=info msg="++ cd ./scripts"
time="2024-03-04T10:07:24+01:00" level=info msg="++ pwd"
time="2024-03-04T10:07:24+01:00" level=info msg="+ CURDIR=/opt/yatm/scripts"
time="2024-03-04T10:07:24+01:00" level=info msg="++ /opt/yatm/scripts/get_device"
time="2024-03-04T10:07:24+01:00" level=info msg="+ SG_DEVICE=/dev/sg8"
time="2024-03-04T10:07:24+01:00" level=info msg="+ /opt/ibm/ltfssde/bin/ltfs -o devname=/dev/sg8 -o noatime -o sync_type=unmount -o work_directory=/opt/yatm/captured_indices -o capture_index -o min_pool_size=256 -o max_pool_size=1024 -o eject -s /tmp/3212318303.ltfs"
time="2024-03-04T10:07:24+01:00" level=info msg="LTFS9015W Setting the locale to 'en_US.UTF-8'. If this is wrong, please set the LANG environment variable before starting ltfs."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14000I LTFS starting, LTFS version 2.4.6.1 (10511), log level 2."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14058I LTFS Format Specification version 2.4.0."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14104I Launched by \"/opt/ibm/ltfssde/bin/ltfs -o devname=/dev/sg8 -o noatime -o sync_type=unmount -o work_directory=/opt/yatm/captured_indices -o capture_index -o min_pool_size=256 -o max_pool_size=1024 -o eject -s /tmp/3212318303.ltfs\"."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14105I This binary is built for Linux (x86_64)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14106I GCC version is 4.8.5 20150623 (Red Hat 4.8.5-44)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17087I Kernel version: Linux version 3.10.0-1160.102.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Oct 17 15:42:21 UTC 2023 i386."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17089I Distribution: NAME=\"CentOS Linux\"."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17089I Distribution: CentOS Linux release 7.9.2009 (Core)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS14064I Sync type is \"unmount\"."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17085I Plugin: Loading \"sg\" tape backend."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17085I Plugin: Loading \"unified\" iosched backend."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30209I Opening a device through sg-ibmtape driver (/dev/sg8)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30250I Opened the SCSI tape device 3.0.1.0 (/dev/sg8)."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30207I Vendor ID is QUANTUM ."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30208I Product ID is ULTRIUM-HH8     ."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30214I Firmware revision is K4K5."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30215I Drive serial is 10WT060567."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30285I The reserved buffer size of /dev/sg8 is 524288."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS30294I Setting up timeout values from RSOC."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS17160I Maximum device block size is 1048576."
time="2024-03-04T10:07:24+01:00" level=info msg="9471 LTFS11330I Loading cartridge."
time="2024-03-04T10:07:27+01:00" level=info msg="9471 LTFS11332I Load successful."
time="2024-03-04T10:07:27+01:00" level=info msg="9471 LTFS17157I Changing the drive setting to append-only mode."
time="2024-03-04T10:07:27+01:00" level=info msg="9471 LTFS11005I Mounting the volume."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Vendor = IBM     ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Application Name = LTFS                            ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Application Version = 2.4.6.1 ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Medium Label = GA-1130."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17228I Tape attribute: Text Localization ID = 0x81."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Barcode = 001130                          ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Application Format Version = 2.4.0           ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17228I Tape attribute: Volume Lock Status = 0x00."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS17227I Tape attribute: Media Pool name = ."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS11031I Volume mounted successfully. 001130 : Gen = 1 / (a, 5) -> (b, 5) / 10WT060567."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS14111I Initial setup completed successfully."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS14112I Invoke 'mount' command to check the result of final setup."
time="2024-03-04T10:07:43+01:00" level=info msg="9471 LTFS14113I Specified mount point is listed if succeeded."
time="2024-03-04T10:07:43+01:00" level=info msg="+ sleep 3"
time="2024-03-04T10:07:46+01:00" level=info msg="++ df /tmp/3212318303.ltfs --output=target"
time="2024-03-04T10:07:46+01:00" level=info msg="++ sed -n '1!p'"
time="2024-03-04T10:07:46+01:00" level=info msg="+ MOUNT_POINT_TARGET=/tmp/3212318303.ltfs"
time="2024-03-04T10:07:46+01:00" level=info msg="+ [[ /tmp/3212318303.ltfs != /tmp/3212318303.ltfs ]]"
time="2024-03-04T10:07:46+01:00" level=info msg="+ echo 'mount '\\''/tmp/3212318303.ltfs'\\'' success'"
time="2024-03-04T10:07:46+01:00" level=info msg="mount '/tmp/3212318303.ltfs' success"
time="2024-03-04T10:14:33+01:00" level=warning msg="update state failed while save job" error="save job fail, err= sql: transaction has already been committed or rolled back"
time="2024-03-04T10:14:33+01:00" level=warning msg="update state failed while save job" error="save job fail, err= context canceled"
time="2024-03-04T10:14:33+01:00" level=warning msg="update state failed while save job" error="save job fail, err= context canceled"
time="2024-03-04T10:14:33+01:00" level=warning msg="update state failed while save job" error="save job fail, err= context canceled"
[root@server ~]# systemctl status yatm-httpd.service -l
● yatm-httpd.service - YATM Service
   Loaded: loaded (/opt/yatm/yatm-httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2024-03-04 09:43:05 CET; 29min ago
     Docs: https://github.com/samuelncui/yatm/
 Main PID: 30645 (yatm-httpd)
   CGroup: /system.slice/yatm-httpd.service
           ├─30645 /opt/yatm/yatm-httpd
           └─38066 /opt/ibm/ltfssde/bin/ltfs -o devname=/dev/sg8 -o noatime -o sync_type=unmount -o work_directory=/opt/yatm/captured_indices -o capture_index -o min_pool_size=256 -o max_pool_size=1024 -o eject -s /tmp/3212318303.ltfs

Mar 04 10:12:59 server yatm-httpd[30645]: 2024/03/04 10:12:59 /home/runner/work/yatm/yatm/executor/job.go:53 SLOW SQL >= 200ms
Mar 04 10:12:59 server yatm-httpd[30645]: [225.397ms] [rows:1] UPDATE `jobs` SET `status`=3,`priority`=1,`state`="<binary>",`create_time`="2024-03-04 09:58:45.972",`update_time`="2024-03-04 10:12:58.964" WHERE `id` = 15
Mar 04 10:12:59 server yatm-httpd[30645]: 2024/03/04 10:12:59 /home/runner/work/yatm/yatm/executor/job.go:53 SLOW SQL >= 200ms
Mar 04 10:12:59 server yatm-httpd[30645]: [249.348ms] [rows:1] UPDATE `jobs` SET `status`=3,`priority`=1,`state`="<binary>",`create_time`="2024-03-04 09:58:45.972",`update_time`="2024-03-04 10:12:59.414" WHERE `id` = 15
Mar 04 10:13:00 server yatm-httpd[30645]: 2024/03/04 10:13:00 /home/runner/work/yatm/yatm/executor/job.go:53 SLOW SQL >= 200ms
Mar 04 10:13:00 server yatm-httpd[30645]: [207.976ms] [rows:1] UPDATE `jobs` SET `status`=3,`priority`=1,`state`="<binary>",`create_time`="2024-03-04 09:58:45.972",`update_time`="2024-03-04 10:12:59.882" WHERE `id` = 15
Mar 04 10:13:00 server yatm-httpd[30645]: 2024/03/04 10:13:00 /home/runner/work/yatm/yatm/executor/job.go:53 SLOW SQL >= 200ms
Mar 04 10:13:00 server yatm-httpd[30645]: [210.772ms] [rows:1] UPDATE `jobs` SET `status`=3,`priority`=1,`state`="<binary>",`create_time`="2024-03-04 09:58:45.972",`update_time`="2024-03-04 10:13:00.292" WHERE `id` = 15
Mar 04 10:13:01 server yatm-httpd[30645]: 2024/03/04 10:13:01 /home/runner/work/yatm/yatm/executor/job.go:53 SLOW SQL >= 200ms
Mar 04 10:13:01 server yatm-httpd[30645]: [258.660ms] [rows:1] UPDATE `jobs` SET `status`=3,`priority`=1,`state`="<binary>",`create_time`="2024-03-04 09:58:45.972",`update_time`="2024-03-04 10:13:00.736" WHERE `id` = 15
samuelncui commented 8 months ago

How many files are you trying to backup in this job? Could you give me the number of files, so I can reproduce this bug?

It's a known issue that if a job contains too many files, it might get very slow to load the web gui. But I don't know it can be this extreme.

caiolopesdasilva commented 8 months ago

It's a whole job from a company's storage server. There are cases that a job can be as big as a whole 13TB tape, this is one of the "small one" for us

1,48 TB 252.101 Files, 5.155 Folders

When I created a mount point to the tape, formatted in ltfs. I used a simple rsync -auvl --info=progress from this job to the tape. It took its time but it went well. so yeah its something about YATM that is not able to handle such a large number.

If there's anything I can do to help let me know. I will be out for the next three weeks but after I will be around. YATM is a very interesting tool for us

samuelncui commented 8 months ago

Thx for your Infomation, I will improve this in the future.