openSUSE / libsolv

Library for solving packages and reading repositories
http://en.opensuse.org/openSUSE:Libzypp_satsolver

segfault from libsolv during dependency checking for reposync from source with large metadata file. #543

Open filsdepatrick opened 7 months ago

filsdepatrick commented 7 months ago

Hello,

My team manages the repository-mirror service at my company. We mirror rpm repos from our devops team and we seem to have hit a bug with libsolv on ol8. When using reposync to mirror a repo with a large (527 MB) metadata file (xxxxxx-filelists.xml.gz), reposync segfaults with a coredump and this error message:

2023-11-14T03:06:57.243146-08:00 repository-mirror002 kernel: [45200.322222] Code: 89 45 00 48 01 f0 48 8d 50 01 81 fb ff 1f 00 00 76 35 81 fb ff ff ff 07 77 55 81 fb ff ff 0f 00 77 6c 89 d9 c1 e9 0d 83 c9 80 <88> 08 89 d9 48 8d 42 01 48 83 c2 02 c1 e9 06 83 c9 80 88 4a fe eb
2023-11-14T03:11:41.520088-08:00 repository-mirror002. kernel: [45484.589441] reposync[1210499]: segfault at 7f93641f8013 ip 00007f9520a9a6bb sp 00007ffe9edf7250 error 6 in libsolv.so.1[7f9520a62000+90000]

This error reproduces with the distro version of the libsolv package installed (0.7.20-4), with the version of libsolv from ol9 (0.7.22-4), and with the latest version from https://github.com/openSUSE/libsolv (0.7.26).

The segfault occurs after the metadata from the source mirror is downloaded completely, and while memory is being re-allocated during the package dependency analysis that reposync initiates via calls to the libsolv library. There is some gdb output that was collected during the segfault that I'll paste below.

strace output of reposync failure:

using libsolv-0.7.20-4:

mremap(0x7f3c9bbf2000, 2080378880, 2147487744, MREMAP_MAYMOVE) = 0x7f3c9bbf2000
mremap(0x7f3c9bbf2000, 2147487744, 2281705472, MREMAP_MAYMOVE) = 0x7f3c9bbf2000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f3c1bbf2013} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
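One way to read the numbers in this strace output, offered as a sketch and not as a confirmed diagnosis: the faulting address lies below the mapping returned by the last mremap(), and its distance from the mapping base is what a 32-bit offset just past 2^31 would become if it were treated as a signed value. The helper name `to_int32` is made up for this illustration.

```python
# Values copied from the strace output above (libsolv-0.7.20-4).
MAP_BASE = 0x7F3C9BBF2000    # address returned by the last mremap()
MAP_SIZE = 2281705472        # new size of that mapping
FAULT_ADDR = 0x7F3C1BBF2013  # si_addr reported with SIGSEGV


def to_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


delta = FAULT_ADDR - MAP_BASE         # negative: the fault is below the mapping
unsigned_offset = delta & 0xFFFFFFFF  # the same distance as an unsigned 32-bit offset

print(delta)                               # -2147483629
print(hex(unsigned_offset))                # 0x80000013
print(to_int32(unsigned_offset) == delta)  # True
print(unsigned_offset < MAP_SIZE)          # True
```

If this reading is right, the unsigned offset 0x80000013 would have been inside the 2281705472-byte mapping, while the sign-extended one points 2 GiB lower at unmapped memory, which would match SEGV_MAPERR.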

using libsolv-0.7.26:

mremap(0x7f8b3a6f3000, 2147487744, 2281705472, MREMAP_MAYMOVE) = 0x7f8c4e6f7000
mremap(0x7f8c4e6f7000, 2281705472, 18446744071562072064, MREMAP_MAYMOVE) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 18446744071562072064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 18446744071562207232, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 18446744071562072064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Out of memory allocating 1844674"..., 53Out of memory allocating 18446744071562070016 bytes!
) = 53
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid() = 2451293
gettid() = 2451293
tgkill(2451293, 2451293, SIGABRT) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=2451293, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)
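The huge allocation size in this trace can be decoded with a little arithmetic. The sketch below assumes the value is a negative 32-bit quantity that was sign-extended to 64 bits before being passed to the allocator; that is an interpretation of the numbers, not something the trace states. The helper name `as_signed` is invented for the example.

```python
# Size taken from the "Out of memory allocating ... bytes!" message above.
REQUESTED = 18446744071562070016


def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


signed64 = as_signed(REQUESTED, 64)  # the request seen as a signed 64-bit number
low32 = REQUESTED & 0xFFFFFFFF       # the same request truncated to 32 bits

print(signed64)                    # -2147481600
print(low32)                       # 2147485696
print(low32 - 2**31)               # 2048
print(as_signed(low32, 32) == signed64)  # True
```

Under that assumption the size actually wanted was about 2 GiB plus 2048 bytes, just over what a signed 32-bit integer can hold.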

(gdb) bt full

0 0x00007ffff6701f6f in __memmove_avx_unaligned_erms () from /lib64/libc.so.6

No symbol table info available.

1 0x00007ffff2f86784 in data_addblob.isra () from /lib64/libsolv.so.1

No symbol table info available.

2 0x00007ffff2f87f22 in repodata_serialize_key.isra () from /lib64/libsolv.so.1

No symbol table info available.

3 0x00007ffff2f924ef in repodata_internalize () from /lib64/libsolv.so.1

No symbol table info available.

4 0x00007ffff2d3f548 in repo_add_rpmmd () from /lib64/libsolvext.so.1

No symbol table info available.

5 0x00007ffff3dc2279 in load_filelists_cb(s_Repo*, _IO_FILE*) () from /lib64/libdnf.so.2

No symbol table info available.

6 0x00007ffff3dc4ddb in load_ext(_DnfSack*, libdnf::Repo*, _hy_repo_repodata, char const*, char const*, int (*)(s_Repo*, _IO_FILE*), _GError**) () from /lib64/libdnf.so.2

No symbol table info available.

7 0x00007ffff3dc5427 in dnf_sack_load_repo () from /lib64/libdnf.so.2

No symbol table info available.

8 0x00007fffe66d562e in load_repo(_SackObject*, _object*, _object*) () from /usr/lib64/python3.6/site-packages/hawkey/_hawkey.so

No symbol table info available.

9 0x00007ffff7537b84 in PyCFunction_Call () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

10 0x00007ffff754526f in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

11 0x00007ffff751b8d8 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

12 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

13 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

14 0x00007ffff749c744 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

15 0x00007ffff751bac0 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

16 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

17 0x00007ffff754052a in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

18 0x00007ffff751b8d8 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

19 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

20 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

21 0x00007ffff751b8d8 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

22 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

23 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

24 0x00007ffff751b8d8 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

25 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

26 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

27 0x00007ffff751b8d8 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

28 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

29 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

30 0x00007ffff749c744 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

31 0x00007ffff751bac0 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

32 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

33 0x00007ffff753f8e8 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

34 0x00007ffff749c744 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

35 0x00007ffff751bac0 in fast_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

36 0x00007ffff753ec97 in call_function () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

37 0x00007ffff754052a in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

38 0x00007ffff749c744 in _PyEval_EvalCodeWithName () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

39 0x00007ffff755c593 in PyEval_EvalCode () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

40 0x00007ffff75aaa62 in run_mod () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

41 0x00007ffff747ce9c in PyRun_FileExFlags () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

42 0x00007ffff748209e in PyRun_SimpleFileExFlags () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

43 0x00007ffff7482918 in Py_Main.cold.3359 () from /lib64/libpython3.6m.so.1.0

No symbol table info available.

44 0x0000555555400b96 in main ()

No symbol table info available.

That one crashes doing a memcpy() in the inlined data_addblob function:

(gdb) disass 0x00007ffff2f86784
Dump of assembler code for function data_addblob.isra.11:
   0x00007ffff2f86730 <+0>: push r14
   0x00007ffff2f86732 <+2>: mov r14,rdx
   0x00007ffff2f86735 <+5>: push r13
   0x00007ffff2f86737 <+7>: mov r13,rdi
   0x00007ffff2f8673a <+10>: push r12
   0x00007ffff2f8673c <+12>: movsxd r12,ecx
   0x00007ffff2f8673f <+15>: push rbp
   0x00007ffff2f86740 <+16>: mov rbp,r12
   0x00007ffff2f86743 <+19>: push rbx
   0x00007ffff2f86744 <+20>: mov rbx,rsi
   0x00007ffff2f86747 <+23>: movsxd rax,DWORD PTR [rsi]
   0x00007ffff2f8674a <+26>: mov rdi,QWORD PTR [rdi]
   0x00007ffff2f8674d <+29>: cmp r12,0x1
   0x00007ffff2f86751 <+33>: je 0x7ffff2f86790 <data_addblob.isra.11+96>
   0x00007ffff2f86753 <+35>: lea rsi,[r12+rax*1]
   0x00007ffff2f86757 <+39>: lea rcx,[rax-0x1]
   0x00007ffff2f8675b <+43>: lea rdx,[rsi-0x1]
   0x00007ffff2f8675f <+47>: or rcx,0x3ff
   0x00007ffff2f86766 <+54>: or rdx,0x3ff
   0x00007ffff2f8676d <+61>: cmp rcx,rdx
   0x00007ffff2f86770 <+64>: jne 0x7ffff2f8679f <data_addblob.isra.11+111>
   0x00007ffff2f86772 <+66>: mov QWORD PTR [r13+0x0],rdi
   0x00007ffff2f86776 <+70>: mov rdx,r12
   0x00007ffff2f86779 <+73>: mov rsi,r14
   0x00007ffff2f8677c <+76>: add rdi,rax
   0x00007ffff2f8677f <+79>: call 0x7ffff2f5b0b0 <memcpy@plt>
=> 0x00007ffff2f86784 <+84>: add DWORD PTR [rbx],ebp
   0x00007ffff2f86786 <+86>: pop rbx
   0x00007ffff2f86787 <+87>: pop rbp
   0x00007ffff2f86788 <+88>: pop r12
   0x00007ffff2f8678a <+90>: pop r13
   0x00007ffff2f8678c <+92>: pop r14
   0x00007ffff2f8678e <+94>: ret
   0x00007ffff2f8678f <+95>: nop
   0x00007ffff2f86790 <+96>: mov rdx,rax
   0x00007ffff2f86793 <+99>: lea rsi,[rax+0x1]
   0x00007ffff2f86797 <+103>: and edx,0x3ff
   0x00007ffff2f8679d <+109>: jne 0x7ffff2f86772 <data_addblob.isra.11+66>
   0x00007ffff2f8679f <+111>: mov ecx,0x3ff
   0x00007ffff2f867a4 <+116>: mov edx,0x1
   0x00007ffff2f867a9 <+121>: call 0x7ffff2f5a970 <solv_extend_realloc@plt>
   0x00007ffff2f867ae <+126>: mov rdi,rax
   0x00007ffff2f867b1 <+129>: movsxd rax,DWORD PTR [rbx]
   0x00007ffff2f867b4 <+132>: jmp 0x7ffff2f86772 <data_addblob.isra.11+66>
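As a reading aid for the disassembly, here is a small Python model of what the instructions appear to do. It is not taken from the libsolv source. The current length is loaded with `movsxd` (a sign-extending 32-bit load) and updated with a 32-bit `add`, and the two `or ...,0x3ff` instructions look like a check for whether the new length crosses a 1024-byte block boundary. The function names below are hypothetical, and the special case for a one-byte append (the `cmp r12,0x1` branch) is left out.

```python
BLOCK_MASK = 0x3FF  # constant visible in the disassembly (1024-byte blocks)


def to_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def needs_realloc(cur_len: int, add_len: int) -> bool:
    """Model of the `or`/`cmp` pair: does appending cross a block boundary?"""
    return ((cur_len - 1) | BLOCK_MASK) != ((cur_len + add_len - 1) | BLOCK_MASK)


def add_blob_len(cur_len: int, add_len: int) -> int:
    """Model of `add DWORD PTR [rbx],ebp`: a 32-bit add that wraps around."""
    return to_int32(cur_len + add_len)


# Small lengths behave as expected.
print(needs_realloc(1000, 10))  # False: stays inside the first block
print(needs_realloc(1020, 10))  # True: crosses into the next block

# Once the running length passes 2**31 - 1 the stored value turns negative,
# and a sign-extending load would then yield a negative offset into the buffer.
wrapped = add_blob_len(2**31 - 16, 32)
print(wrapped)                  # -2147483632
```

In this model, a buffer that grows past 2 GiB ends up with a negative length, which would be consistent with both the out-of-bounds memcpy() seen here and the enormous allocation request seen with 0.7.26.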

These are the rpm versions for the 3 relevant rpms in case they matter:

2023-11-17 05:22:35PST [ user@repository-mirror002:~ ] $ rpm -qi libdnf
Name : libdnf
Version : 0.63.0
Release : 14.0.1.el8_8
Architecture: x86_64
Install Date: Tue 20 Jun 2023 08:49:36 AM PDT
Group : Unspecified
Size : 2417728
License : LGPLv2+
Signature : RSA/SHA256, Tue 16 May 2023 05:08:46 PM PDT, Key ID 82562ea9ad986da3
Source RPM : libdnf-0.63.0-14.0.1.el8_8.src.rpm
Build Date : Tue 16 May 2023 05:06:01 PM PDT
Build Host : build-ol8-x86_64.oracle.com
Relocations : (not relocatable)
Vendor : Oracle America
URL : https://github.com/rpm-software-management/libdnf
Summary : Library providing simplified C and Python API to libsolv
Description :
A Library providing simplified C and Python API to libsolv.

2023-11-17 05:22:47PST [ user@repository-mirror002:~ ] $ rpm -qi libsolv
Name : libsolv
Version : 0.7.20
Release : 4.el8_7
Architecture: x86_64
Install Date: Thu 16 Nov 2023 12:35:44 PM PST
Group : Unspecified
Size : 803747
License : BSD
Signature : RSA/SHA256, Wed 14 Dec 2022 06:39:09 AM PST, Key ID 82562ea9ad986da3
Source RPM : libsolv-0.7.20-4.el8_7.src.rpm
Build Date : Wed 14 Dec 2022 06:35:33 AM PST
Build Host : build-ol8-x86_64.oracle.com
Relocations : (not relocatable)
Vendor : Oracle America
URL : https://github.com/openSUSE/libsolv
Summary : Package dependency solver
Description :
A free package dependency solver using a satisfiability algorithm. The library is based on two major, but independent, blocks:

mlschroe commented 7 months ago

Can you please also create a backtrace for the 0.7.26 version?

mlschroe commented 7 months ago

Can I access the repository with the big repodata so that I can reproduce the crash?

filsdepatrick commented 7 months ago

Unfortunately I'm unable to reproduce the error now as the source repo has been pruned down to about 6k packages, and the filelists.xml.gz file is now only 250MB in size. That seems to support my assumption that the error is related to the size of the metadata. The repo itself is an internal corporate repo with proprietary packages, so I'm not able to provide access to it.

filsdepatrick commented 7 months ago

I was able to reproduce the segmentation fault by creating a large repo from rpms from several of our devops upstream repos. The repo has over 13k rpms and a filelists.xml.gz file that is approx 510 MB. I'm attaching the backtrace of the reposync process attempting to mirror this repo. This was run with libsolv-0.7.26.

backtrace_with_libsolv-0.7.26.txt

filsdepatrick commented 7 months ago

I can provide the primary.xml.gz file that has the packages, sizes, provides and dependencies. Although I cannot give access to the source repository itself, the primary.xml file should help to create a repo that can be used to reproduce the error. The backtrace attached to the previous comment has the reposync command as we're running it at the top of the file.

2ab0ae584b7537d0f21de98c29468ecb1cbc3964a07fce0ca2820195cc92152c-primary.xml.gz

filsdepatrick commented 7 months ago

I can provide some additional test results from reproducing this issue in a lab using the test repo described above on an internal upstream mirror:

Summary:

Starting state of test repo: 13279 rpms, filelists.xml.gz 510 MB

I tested for the occurrence of the segfault after reducing the size of the test repo by 1000 packages and regenerating the metadata on each iteration. After the first reduction, the segfault still occurred on the downstream lab host when attempting to use reposync to mirror it. After the second reduction of 1000 packages, the segfault no longer occurred.

-1000 packages segfault occurs (filelists.xml.gz 428 MB)
-1000 packages no segfault (filelists.xml.gz 371 MB)

To find the lower threshold, I added back the most recently eliminated packages by halves and observed:

+500 packages no segfault (filelists.xml.gz 389 MB)
+250 packages no segfault (filelists.xml.gz 407 MB)
+125 packages no segfault (filelists.xml.gz 419 MB)
+68 packages no segfault (filelists.xml.gz 424 MB)
+28 packages no segfault (filelists.xml.gz 426 MB)
+29 (remaining packages to bring to same state as last segfault) segfault recurs (filelists.xml.gz 428 MB)

The segfault occurred when the filelists.xml.gz file reached a size of 428 MB, but did not occur when that metadata file was 426 MB in size.

The metadata file listed 74501694 files and directories when the segfault occurred, versus 74133397 at the next smaller metadata file size, where there was no segfault (a difference of 368297 entries).
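For anyone who wants to collect the same kind of count on their own repo, a minimal sketch follows. How the counts above were actually obtained is not stated, so this is only one plausible method: it streams a filelists.xml.gz and counts the `<file>` elements (which cover both files and directories) in the standard rpm-md filelists namespace. The function name `count_file_entries` and the example path are placeholders.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by rpm-md filelists metadata.
FILELISTS_NS = "{http://linux.duke.edu/metadata/filelists}"


def count_file_entries(path: str) -> int:
    """Count <file> elements in a gzip-compressed filelists.xml, streaming."""
    count = 0
    with gzip.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == FILELISTS_NS + "file":
                count += 1
            elif elem.tag == FILELISTS_NS + "package":
                # Drop the finished package's children to keep memory flat.
                elem.clear()
    return count


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python count_files.py repodata/xxxxxx-filelists.xml.gz
    if len(sys.argv) > 1:
        print(count_file_entries(sys.argv[1]))
```

Streaming with `iterparse` matters here because the uncompressed metadata for a repo of this size would not comfortably fit in memory as a full tree.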