ghost opened this issue 2 years ago
Thanks for the report.
Which version/commit of libpst are you running?
What is the exact command you are running in the two cases?
I'm unable to reproduce this with the sample files I have already.
If you could attach a sample without any private data in it that lets you reproduce the issue, that would be good.
Are you able to use git bisect to track down when this started?
-- bye, pabs
It seems this bug doesn't depend on the version of libpst, but on the version of Debian. I built the most recent version and 0.6.71 on a Debian 11 server and both versions had this behavior, while building the same source on Debian 10 worked fine. I can reproduce it with the minimal Debian 11 container image from https://images.linuxcontainers.org/ and lxc. After:
apt update && apt install wget readpst
wget https://files.enlacehacktivista.org/Pronico%20mail/aBrezhnev.pst
readpst -D -M -b aBrezhnev.pst
readpst -D -M -b aBrezhnev.pst
readpst -D -M -b aBrezhnev.pst
readpst -j 0 -D -M -b aBrezhnev.pst
du -h --max-depth=1
I get:
419M ./aBrezhnev
408M ./aBrezhnev1
417M ./aBrezhnev2
436M ./aBrezhnev3
The last one with -j 0 is the largest, while the ones with parallel jobs are missing some random amount of mail.
I can reproduce it with the file you linked now.
Thanks for the debugging, that is a very strange bug.
Perhaps it is a compiler or optimisation issue.
I will try to figure out what is going on.
It definitely isn't affected by the optimisation level. Seems to happen with GCC 9-12 but not GCC 8 on Debian buster.
I'll continue looking at this tomorrow, but if you have any more ideas for debugging, that would be great.
Yeah, looks like this could be hard to debug. I "fixed" it by just running with -j 0 and using xargs -P to parallelize. I tried building with clang on both servers and again it works fine on Debian 10 but not on 11. Of course they're different versions, llvm-7 vs llvm-11, but it doesn't seem likely that both newer clang and newer gcc would cause a compiler issue. readpst also links a lot of libraries; maybe it's caused by a different version of some library between Debian 10 and 11?
Looks like you are onto something: building the binary on buster and then copying it to bullseye breaks the parallel extraction there, even though the binary itself still runs fine.
Upgrading libgsf-1-114 didn't change anything, but upgrading libc6, libgd3, libglib2.0-0 and libstdc++6 (and their dependencies) did. Their dependency versions prevented upgrading them individually.
I'm doing a debbisect run of unstable for this issue, between the release date of buster and the release date of bullseye. This relies on downloads from the Debian wayback machine, which has rate limits, so the run could take a while to complete, but it will pinpoint the exact package difference that causes the problem.
https://manpages.debian.org/man/debbisect https://wiki.debian.org/BisectDebian
-- bye, pabs
According to the debbisect process, upgrading libc6 from 2.28-10 to 2.29-1 broke the jobs feature.
I checked this issue on a Fedora 36 live image with glibc 2.35 and found that it still occurs there. I'll discuss it further with the Debian and upstream glibc maintainers.
Seems like a tricky bug, thanks for looking into it!
Yeah, the next step is to bisect glibc to figure out which commit caused the issue and then decide whether libpst or glibc needs to be fixed. I got some build/bisect tips from the Debian glibc maintainers yesterday, so I will be looking at doing this some time this week.
-- bye, pabs
After a false start I managed to redo the bisect and got this bad commit:
0b727ed4d605d9318cb0d323c88abb0d5a441a9b is the first bad commit
commit 0b727ed4d605d9318cb0d323c88abb0d5a441a9b
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date: Thu Jul 13 16:43:24 2017 -0300
libio: Flush stream at freopen (BZ#21037)
As POSIX states [1] a freopen call should first flush the stream as if by a
call fflush. C99 (n1256) and C11 (n1570) only states the function should
first close any file associated with the specific stream. Although current
implementation only follow C specification, current BSD and other libc
implementation (musl) are in sync with POSIX and fflush the stream.
This patch change freopen{64} to fflush the stream before actually reopening
it (or returning if the stream does not support reopen). It also changes the
Linux implementation to avoid a dynamic allocation on 'fd_to_filename'.
Checked on x86_64-linux-gnu.
[BZ #21037]
* libio/Makefile (tests): Add tst-memstream4 and tst-wmemstream4.
* libio/freopen.c (freopen): Sync stream before reopen and adjust to
new fd_to_filename interface.
* libio/freopen64.c (freopen64): Likewise.
* libio/tst-memstream.h: New file.
* libio/tst-memstream4.c: Likewise.
* libio/tst-wmemstream4.c: Likewise.
* sysdeps/generic/fd_to_filename.h (fd_to_filename): Change signature.
* sysdeps/unix/sysv/linux/fd_to_filename.h (fd_to_filename): Likewise
and remove internal dynamic allocation.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/
:100644 100644 1fdd50a24c2629f064f8cba8d1e45814c12600d9 d0f7f5192973e5f4923f393695fd7ae88ad314ca M ChangeLog
:040000 040000 3ea0511dc410e70a011c68e8b5e6c06b298486d0 6e6c5b0412c6493cb1b437e9a2a8807f4b9aa346 M libio
:040000 040000 f510da053f81e8ac2aedff729918f0f4a8727b89 921bde8e25684e7d8a2dc8c8af1c9228451863a3 M sysdeps
bisect run success
This is the git bisect log, which you can replay to copy/visualise the results locally:
$ git bisect log
git bisect start
# good: [3c03baca37fdcb52c3881e653ca392bba7a99c2b] Update NEWS, version.h, and features.h for glibc 2.28.
git bisect good 3c03baca37fdcb52c3881e653ca392bba7a99c2b
# bad: [56c86f5dd516284558e106d04b92875d5b623b7a] Tag 2.29 release
git bisect bad 56c86f5dd516284558e106d04b92875d5b623b7a
# bad: [f6b3331bbae638d1bb50813fceb429d3b3dc0eb9] mktime: fix bug with Y2038 DST transition
git bisect bad f6b3331bbae638d1bb50813fceb429d3b3dc0eb9
# good: [d90c9b1a121295a5e31810b899ab637f68898857] Invert sense of list of i686-class processors in sysdeps/x86/cpu-features.h.
git bisect good d90c9b1a121295a5e31810b899ab637f68898857
# bad: [5b784e8834ae92961eb6be89863ffb041e9c6e75] Add new ELF note types from Linux 4.19 to elf.h.
git bisect bad 5b784e8834ae92961eb6be89863ffb041e9c6e75
# bad: [2a4b25fad858ccae361afca6d4710e0e6b01b2b5] Use common bits/shm.h for more architectures.
git bisect bad 2a4b25fad858ccae361afca6d4710e0e6b01b2b5
# good: [7b1f9406761331cf35fe521fbdb592beecf68a2c] i386: Use _dl_runtime_[resolve|profile]_shstk for SHSTK [BZ #23716]
git bisect good 7b1f9406761331cf35fe521fbdb592beecf68a2c
# bad: [e7624d708d12c608daf179d90c0d6de74c24dd2c] Adjust name of ld.so in test-container.c.
git bisect bad e7624d708d12c608daf179d90c0d6de74c24dd2c
# bad: [a1ccc0654b4e858022bed10f4fb5d5ab6df58b62] x86: Use RTM intrinsics in pthread mutex lock elision
git bisect bad a1ccc0654b4e858022bed10f4fb5d5ab6df58b62
# good: [d0e91e8c4ef77997c940538eaa752799c9dda2d3] Move MREMAP_* to bits/mman-shared.h.
git bisect good d0e91e8c4ef77997c940538eaa752799c9dda2d3
# bad: [0b727ed4d605d9318cb0d323c88abb0d5a441a9b] libio: Flush stream at freopen (BZ#21037)
git bisect bad 0b727ed4d605d9318cb0d323c88abb0d5a441a9b
# first bad commit: [0b727ed4d605d9318cb0d323c88abb0d5a441a9b] libio: Flush stream at freopen (BZ#21037)
I note that freopen is used by pst_reopen, which is used by try_fork in readpst. So I'm thinking that maybe freopen is being used incorrectly somehow.
I used a simple program to check that freopen always resets the file position, so that isn't the problem and the issue seems more complicated than I thought.
I am now running the command with the debug log enabled, with and without -j0 in both Debian buster and bullseye, then I'll compare the logs and see if I can see what happened.
The builds have completed but weirdly the bullseye run didn't result in the error, maybe enabling the debug log fixes the issue somehow.
-- bye, pabs
Confirmed, dropping the debug log introduces the error again...
After enabling some commented-out printf calls, I noticed that one of the readpst child processes crashed with SIGSEGV; investigating.
Looks like maybe my printf patch caused the SIGSEGV.
Noticed another thing: only one folder seems affected and it seems the missing items have been skipped. @amulepeweichan can you confirm that you only ever get the issue with one folder and it's the same folder as me?
$ readpst -j 0 -D -M -b aBrezhnev.pst
...
Processing Folder "Входящие"
"Входящие" - 926 items done, 18 items skipped.
...
$ readpst -D -M -b aBrezhnev.pst
...
"Входящие" - 896 items done, 48 items skipped.
...
$ diff -Naur <(cd aBrezhnev ; find | sort) <(cd aBrezhnev1 ; find | sort)
--- /dev/fd/63 2022-08-15 01:40:32.111623788 +0000
+++ /dev/fd/62 2022-08-15 01:40:32.111623788 +0000
@@ -904,40 +904,10 @@
./Входящие/894
./Входящие/895
./Входящие/896
-./Входящие/897
-./Входящие/898
-./Входящие/899
./Входящие/9
./Входящие/90
-./Входящие/900
-./Входящие/901
-./Входящие/902
-./Входящие/903
-./Входящие/904
-./Входящие/905
-./Входящие/906
-./Входящие/907
-./Входящие/908
-./Входящие/909
./Входящие/91
-./Входящие/910
-./Входящие/911
-./Входящие/912
-./Входящие/913
-./Входящие/914
-./Входящие/915
-./Входящие/916
-./Входящие/917
-./Входящие/918
-./Входящие/919
./Входящие/92
-./Входящие/920
-./Входящие/921
-./Входящие/922
-./Входящие/923
-./Входящие/924
-./Входящие/925
-./Входящие/926
./Входящие/93
./Входящие/94
./Входящие/95
OK, a second run got that folder affected and another one too, never mind. The increase in the skip count is exactly the same as the number of missing items though.
These debug messages are printed for each extra skipped item:
libpst.c:1283: pst_parse_block() returned an error for d_ptr->desc->i_id [0x3bc4]
readpst.c:262: A NULL item was seen
The reasons for each unparsed item seem different. I am getting the feeling that the code is reading data from an incorrect randomised location.
libpst.c(1675) ERROR: Unknown block constant - 0x2df4 for id 0x2ae10
libpst.c(1579) Unknown second block constant - 0x3404 0x3004 for id 0x2d54c
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2d54c]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x5f0e for id 0x2d84c
libpst.c(1675) ERROR: Unknown block constant - 0x6f for id 0xc400
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0xc400]
readpst.c(262) A NULL item was seen
libpst.c(1630) internal error (7c.b5 offset 0x20) in reading block id 0xfe06
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x11922
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x11922]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x3a for id 0x19bf0
libpst.c(1675) ERROR: Unknown block constant - 0x64 for id 0x19bf4
libpst.c(1675) ERROR: Unknown block constant - 0x41 for id 0x21ef0
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x2206a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2206a]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x220be
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x220be]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x220f2
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x220f2]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x2219a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2219a]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x238b8
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x238b8]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x1e41 for id 0x238d6
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x238d6]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x23db6
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x23db6]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x23e5e
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x23e5e]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x30 for id 0x23f3a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x23f3a]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x23fa6
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x23fa6]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x2429a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2429a]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x242ee
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x242ee]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x2576a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2576a]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x64 for id 0x257c8
libpst.c(1675) ERROR: Unknown block constant - 0x6d for id 0x257cc
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x257f6
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x257f6]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x25866
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x25866]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x258d6
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x258d6]
readpst.c(262) A NULL item was seen
libpst.c(1579) Unknown second block constant - 0x71 0x75 for id 0x25c62
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x25c62]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x25c9a
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x25c9a]
readpst.c(262) A NULL item was seen
libpst.c(1675) ERROR: Unknown block constant - 0x413a for id 0x25fa2
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x25fa2]
readpst.c(262) A NULL item was seen
libpst.c(1586) internal error (bc.b5.desc offset 0x9e0) in reading block id 0x2607c
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x2607c]
readpst.c(262) A NULL item was seen
libpst.c(1567) internal error (bc.b5 offset 0x20) in reading block id 0x26120
libpst.c(1283) pst_parse_block() returned an error for d_ptr->desc->i_id [0x26120]
readpst.c(262) A NULL item was seen
Aha, managed to reproduce it with a full debug log, so enabling that doesn't fix the issue. Good, that would have been weird :)
Looking at the full debug log, it seems to get completely different data, and if I use hexed to search for the bad data, it is at 0x6AF321C while the good data is at 0x6600, so my hunch about the read location being wrong seems correct.
Interestingly, the issue doesn't occur with -j 1.
With both -j 1 and -j 2, the file position reported by ftell is 0 after the call to freopen, with both the good and bad versions of glibc.
@amulepeweichan are you able to reproduce this with any other PST files? The size of the logs with this one makes it challenging to analyse and debug. I tried with the smallest ones on that site and didn't get the error.
Yes, I think it happens with most of the PST files on https://files.enlacehacktivista.org/Pronico%20mail/ I noticed the bug when unpacking them all with readpst on one server resulted in 2.7 TB of email total, and on another server just 1.7 TB. And then on that server I ran readpst again with -j 0 and got the same 2.7 TB.
On 8/16/22 23:11, Paul Wise wrote:
@amulepeweichan are you able to reproduce this with any other PST files? The size of the logs with this one makes it challenging to analyse and debug. I tried with the smallest ones on that site and didn't get the error.
Since this bug can cause data loss and is going to be very hard to debug and fix, I think I'm going to disable the -j option until I can figure out what exactly is going on.
-- bye, pabs
Edited the code to delete the output_mutex but that did not help. Same for the global_children semaphore.
I can still reproduce this in Debian bookworm (GCC 12 glibc 2.35) and sid (GCC 12 glibc 2.36).
I discovered that switching from freopen() to fclose() and fopen() fixes the issue.
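A sketch of what that change might look like (the struct and field names here are illustrative stand-ins, not the real libpst types):

```c
#include <stdio.h>

// Hypothetical stand-in for libpst's file handle; the real pst_file
// struct is different.
struct pst_handle {
    FILE *fp;
    const char *fname;
};

// Instead of:
//     self->fp = freopen(self->fname, "rb", self->fp);
// close the inherited stream and open the file afresh. The new FILE*
// is backed by a newly opened file descriptor, so the child process
// no longer shares a file offset with its parent.
int pst_handle_reopen(struct pst_handle *self) {
    if (self->fp != NULL)
        fclose(self->fp);
    self->fp = fopen(self->fname, "rb");
    return (self->fp != NULL) ? 0 : -1;
}
```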
I added some debugging to the pst_reopen function, and none of the function calls, including freopen, are failing in the bad case.
-- bye, pabs
I discovered that switching from freopen() to fseek(0, SEEK_SET) also fixes the issue.
I discovered that switching from freopen() to fseek(0) does not fix the issue and can cause segfaults.
I discovered that switching from freopen() to fflush()+fclose()+fopen() also fixes the issue.
I have now tried to extract mail from a .pst with readpst (from the Fedora 37 libpst-0.6.76-9.fc37.x86_64 package), and it seems to behave strangely: the number of messages extracted from the .PST varies, both when using different -e / -m / -M switches and even when using the same switch repeatedly(!). An example where I run readpst 3 times in a cycle with the different switches and list the number of extracted files:
$ for i in 1 2 3; do for s in e m M; do
A="outlook.pst"; OD="${A}-$s-$i";
[ -d "$OD" ] && rm -rf "$OD"; mkdir "$OD";
echo -e "\n$A -> $OD/ :";
/usr/bin/time readpst -D -o "$OD" -$s -t e "$A" >"$OD.lst";
echo -n "No. of files under $OD/ : "; find $OD -type f -print|wc -l;
done; done
I got these results:
outlook.pst -> outlook.pst-e-1/ :
8.91user 5.95system 0:01.49elapsed 992%CPU (0avgtext+0avgdata 20600maxresident)k
0inputs+5104768outputs (6major+712881minor)pagefaults 0swaps
No. of files under outlook.pst-e-1/ : 3293
outlook.pst -> outlook.pst-m-1/ :
12.75user 18.02system 0:57.17elapsed 53%CPU (0avgtext+0avgdata 21760maxresident)k
8inputs+15047672outputs (12major+1017676minor)pagefaults 0swaps
No. of files under outlook.pst-m-1/ : 6623
outlook.pst -> outlook.pst-M-1/ :
8.73user 5.92system 0:01.49elapsed 978%CPU (0avgtext+0avgdata 20444maxresident)k
0inputs+5142264outputs (3major+712103minor)pagefaults 0swaps
No. of files under outlook.pst-M-1/ : 3290
outlook.pst -> outlook.pst-e-2/ :
8.38user 5.16system 0:01.40elapsed 964%CPU (0avgtext+0avgdata 20548maxresident)k
0inputs+5133096outputs (6major+618609minor)pagefaults 0swaps
No. of files under outlook.pst-e-2/ : 3301
outlook.pst -> outlook.pst-m-2/ :
12.57user 17.95system 1:53.00elapsed 27%CPU (0avgtext+0avgdata 21832maxresident)k
80inputs+15450920outputs (25major+1001603minor)pagefaults 0swaps
No. of files under outlook.pst-m-2/ : 6645
outlook.pst -> outlook.pst-M-2/ :
8.74user 5.89system 0:01.57elapsed 928%CPU (0avgtext+0avgdata 20508maxresident)k
0inputs+5149504outputs (2major+681726minor)pagefaults 0swaps
No. of files under outlook.pst-M-2/ : 3295
outlook.pst -> outlook.pst-e-3/ :
8.62user 6.34system 0:02.18elapsed 686%CPU (0avgtext+0avgdata 20300maxresident)k
80inputs+5082768outputs (3major+693470minor)pagefaults 0swaps
No. of files under outlook.pst-e-3/ : 3291
outlook.pst -> outlook.pst-m-3/ :
13.13user 19.92system 2:15.04elapsed 24%CPU (0avgtext+0avgdata 21932maxresident)k
64inputs+15884072outputs (15major+1052849minor)pagefaults 0swaps
No. of files under outlook.pst-m-3/ : 6633
outlook.pst -> outlook.pst-M-3/ :
8.38user 6.01system 0:02.30elapsed 625%CPU (0avgtext+0avgdata 20332maxresident)k
88inputs+5162104outputs (3major+672984minor)pagefaults 0swaps
No. of files under outlook.pst-M-3/ : 3295
$ ls -l outlook.pst
-rw-r--r--. 1 jpraisova jpraisova 1950368768 Dec 16 2009 outlook.pst
Number of extracted messages, summarized in a switch/run-number matrix:
-e : 3293 3301 3291
-m : 6623 6645 6633
-M : 3290 3295 3295
Also, isn't it weird that the run time is around two seconds with the -e and -M switches, but around two minutes with the -m switch? Again, the run times at a glance:
-e : 1.49 1.40 2.18
-m : 57.17 1:53.00 2:15.04
-M : 1.49 1.57 2.30
(it seems that readpst uses multiple CPU threads with the -e/-M switches and only one with the -m switch, but even then the differences are very big)
Also, according to the readpst man page it appears that the number of extracted files with '-m' should be double (.EML + .MSG) the number with -e/-M, but it is bigger, maybe because there are some (maybe undocumented) files named 'temp_file_attachment'. Is there any explanation for this?
Try it with parallel mode disabled (-j 0) and it should work fine.
I haven't been able to figure out how to fix the parallel mode yet.
-- bye, pabs
Hi Paul, thanks for your help! The listing of the weird readpst run results for the -e, -m, and -M switches (3x each) posted above now looks fine (switch-round, no. of exported files, 'time' outputs):
e-1 : 3304 : 11.22user 10.32system 9:27.41elapsed 3%CPU (0avgtext+0avgdata 23196maxresident)k 3811000inputs+5162824outputs (41major+6067minor)pagefaults 0swaps
m-1 : 6681 : 6.06user 4.38system 0:24.76elapsed 42%CPU (0avgtext+0avgdata 24036maxresident)k 0inputs +12711808outputs (2major+7489minor)pagefaults 0swaps
M-1 : 3304 : 3.22user 1.28system 0:04.60elapsed 97%CPU (0avgtext+0avgdata 23676maxresident)k 0inputs +5162768outputs (0major+6101minor)pagefaults 0swaps
e-2 : 3304 : 3.57user 1.41system 0:10.42elapsed 47%CPU (0avgtext+0avgdata 23676maxresident)k 0inputs +5162744outputs (0major+6102minor)pagefaults 0swaps
m-2 : 6681 : 6.36user 4.56system 1:44.24elapsed 10%CPU (0avgtext+0avgdata 23968maxresident)k 0inputs +12711816outputs (0major+7532minor)pagefaults 0swaps
M-2 : 3304 : 3.13user 1.27system 0:04.42elapsed 99%CPU (0avgtext+0avgdata 23652maxresident)k 0inputs +5162768outputs (0major+6100minor)pagefaults 0swaps
e-3 : 3304 : 3.40user 1.33system 0:07.11elapsed 66%CPU (0avgtext+0avgdata 23676maxresident)k 0inputs +5162768outputs (0major+6099minor)pagefaults 0swaps
m-3 : 6681 : 5.46user 4.30system 1:39.38elapsed 9%CPU (0avgtext+0avgdata 24228maxresident)k 8inputs +12711824outputs (0major+7535minor)pagefaults 0swaps
M-3 : 3304 : 3.01user 1.28system 0:06.45elapsed 66%CPU (0avgtext+0avgdata 23676maxresident)k 0inputs +5162776outputs (0major+6102minor)pagefaults 0swaps
6681 extracted files (-m switch) == 2 * 3304 (-e or -M switch) + 73 'temp_file_attachment' files
There are still minor uncertainties:
Thanks, Franta
Can you file two new issues for these two items? They are off-topic on this issue.
I'm not entirely sure, but I think the temp_file_attachment things are a bug in the code that writes .msg files: it first creates temporary attachment files, then saves them into the .msg files, but does not clean up the temporary attachment files. The correct fix here would be to make it write the attachments directly to the .msg files instead.
The difference in speed of the -m option is likely because it writes double the data (.eml files plus .msg files), and because the .msg files are in the Microsoft Outlook binary format and are created by the libgsf library, which probably isn't very well optimised.
-- bye, pabs
I revisited this bug using the strace stack-trace suggestion from the glibc folks and found that this seems to be caused by readpst child processes doing a seek on a file descriptor shared with the parent process.
https://sourceware.org/pipermail/libc-help/2023-August/006430.html
The correct fix would appear to be to close the file and open it again.
Also, the fflush+fclose+fopen case actually fails the same as freopen.
If libc-help confirms, I'll be making a commit and backporting it.
-- bye, pabs
I've pushed a workaround for this bug to git. After some other fixes I'll make a new release and also backport it to other releases. At some point the readpst parallel jobs feature will need to be redesigned, probably the new design will need to use mmap.
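A redesign along those lines could read blocks through a private mapping instead of through a shared FILE offset. A rough sketch of the idea (the function name and error handling are mine, not libpst's):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read `len` bytes at `offset` from `path` via mmap, without touching
// any file offset, so concurrent readers cannot disturb each other.
// Returns 0 on success, -1 on error.
int read_at(const char *path, size_t offset, void *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || offset + len > (size_t)st.st_size) {
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (map == MAP_FAILED)
        return -1;
    memcpy(buf, (const char *)map + offset, len);
    munmap(map, (size_t)st.st_size);
    return 0;
}
```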
@fhanzlik FYI, I added a change to delete those temporary files. Not writing them in the first place should be possible, but it will require an incremental API for handling attachments, so I have deferred that.
When I run readpst -M -b file.pst it extracts less mail than when it's run with the -j 0 option, and a slightly different amount each time. An older libpst 0.71 extracts all mail in parallel without -j 0, so maybe there's some concurrency bug that got introduced since then? I can send you pst files if you have trouble reproducing, but for me this happens with basically every .pst file.