Linux Segfaults - Githubissues

BEFH commented 6 years ago

We are getting segfaults on some nodes of our cluster, but not on others, when running several pandoc versions.

Our failing nodes all have one of the following processors:

Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz

However, it consistently segfaults on some of these nodes but not on others.

It does not segfault on our AMD Opteron(TM) Processor 6276s.

Kernel versions are as follows: 2.6.32-358.23.2.el6.x86_64, 2.6.32-696.18.7.el6.x86_64, 2.6.32-358.23.2.el6.x86_64, 2.6.32-358.6.2.el6.x86_64

All nodes are on CentOS 6.2.

This occurs with the system pandoc (1.19.2.1), a modulecmd version (2.0), and the statically linked, pre-compiled version 2.1.3.

↪ pandoc --version
fish: 'pandoc --version' terminated by signal SIGSEGV (Address boundary error)

It also segfaults in bash:

[fultob01@interactive5 ~]$ pandoc --version
Segmentation fault

With my statically linked pandoc, I get a really nice strace. It looks like pandoc is trying to write to 0x4200000000, which is out of bounds. The node bode (one of the failing nodes) allows the write but shouldn't, so pandoc segfaults when it attempts to read. I have no idea what the solution is for this, but for now I'll use the nodes mothra or manda. Do you have any idea why bode is allowing pandoc to write to that address?

Here's the strace output from a failing node:

execve("/hpc/users/fultob01/local/bin/pandoc", ["pandoc"], [/* 54 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x49f0420)      = 0
set_tid_address(0x49f0458)              = 132178
brk(0)                                  = 0x6501000
brk(0x6502000)                          = 0x6502000
brk(0x6505000)                          = 0x6505000
brk(0x6508000)                          = 0x6508000
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
sysinfo({uptime=1493582, loads=[17088, 8768, 6176] totalram=67442647040, freeram=27952783360, sharedram=0, bufferram=101777408} t
brk(0x6519000)                          = 0x6519000
mmap(0x4200000000, 1099512676352, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x4200000000
mmap(0x4200000000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

On one of our working nodes, it looks like this:

execve("/hpc/users/fultob01/local/bin/pandoc", ["pandoc"], [/* 79 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x49f0420)      = 0
set_tid_address(0x49f0458)              = 113282
brk(0)                                  = 0x5e74000
brk(0x5e75000)                          = 0x5e75000
brk(0x5e78000)                          = 0x5e78000
brk(0x5e7b000)                          = 0x5e7b000
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 1999}, ...}) = 0
sysinfo({uptime=5121149, loads=[10304, 5024, 768] totalram=67440967680, freeram=43350913024, sharedram=0, bufferram=224522240} to
brk(0x5e8c000)                          = 0x5e8c000
mmap(0x4200000000, 1099512676352, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200001000, 549756862464, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200002000, 274878955520, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200003000, 137440002048, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200004000, 68720525312, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x4200005000, 34360786944, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x4200005000
munmap(0x4200005000, 1028096)           = 0
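
The request sizes in the two traces follow a pattern that is easy to miss in decimal. The sketch below only does arithmetic on numbers copied from the traces; the helper names are hypothetical. Each PROT_NONE request on the working node is a power-of-two number of GiB plus 1 MiB, halving after every ENOMEM until the 32 GiB request succeeds, and the munmap that follows trims the mapping to a 1 MiB boundary. Reading this as an allocator that reserves address space first and commits 1 MiB pieces later is an interpretation of the trace, not something confirmed in the thread. On the failing node the 1 TiB reservation succeeds and the first 1 MiB read/write mapping inside it is what returns ENOMEM.

```python
MIB = 1 << 20
GIB = 1 << 30

# Sizes of the PROT_NONE mmap requests in the working-node trace above.
REQUESTED = [
    1099512676352,
    549756862464,
    274878955520,
    137440002048,
    68720525312,
    34360786944,
]


def describe(size):
    """Split a request size into whole GiB and the leftover bytes."""
    return divmod(size, GIB)


def aligned_start(addr, alignment=MIB):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


if __name__ == "__main__":
    for size in REQUESTED:
        gib, extra = describe(size)
        print(f"{size:>14} bytes = {gib:>4} GiB + {extra} bytes")
    # The successful mapping starts at 0x4200005000; the munmap length in the
    # trace equals the distance from there to the next 1 MiB boundary.
    start = 0x4200005000
    print(hex(aligned_start(start)), aligned_start(start) - start)
```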

mb21 commented 6 years ago

What OS are you on? Can you try to compile from source on the exact OS you're trying to run it on?

If you're on Fedora, this is probably a duplicate of https://github.com/jgm/pandoc/issues/4461

BEFH commented 6 years ago

I'm on CentOS, which is related to Fedora. I will coordinate with my cluster people.

jgm commented 6 years ago

I agree, it would be helpful to know if the problem persists with pandoc compiled on the target system. On the systems where you get the segfault, what causes it? Does every pandoc command cause it? You mention pandoc --version. Do you also get the segfault with, say, a simple conversion? Does it matter whether you use -s? People have had similar problems on Windows 7; see #4283.

BEFH commented 6 years ago

Could you please suggest some simple commands with the files to use? It segfaults even if I run pandoc with no arguments, and it segfaults from rmarkdown.

On a related note, I'm getting an "error 139" on the working nodes for some files. The command run is as follows:

/sc/orga/projects/LOAD/Brian/anaconda3/bin/pandoc +RTS -K512m -RTS Post_imputation.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output /sc/orga/projects/LOAD/Brian/projects/ADSP/data/pre_impute_merge/impute_stats/stats/CHARGE_CHS_impStats.html --smart --email-obfuscation none --self-contained --standalone --section-divs --template /hpc/packages/minerva-common/rpackages/3.4.3/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /tmp/101010540.tmpdir/RtmpomNOqg/rmarkdown-str799f60520417.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'

I couldn't find error 139 anywhere in the source, but I have read that it might be related to segfaults.

jgm commented 6 years ago

I was thinking of something like

echo "Hello" | pandoc

We don't use exit code 139 for anything. You might try simplifying your command line piece by piece to see if you can isolate something that's correlated with the problem. I assume these nodes have enough memory to use 512m of stack space (that's what +RTS -K512m -RTS calls for)? One thing to try is increasing or decreasing this.
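
A minimal sketch of the "piece by piece" simplification, assuming the options can be dropped independently. The function name `variants_dropping_one` is hypothetical, and the option list is only a subset of the command quoted in this thread. The script prints candidate command lines and does not run pandoc itself.

```python
import shlex


def variants_dropping_one(base, options):
    """Yield (dropped_group, argv) pairs, each argv omitting one option group."""
    for i, dropped in enumerate(options):
        argv = list(base)
        for group in options[:i] + options[i + 1:]:
            argv.extend(group)
        yield dropped, argv


# Input file and a few option groups taken from the command in this thread.
BASE = ["pandoc", "Post_imputation.utf8.md", "--to", "html"]
OPTIONS = [
    ["+RTS", "-K512m", "-RTS"],
    ["--self-contained"],
    ["--standalone"],
    ["--section-divs"],
    ["--no-highlight"],
    ["--mathjax"],
]

if __name__ == "__main__":
    for dropped, argv in variants_dropping_one(BASE, OPTIONS):
        print("without", " ".join(dropped), "->", shlex.join(argv))
```

If one variant stops failing, the dropped group is the first thing to look at. The same loop can be reused with different -K values in place of the first group to try larger or smaller stack sizes.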

BEFH commented 6 years ago

So I (with the help of a dedicated sysadmin) have confirmed that when pandoc segfaults immediately, it does so regardless of the simplicity of the command.

I have also confirmed that error 139 is added by GHC and means there is a segfault. Error 139 occurs with some files on the Intel processors that don't immediately segfault. It does not occur on our ancient, slow AMD processors.

Our AMD servers have 256 GB of memory, and the Intel servers have 64 GB.

We are installing Haskell Platform 8.2.2 and compiling the latest pandoc from source, as you requested. We will let you know the results when we manage to compile.
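
On the exit status 139 mentioned above: POSIX shells such as bash report 128 plus the signal number when a child process is killed by a signal, and SIGSEGV is signal 11 on Linux, so 128 + 11 = 139. That is consistent with reading "error 139" as a segfault, and with the number not appearing in pandoc's source. The helper below is a hypothetical illustration of that decoding.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name encoded in a shell exit status above 128, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # No signal with that number on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_status(139))
```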

jgm commented 6 years ago

Great. My prediction is that the natively compiled version will work. If it still segfaults, then very likely this points to a bug in GHC.

BEFH commented 6 years ago

After native compilation, it no longer segfaults. Instead, I get this:

pandoc: internal error: Unable to commit 1048576 bytes of memory
    (GHC version 8.2.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

jgm commented 6 years ago

Very strange!

You might try compiling with a different version of ghc, such as the latest, ghc 8.4.1. (Linux binaries are available.)

If that doesn't fix things, reporting as a GHC bug would be appreciated, I'm sure.

nh2 commented 6 years ago

Try running it with gdb --args and see if there's a backtrace from some C code (if you're unlucky, there is none, and it happens straight from Haskell, but if you're lucky, there's some C code in between).

jgm commented 5 years ago

Relevant ghc ticket: https://ghc.haskell.org/trac/ghc/ticket/15054 Looks like a bug still not fixed in ghc 8.6.1.

billglick commented 4 years ago

Are there any known workarounds to prevent the issue?

I'm running into it on several VMs running RHEL 6 with various memory footprints (24GB, 48GB, 52GB, 64GB, etc.) with 50% or more free memory. I can't figure out why it reports "Cannot allocate memory" on some, but works fine on others.
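
The thread does not identify a workaround. When comparing a failing machine with a working one, a first step could be to record the settings that influence whether large mmap requests succeed. The sketch below is only a diagnostic aid: the function names are hypothetical, and whether any of these values explains the failure is an assumption, not something established here. It reads the address-space limit of the current process (what `ulimit -v` controls) and a few /proc/sys/vm entries where they exist.

```python
import resource
from pathlib import Path


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_settings():
    """Collect address-space limits and VM sysctls as a dict of strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    settings = {
        "RLIMIT_AS soft": format_limit(soft),
        "RLIMIT_AS hard": format_limit(hard),
    }
    for name in ("overcommit_memory", "overcommit_ratio", "max_map_count"):
        try:
            value = (Path("/proc/sys/vm") / name).read_text().strip()
        except OSError:
            # Not Linux, or the entry is not readable in this environment.
            value = "unavailable"
        settings[f"vm.{name}"] = value
    return settings


if __name__ == "__main__":
    for key, value in memory_settings().items():
        print(f"{key}: {value}")
```

Running this on both kinds of node and diffing the output gives a concrete starting point for a GHC bug report.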

mahermassoud commented 3 years ago

Having the same issue, even when I just run the pandoc CLI command with no arguments:

$ pandoc
pandoc: internal error: Unable to commit 1048576 bytes of memory
    (GHC version 8.10.1 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Aborted (core dumped)

jgm commented 3 years ago

@mahermassoud please give more information: how exactly you installed pandoc, which version, and what architecture and OS you're using.

I note that the ghc issue linked above is still open.

mahermassoud commented 3 years ago

@jgm I believe I'm on Red Hat because yum is installed:

$ uname -a
Linux polaris.pbtech 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I'm in a totally new conda environment where I installed R and RStudio using conda install rstudio.

No idea which version it is.

Let me know if you need more info.

I got a similar issue when installing with pip install pandoc.

jgm commented 3 years ago

pip install pandoc installs a Python library, not the pandoc executable. Sorry, I can't help more without knowing more details. I have no idea what RStudio is doing to install pandoc, how their pandoc is compiled, etc.

jgm commented 1 year ago

Closing; the upstream GHC issue has been fixed for a long time.

jgm / pandoc

Linux Segfaults #4504