sjshuck / hs-pcre2

Complete Haskell binding to PCRE2
Apache License 2.0
12 stars 2 forks source link

`Data.Text.Array.new: size overflow` #34

Open amesgen opened 1 year ago

amesgen commented 1 year ago

Sorry for this report being very non-minimally reproducible, please don't feel obligated to act on this in any way.

We are using hs-pcre2 to extract some information from the Hackage Hoogle database (500MB, 70k files): relevant function, and see this script for how to execute the app (this is Nix-based and hence fully reproducible, but it should be rather easily adaptable to not rely on Nix).

At some point (it still worked in May), the app started to crash after processing a few thousand files:

extract-hackage-info: Data.Text.Array.new: size overflow
CallStack (from HasCallStack):
  error, called at src/Data/Text/Array.hs:82:20 in text-1.2.5.0-B336hruu8LkEc19BqHCzAw:Data.Text.Array

This happens both with hs-pcre2 2.1.1.1 and 2.2.1 (so text-1.2 and text-2.0), but the error message for the latter goes like this:

extract-hackage-info: Out of memory

I replaced hs-pcre2 by megaparsec in https://github.com/tweag/ormolu/pull/958, which makes the error go away, so we are not blocked on this and I didn't investigate further what exactly is going on with hs-pcre2. But maybe this is still interesting to you or is ringing some bell.

sjshuck commented 1 year ago

Thanks for the report.

I have no intuition about what the issue is. Maybe dig into uses of Text.fromPtr, Text.copy, Text.dropWhile8, Text.useAsPtr, etc.

That's great you got use out of the library for Ormolu, at least for a little while! ❤️

sjshuck commented 1 year ago

https://github.com/haskell/text/blob/2.0.1/src/Data/Text/Array.hs#L75 Just a remark: It's funny that Data.Text.Array.new takes a signed Int for size, and calls it "overflow" when it's negative.

thielema commented 4 months ago

Can the bug be reproduced using valgrind?

Bodigrim commented 4 months ago

This happens both with hs-pcre2 2.1.1.1 and 2.2.1 (so text-1.2 and text-2.0), but the error message for the latter goes like this

The way to debug it is to build text with developer flag enabled. In order to force Cabal into doing so, pin a version of text which mismatches a boot package for GHC in use.

Also pass --ghc-options='-fcheck-prim-bounds -fno-ignore-asserts'.

Also you can fork text locally and start adding HasCallStack constraints to track the provenance of the error.

I'm curious if the error is still reproducible.

amesgen commented 4 months ago

I had another look, and I can still reproduce this (as described in the issue description; but I think it can be minimized somewhat, at the very least getting rid of the huge ghc-lib-parser dependency).

Building with profiling and +RTS -xc for stack traces as well as enabling the developer flag as suggested by @Bodigrim pointed to the following issue:

Stack trace:

   *** Exception (reporting due to +RTS -xc): (THUNK_2_0), stack trace:
     Data.Text.Array.CAF
     --> evaluated by: Text.Regex.Pcre2.Internal.getErrorMessage,
     called from Text.Regex.Pcre2.Internal.check,
     called from Text.Regex.Pcre2.Internal.fix1,
     called from Text.Regex.Pcre2.Internal.matcherWithEnv,
     called from Text.Regex.Pcre2.Internal.pureUserMatcher,
     called from Text.Regex.Pcre2.TH.memoMatcher,
     called from Text.Regex.Pcre2.Internal.mapMS,
     called from Text.Regex.Pcre2.Internal._gcaptures,
     called from Text.Regex.Pcre2.TH._capturesTH,
     called from Text.Regex.Pcre2.Internal.toAlternativeOf,
     called from Text.Regex.Pcre2.TH.capturesTH,
     called from Main.extractFixitiesFromFile,
     called from Main.extractHoogleInfo,
     called from Main.main

With a few more manually inserted putStrLns, it turns out that this is the offending line: https://github.com/sjshuck/hs-pcre2/blob/2d01498056569905abc643fdffa697efbbd6c08c/src/hs/Text/Regex/Pcre2/Internal.hs#L1257 Here, cus is -29, which corresponds to PCRE2_ERROR_BADDATA, which is returned by pcre2_get_error_message here https://github.com/sjshuck/hs-pcre2/blob/2d01498056569905abc643fdffa697efbbd6c08c/src/hs/Text/Regex/Pcre2/Internal.hs#L1256 because of this:

If errorcode does not contain a recognized error code number, the negative value PCRE2_ERROR_BADDATA is returned.

This in turn is because in matcherWithEnv here https://github.com/sjshuck/hs-pcre2/blob/2d01498056569905abc643fdffa697efbbd6c08c/src/hs/Text/Regex/Pcre2/Internal.hs#L719 we check whether result is positive, and throw an exception (which prints the error message) otherwise. Now the question is why result is non-positive. In fact, it is equal to 0, as a result of a call to pcre2_match: https://github.com/sjshuck/hs-pcre2/blob/2d01498056569905abc643fdffa697efbbd6c08c/src/hs/Text/Regex/Pcre2/Internal.hs#L705-L713 From the docs (emphasis mine):

The return from pcre2_match() is one more than the highest numbered capturing pair that has been set (for example, 1 if there are no captures), zero if the vector of offsets is too small, or a negative error code for no match and other errors.

This is where I am stopping for now, as I am not at all familiar with the PCRE2 C API.


The pattern in question is

(?m)^\\s*?(?<fixDir>infix[rl]?)\\s+?(?<fixPrec>[0-9])\\s+?(?<infixOpNames>(?:[^,\\s]+\\s*?,\\s*?)*?[^,\\s]+)\\s*$

in case that is useful; I have tried to produce a minimal example of matching this pattern against the subject text, but I couldn't reproduce the error there.


BTW:

In order to force Cabal into doing so, pin a version of text which mismatches a boot package for GHC in use.

A less hacky way to accomplish this is to use the following in your cabal.project:

constraints: text source
Bodigrim commented 4 months ago

Could it be a pcre2 bug then? Looking at https://www.pcre.org/current/doc/html/pcre2demo.html, it seems we are doing everything by the book: allocate offset vector with pcre2_match_data_create_from_pattern, then pass it to pcre2_match.

sjshuck commented 4 months ago

I updated the PCRE2 submodule from 10.42 to 10.44 just now. Wonder if that makes a difference. EDIT Not with running benchmarks, it doesn't. Still failing.

amesgen commented 4 months ago

Hmm, when I build with

constraints: text source
package text
  flags: +developer

and hs-pcre2 2.1.1 from Hackage, I get

extract-hackage-info: Data.Text.Array.new: size overflow
CallStack (from HasCallStack):
  error, called at src/Data/Text/Array.hs:74:19 in text-2.0.2-f5aebd26e19e5a1c2907d840502e1f413aa9a5057431546d415964b124f1ffd2:Data.Text.Array

If I instead build from recent master commits (tried 2d01498056569905abc643fdffa697efbbd6c08c and 8ac889a7e0e26564f663f1edd2bffce9bbcb6f65, so with PCRE2 10.44 and 10.42), I instead get

malloc(): unsorted double linked list corrupted
fish: Job 1, 'cabal run extract-hackage-info …' terminated by signal SIGABRT (Abort)

(seems similar to https://github.com/sjshuck/hs-pcre2/issues/35#issuecomment-1666963101) or sometimes also

malloc(): mismatching next->prev_size (unsorted)

Maybe the investigation I posted above is a red herring, and there is some nasty undefined behavior going on in pcre2 :thinking:

Bodigrim commented 4 months ago

https://stackoverflow.com/a/28762523 suggests that both malloc() failures mean memory corruption.