mwotton / judy

Other
18 stars 4 forks source link

memory issues #4

Open mwotton opened 8 years ago

mwotton commented 8 years ago

from @ketil-malde in mail

I was a bit baffled when I suddenly started receiving error messages running some jobs with my program (kmx). After all, I've tested this rather thorougly before, so I didn't expect any bugs.


kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL malloc: failed on request for 72 bytes; message: newInCall kmx: Data.Judy: memory error with JudyL kmx: Data.Judy: memory error with JudyL

[12] exit 254 kmx count -k 32 $(ls reads/${a}.fastq) -o kmx/$a.32 rhea:..arkersFinal2015 % [5] exit 1 kmx count -k 32 $(ls reads/${a}.fastq) -o kmx/$a.32 rhea:..arkersFinal2015 % [19] exit 1 kmx count -k 32 $(ls reads/${a}.fastq) -o kmx/$a.32 rhea:..arkersFinal2015 %

[15] exit 1 kmx count -k 32 $(ls reads/${a}.fastq) -o kmx/$a.32

I first thought there must be a bug in 'insertWith' or 'toList' after all, but running the same commands on a different server with the exact same executable appeared to work just fine.

Now - clued in by the malloc failure above - I think this is caused by the overcommit policy (2 on the crashing server, 0 on the other one). Could it be that Judy here fails to check a malloc return value, and receives a memory error when trying to access unallocated memory?

(Very annoying, since the server has 640 gigs of RAM, and is hardly using any of it - but some &%¤# programmer probably relies on overcommit, and just allocates a huge heap, just in case...)

Anyway: this must probably be fixed over in C land, but I thought I'd mention it.

ketil-malde commented 7 years ago

Just ran into this again. According to previous discussion, this is caused by malloc returning zero, causing the judy C library to return PJERR (all ones), causing this library to throw an error. I now ran four sets of 32 processes, and three sets completed, while for the fourth, 30 of the processes crashed simultanously, and only two completed. Rerunning, the same two survived. If this was an out of memory issue, I'd expect it to be more stochastic.

I've no idea how to debug this any further, but just throwing this up here for the record.

mwotton commented 7 years ago

you do bring me the most interesting bugs :) i guess the next step would to be to write a fuzzer for Judy itself. I don't have time currently, but it's on my list.

ketil-malde commented 7 years ago

you do bring me the most interesting bugs :)

Ready for one more? I realize I'm probably doing something unwholesome, and don't expect you to fix things for me. But perhaps you can indulge me in the mental excercise of writing things down in a structured manner? Thanks!

So: I have a program that indexes k-mers (fixed-size words) as Word64's, storing the different ones along with their counts in a Judy array, and at the end, writing to a file. So an associative map mapping
k-word -> count.

Then, the second stage re-reads this file, and builds a histogram, tallying how often each count occurs. let's call this count -> frequency

For some time, this just seemed to work. But then I implemented an expectation-maximization algorithm to model the frequency distributions, and that failed to terminate in some cases. Going back, I found the following histogram to cause the trouble:

==> SRR901891_1.32.hist <== 865 809 866 775 867 757 868 781 869 725 870 778 871 795 872 774 873 771 48036128 2017612633061982208

Notice that last value? Apparently, there are 2.018e18 different k-mers with a frequency of 48036128. This is clearly not right.

Notice also that the latter number is 0x01c000...000. Does this carry any special meaning?

One possibility is of course some garbage in the k-mer count index, but I also have a dump command that prints the contents as text. I can find nothing wrong with that, and the highest count k-mers look like this:

% sort -nr -k2 dump | head
tgggggtccagccatggagaagagtttagaca 1670602 gggggtccagccatggagaagagtttagacac 1653146 gtctaaactcttctccatggctggacccccaa 1638158 gaatattatttgggggtccagccatggagaag 1567591 ggaatattatttgggggtccagccatggagaa 1539378 actagtgcttaggaaatctattggaggcagaa 1532596 gcttctgcctccaatagatttcctaagcacta 1531091 aggaatattatttgggggtccagccatggaga 1528806 ctagtgcttaggaaatctattggaggcagaag 1527682 aaaggaatattatttgggggtccagccatgga 1518392

One thing to look into is how the histogram is built. I have wrapped this in a function (mk_judy, code below) that returns a struct of operations, called FreqCount. mk_judy takes as an argument the number of bits to use in the index - but this isn't really used for anything (JudyL is JudyL). Yet, when I changed this parameter from 16 to 32, the last line changed to:

17180896 2017612633061982208

And with 64:

37271776 2017612633061982208

The latter number is always the same, the former ends with 0x8e0, but appears to vary in the most significant bits.

Everything else is the same in the histograms, I did a diff on the outputs.

-- Code ---------------------------------------- mk_judy :: Int -> IO FreqCount mk_judy l = do -- putStrLn ("Init judy "++show l) j <- J.new :: IO (J.JudyL Int) let ac k = k seq J.insertWith (+) k 1 j gc k = k seq fromMaybe 0 fmap J.lookup k j sc k v = J.insert k v j es = J.unsafeFreeze j >>= J.elems ks = J.unsafeFreeze j >>= J.keys as = J.unsafeFreeze j >>= J.toList return $ FreqCount { key_bits = l , add_count = ac, get_count = gc, set_count = sc , keys = ks, counts = es, assocs = as } -- Code ----------------------------------------

If we're sure there is nothing wrong with the Judy library, the only other thing I can think of is the unsafeFreeze. I'm pretty sure I don't modify the Judy array after accessing it through the "frozen" functions

33769696 2017612633061982208

Again, 0x...8e0 and the usual 0x01c0...

Anyway, that's where it stands at the moment. I'm not sure what to do, it does worry me that this counting seems to fail on occasion. Note also that the histogram gets truncated (always at the same place, after 873), and it should really go up to 1.5 million.

I'll run some more tests and see if there is some pattern I can discover. Just thought I'd share in case anything in the above gives you some ideas.

-k -- If I haven't seen further, it is by standing in the footprints of giants