pixie16 / paass

Pixie Acquisition and Analysis Software Suite
https://pixie16.github.io/paassdoc/
GNU General Public License v3.0

Utkscan and scope exit at 14% while scanning a135feb_12.ldf #228

Closed sztaylor89 closed 7 years ago

sztaylor89 commented 7 years ago

When scanning a135feb_12.ldf, utkscan runs until it reaches 14%, then consumes all of the available memory on the computer and finally exits with the message "Killed".

GDB says:

"Program terminated with signal SIGKILL, Killed. The program no longer exists." The backtrace gives "No stack."
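A SIGKILL arrives from outside the process, so GDB has no stack left to inspect. One possible way to get a usable backtrace is to cap the program's address space so that the runaway allocation fails inside the process instead. The sketch below is not part of PAASS. It assumes a Linux-like system where `RLIMIT_AS` is enforced, and the helper name `run_with_memory_cap` is hypothetical.

```python
import resource
import subprocess


def run_with_memory_cap(command, max_bytes):
    """Run `command` (a list of strings) with its address space capped.

    Assumption: the platform enforces RLIMIT_AS (Linux does).
    """

    def apply_cap():
        # Runs in the child process just before the program starts.
        # Only the soft limit is lowered; the hard limit is left alone.
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            cap = max_bytes
        else:
            cap = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

    return subprocess.run(command, preexec_fn=apply_cap)
```

Here `command` would be the scan program together with whatever arguments are normally passed to it. With the cap in place, a C++ program that keeps allocating would typically fail with an allocation error inside the process rather than being killed externally, which gives a debugger something to stop on.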

spaulaus commented 7 years ago

@sztaylor89

Can you provide us with more details on this error? What version are you running? What LDF are you analyzing? What does your configuration look like? What computer were you using?

The more information you can provide the better.

sztaylor89 commented 7 years ago

I'm on my ANL1471 branch from my forked repo. I last diverged from the dev branch at commit 976d769e904003234708ce6f40088f0ba76b05ee. The file I was using is on kqxhc under /scratch2/anl2015/FEB2015/135SB/a135feb_12.ldf

spaulaus commented 7 years ago

Have you rebased this branch onto dev recently?

sztaylor89 commented 7 years ago

Here's my config file (as txt): Config_135_121616.txt

sztaylor89 commented 7 years ago

I haven't rebased since the channel data refactoring was added.

spaulaus commented 7 years ago

I would update the branch and see if that fixes your issue. I cleaned up a number of pointer issues that could have caused this behavior. This may also be related to Issue #199, but I don't have any specifics for that at the moment.

sztaylor89 commented 7 years ago

The file also causes scope to crash at 14%.

spaulaus commented 7 years ago

@sztaylor89 Have you managed to update?

spaulaus commented 7 years ago

I have tested this with 1f5c39ead06011a9275887ba87696f3869817c7f. The issue exists, and the program starts to have problems at 14% complete. It used up all of the memory on my laptop and things became sluggish. The program didn't crash, but it consumed all of the RAM and swap space. This suggests that there is an issue with the memory allocation, as mentioned in #199.

I will close this issue report since it's now confirmed to be a duplicate and move the discussion to the other issue.

spaulaus commented 7 years ago

I have tested this with 1f5c39e using a135feb_12.ldf. The program starts to have problems at 14% complete. It used up all of the memory on my laptop and things became sluggish. The program didn't crash, but it consumed all of the RAM and swap space. This suggests that there is an issue with the memory allocation.

spaulaus commented 7 years ago

Confirmed that this issue also occurs with utkscanor. This rules out the histogramming classes as the cause, because utkscan and utkscanor use different histogramming.

Memory usage of the program: [image]

spaulaus commented 7 years ago

I have tried scanning IS599Oct_A052_02.ldf, and we get much farther than with the previous LDF, with no obvious memory-related issues. There may be some sort of corruption in a135feb_12.ldf that we are not handling properly.

spaulaus commented 7 years ago

From the DAMM histogram that was produced, it looks like the file craps out about 400 seconds into the run.

tking53 commented 7 years ago

(copied over from #199) We also discovered that scope starts eating memory at ~14% as well. So I think that whatever the code's issue with a135feb_12.ldf is, it's common to utkscan and scope. There is obviously something fishy about this file. Is there a way the unpacker (guessing) could catch this, stop, and move on? It's also curious that pixie_ldf_c didn't see this issue (correct me if I'm wrong, @sztaylor89).

spaulaus commented 7 years ago

There's an option to fast-forward if you know how many words you want to skip. You can also use rejection regions if you know how much time to skip.
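The idea behind a rejection region can be sketched as follows. This is only an illustration of the concept, not how PAASS implements or configures it. The function names, the event layout (a tuple whose first element is a time in seconds), and the region bounds are all assumptions.

```python
def in_rejection_region(time_s, regions):
    """Return True if time_s falls inside any (start, end) region, inclusive."""
    return any(start <= time_s <= end for start, end in regions)


def drop_rejected(events, regions):
    """Keep only the events whose time lies outside every rejection region.

    Each event is assumed to be a tuple whose first element is the event
    time in seconds since the start of the run.
    """
    return [event for event in events if not in_rejection_region(event[0], regions)]
```

For example, `drop_rejected(events, [(400.0, 450.0)])` would discard everything recorded between 400 and 450 seconds into the run. The bounds here are made up; in practice they would come from inspecting where the run goes bad.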

tking53 commented 7 years ago

I was thinking preemptive rather than reactive. But I guess without knowing exactly why the file is causing issues, that may not be possible.

spaulaus commented 7 years ago

We first have to identify exactly why it failed, then we can figure out how to recover from it.

ksmith0 commented 7 years ago

I would suggest trying to unpack the buffer headers with something like evtDump from evt2root v2 first and make sure it's not at the buffer level.
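A buffer-level check of this kind could look roughly like the sketch below. It is not evtDump or any PAASS code. It assumes, and this should be verified against the code that wrote the file, that an LDF is a sequence of fixed-size buffers of 8194 little-endian 32-bit words, each beginning with a four-character ASCII tag (`HEAD`, `DIR `, `DATA`, or `EOF `) followed by a size word. The function names are hypothetical.

```python
import struct

WORD_BYTES = 4
BUFFER_WORDS = 8194  # assumed buffer length in 32-bit words, header included
KNOWN_TAGS = {b"HEAD", b"DIR ", b"DATA", b"EOF "}  # assumed buffer type tags


def scan_buffer_headers(stream, buffer_words=BUFFER_WORDS):
    """Yield (index, byte_offset, tag, size_word) for each buffer in stream."""
    buffer_bytes = buffer_words * WORD_BYTES
    index = 0
    while True:
        offset = index * buffer_bytes
        stream.seek(offset)
        header = stream.read(2 * WORD_BYTES)
        if len(header) < 2 * WORD_BYTES:
            break  # end of file, or a truncated trailing buffer
        tag = header[:WORD_BYTES]
        (size_word,) = struct.unpack("<I", header[WORD_BYTES:])
        yield index, offset, tag, size_word
        index += 1


def first_suspicious_buffer(stream, buffer_words=BUFFER_WORDS):
    """Return the first header with an unknown tag or oversized size word."""
    for index, offset, tag, size_word in scan_buffer_headers(stream, buffer_words):
        if tag not in KNOWN_TAGS or size_word > buffer_words - 2:
            return index, offset, tag, size_word
    return None
```

Opening the file with `open(path, "rb")` and passing the handle to `first_suspicious_buffer` returns `None` if every buffer header looks sane under these assumptions. In that case the problem is more likely inside the buffer payloads than at the buffer level.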

spaulaus commented 7 years ago

Do we have any codes internal to PAASS that can handle something like this?

ksmith0 commented 7 years ago

Not that I know of. @cthornsb may have something, but I'm not sure it is built into PAASS.

spaulaus commented 7 years ago

@rin-yokoyama: Nice catch. I have tested this, and you are correct. That error boggled my mind for quite a while.