iLCSoft / LCIO

Linear Collider I/O
BSD 3-Clause "New" or "Revised" License

Malloc problem #71

Open gianelle opened 4 years ago

gianelle commented 4 years ago

Hi, I'm using ILCSoft for muon collider simulation/reconstruction. At the end of a long reconstruction I get this message (with the VERBOSE flag activated in LCIO):


A runtime error occured - (uncaught exception):
lcio::IOException: [SIOWriter::writeEvent] couldn't write event record to stream: Output_REC_000_slcio0
Marlin will have to be terminated, sorry.


And in the log file:

[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Allocated a 536870912(0x20000000) byte buffer
[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Allocated a 1073741824(0x40000000) byte buffer
[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Buffer allocation failed

So it seems that it is not able to allocate a 2 GB buffer.
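The 2 GB figure follows from the two logged sizes if the buffer grows by doubling. That growth rule is an assumption suggested by the log, not something the log states, and `doubling_steps` below is a hypothetical helper for the arithmetic, not an SIO function:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the sizes a buffer passes through if it doubles on
// every growth step, from `start` bytes up to at most `limit` bytes.
std::vector<std::uint64_t> doubling_steps(std::uint64_t start, std::uint64_t limit) {
    assert(start > 0 && "a zero-sized buffer never grows by doubling");
    std::vector<std::uint64_t> sizes;
    std::uint64_t size = start;
    while (size <= limit) {
        sizes.push_back(size);
        if (size > limit / 2) {
            break;  // the next doubling would pass the limit (and could overflow)
        }
        size *= 2;
    }
    return sizes;
}
```

Calling `doubling_steps(0x20000000, 0x80000000)` gives 536870912, 1073741824 and 2147483648 bytes: the two sizes in the log, followed by the 2 GiB request that would come next.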

I have tried to play with the ulimit command:

max memory size         (kbytes, -m) unlimited
stack size              (kbytes, -s) 2202000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

But it doesn't help.
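The limits can also be read from inside a process, which shows what the job itself sees rather than what the interactive shell reports. A minimal sketch using POSIX `getrlimit`; the helper names `format_limit` and `soft_limit` are hypothetical:

```cpp
#include <sys/resource.h>
#include <string>

// Hypothetical helper: print a limit value the way `ulimit` does.
// Note that getrlimit reports bytes, while `ulimit -v` and `-s` show kbytes.
std::string format_limit(rlim_t value) {
    return value == RLIM_INFINITY ? std::string("unlimited") : std::to_string(value);
}

// Hypothetical helper: soft limit of `resource` for the calling process,
// or "unknown" if the query fails.
std::string soft_limit(int resource) {
    rlimit lim{};
    if (getrlimit(resource, &lim) != 0) {
        return "unknown";
    }
    return format_limit(lim.rlim_cur);
}
```

`soft_limit(RLIMIT_AS)` corresponds to `ulimit -v`, and `soft_limit(RLIMIT_STACK)` to `ulimit -s`.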

Do you have any suggestions?

rete commented 4 years ago

Hi, it looks like you are running out of RAM. Can you:

gianelle commented 4 years ago

Hi, here is the first information.

lsmem output:

RANGE                                  SIZE  STATE REMOVABLE   BLOCK
0x0000000000000000-0x0000000007ffffff  128M online        no       0
0x0000000008000000-0x0000000027ffffff  512M online       yes     1-4
0x0000000028000000-0x0000000037ffffff  256M online        no     5-6
0x0000000038000000-0x000000003fffffff  128M online       yes       7
0x0000000040000000-0x0000000047ffffff  128M online        no       8
0x0000000048000000-0x0000000057ffffff  256M online       yes    9-10
0x0000000058000000-0x000000005fffffff  128M online        no      11
0x0000000060000000-0x000000006fffffff  256M online       yes   12-13
0x0000000070000000-0x0000000077ffffff  128M online        no      14
0x0000000078000000-0x000000007fffffff  128M online       yes      15
0x0000000080000000-0x0000000087ffffff  128M online        no      16
0x0000000088000000-0x0000000097ffffff  256M online       yes   17-18
0x0000000098000000-0x000000009fffffff  128M online        no      19
0x00000000a0000000-0x00000000a7ffffff  128M online       yes      20
0x00000000a8000000-0x00000000afffffff  128M online        no      21
0x00000000b0000000-0x00000000b7ffffff  128M online       yes      22
0x00000000b8000000-0x00000000bfffffff  128M online        no      23
0x0000000100000000-0x000000010fffffff  256M online        no   32-33
0x0000000110000000-0x000000011fffffff  256M online       yes   34-35
0x0000000120000000-0x0000000127ffffff  128M online        no      36
0x0000000128000000-0x000000013fffffff  384M online       yes   37-39
0x0000000140000000-0x0000000157ffffff  384M online        no   40-42
0x0000000158000000-0x000000016fffffff  384M online       yes   43-45
0x0000000170000000-0x000000017fffffff  256M online        no   46-47
0x0000000180000000-0x0000000187ffffff  128M online       yes      48
0x0000000188000000-0x000000018fffffff  128M online        no      49
0x0000000190000000-0x000000019fffffff  256M online       yes   50-51
0x00000001a0000000-0x00000001afffffff  256M online        no   52-53
0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
0x00000001b8000000-0x00000001c7ffffff  256M online        no   55-56
0x00000001c8000000-0x00000001e7ffffff  512M online       yes   57-60
0x00000001e8000000-0x00000001f7ffffff  256M online        no   61-62
0x00000001f8000000-0x00000001ffffffff  128M online       yes      63
0x0000000200000000-0x0000000217ffffff  384M online        no   64-66
0x0000000218000000-0x000000021fffffff  128M online       yes      67
0x0000000220000000-0x0000000257ffffff  896M online        no   68-74
0x0000000258000000-0x000000025fffffff  128M online       yes      75
0x0000000260000000-0x0000000277ffffff  384M online        no   76-78
0x0000000278000000-0x000000028fffffff  384M online       yes   79-81
0x0000000290000000-0x000000029fffffff  256M online        no   82-83
0x00000002a0000000-0x00000002afffffff  256M online       yes   84-85
0x00000002b0000000-0x00000002bfffffff  256M online        no   86-87
0x00000002c0000000-0x000000031fffffff  1,5G online       yes   88-99
0x0000000320000000-0x0000000327ffffff  128M online        no     100
0x0000000328000000-0x0000000357ffffff  768M online       yes 101-106
0x0000000358000000-0x00000003a7ffffff  1,3G online        no 107-116
0x00000003a8000000-0x00000003b7ffffff  256M online       yes 117-118
0x00000003b8000000-0x00000003bfffffff  128M online        no     119
0x00000003c0000000-0x00000003c7ffffff  128M online       yes     120
0x00000003c8000000-0x00000003dfffffff  384M online        no 121-123
0x00000003e0000000-0x0000000407ffffff  640M online       yes 124-128
0x0000000408000000-0x000000040fffffff  128M online        no     129
0x0000000410000000-0x000000041fffffff  256M online       yes 130-131
0x0000000420000000-0x0000000427ffffff  128M online        no     132
0x0000000428000000-0x000000043fffffff  384M online       yes 133-135
0x0000000440000000-0x000000044fffffff  256M online        no 136-137
0x0000000450000000-0x0000000457ffffff  128M online       yes     138
0x0000000458000000-0x000000045fffffff  128M online        no     139
0x0000000460000000-0x0000000467ffffff  128M online       yes     140
0x0000000468000000-0x0000000477ffffff  256M online        no 141-142
0x0000000478000000-0x0000000487ffffff  256M online       yes 143-144
0x0000000488000000-0x000000048fffffff  128M online        no     145
0x0000000490000000-0x0000000497ffffff  128M online       yes     146
0x0000000498000000-0x00000004b7ffffff  512M online        no 147-150
0x00000004b8000000-0x00000004bfffffff  128M online       yes     151
0x00000004c0000000-0x00000004c7ffffff  128M online        no     152
0x00000004c8000000-0x00000004e7ffffff  512M online       yes 153-156
0x00000004e8000000-0x00000004efffffff  128M online        no     157
0x00000004f0000000-0x00000004f7ffffff  128M online       yes     158
0x00000004f8000000-0x000000052fffffff  896M online        no 159-165
0x0000000530000000-0x0000000537ffffff  128M online       yes     166
0x0000000538000000-0x000000053fffffff  128M online        no     167
0x0000000540000000-0x0000000547ffffff  128M online       yes     168
0x0000000548000000-0x0000000587ffffff    1G online        no 169-176
0x0000000588000000-0x0000000597ffffff  256M online       yes 177-178
0x0000000598000000-0x00000005e7ffffff  1,3G online        no 179-188
0x00000005e8000000-0x00000005efffffff  128M online       yes     189
0x00000005f0000000-0x00000005f7ffffff  128M online        no     190
0x00000005f8000000-0x0000000607ffffff  256M online       yes 191-192
0x0000000608000000-0x0000000617ffffff  256M online        no 193-194
0x0000000618000000-0x000000061fffffff  128M online       yes     195
0x0000000620000000-0x000000062fffffff  256M online        no 196-197
0x0000000630000000-0x0000000637ffffff  128M online       yes     198
0x0000000638000000-0x000000069fffffff  1,6G online        no 199-211
0x00000006a0000000-0x00000006a7ffffff  128M online       yes     212
0x00000006a8000000-0x00000006dfffffff  896M online        no 213-219
0x00000006e0000000-0x00000006e7ffffff  128M online       yes     220
0x00000006e8000000-0x00000006efffffff  128M online        no     221
0x00000006f0000000-0x000000071fffffff  768M online       yes 222-227
0x0000000720000000-0x0000000767ffffff  1,1G online        no 228-236
0x0000000768000000-0x0000000777ffffff  256M online       yes 237-238
0x0000000778000000-0x000000077fffffff  128M online        no     239
0x0000000780000000-0x000000078fffffff  256M online       yes 240-241
0x0000000790000000-0x00000007cfffffff    1G online        no 242-249
0x00000007d0000000-0x00000007d7ffffff  128M online       yes     250
0x00000007d8000000-0x000000083fffffff  1,6G online        no 251-263

Memory block size:       128M
Total online memory:      32G
Total offline memory:      0B

Concerning anajob, I think you want to see the "size" of the input. I'm reconstructing a single signal event:

---------------------------------------------------------------------------
COLLECTION NAME               COLLECTION TYPE          NUMBER OF ELEMENTS  
===========================================================================
ECalBarrelCollection          SimCalorimeterHit             2054
ECalEndcapCollection          SimCalorimeterHit             1732
HCalBarrelCollection          SimCalorimeterHit             3372
HCalEndcapCollection          SimCalorimeterHit             3144
HCalRingCollection            SimCalorimeterHit               45
InnerTrackerBarrelCollection  SimTrackerHit                   71
InnerTrackerEndcapCollection  SimTrackerHit                   51
MCParticle                    MCParticle                     597
OuterTrackerBarrelCollection  SimTrackerHit                   91
OuterTrackerEndcapCollection  SimTrackerHit                   73
VertexBarrelCollection        SimTrackerHit                  137
VertexEndcapCollection        SimTrackerHit                   37
YokeBarrelCollection          SimCalorimeterHit                0
YokeEndcapCollection          SimCalorimeterHit                0
---------------------------------------------------------------------------

with 1000 bkg events like this one:

---------------------------------------------------------------------------
COLLECTION NAME               COLLECTION TYPE          NUMBER OF ELEMENTS
===========================================================================
ECalBarrelCollection          SimCalorimeterHit             4098
ECalEndcapCollection          SimCalorimeterHit             1220
HCalBarrelCollection          SimCalorimeterHit              677
HCalEndcapCollection          SimCalorimeterHit             4342
HCalRingCollection            SimCalorimeterHit               62
InnerTrackerBarrelCollection  SimTrackerHit                  228
InnerTrackerEndcapCollection  SimTrackerHit                  178
MCParticle                    MCParticle                   25308
OuterTrackerBarrelCollection  SimTrackerHit                  483
OuterTrackerEndcapCollection  SimTrackerHit                  330
VertexBarrelCollection        SimTrackerHit                  386
VertexEndcapCollection        SimTrackerHit                  144
YokeBarrelCollection          SimCalorimeterHit                0
YokeEndcapCollection          SimCalorimeterHit                0
---------------------------------------------------------------------------

For the last bullet I need to wait ten hours, the time required by the reconstruction job. In any case, I have also started a script to monitor memory usage.

Thanks for your help.

ale

rete commented 4 years ago

Hi,

overlaying 1000 times this amount of background is quite a lot. I guess you are reaching the OS limits on memory allocation by malloc. See: http://www.qnx.com/developers/docs/6.5.0SP1.update/com.qnx.doc.neutrino_lib_ref/m/malloc.html

In particular:

Because the malloc() implementation uses signed, 32-bit integers to represent the size internally, you can't allocate more than 2 GB in a single allocation. If the size is greater than 2 GB, malloc() indicates an error of ENOMEM.

If you know which event in particular it is, you can try to start your job at that event and, before writing the event with the LCIOOutputProcessor, run a dumpEvent in the code (not dumpEventDetailed; the output would be too long).

gaede commented 4 years ago

Another option might be to drop the MCParticle collection of the overlay events - this is what we do for the e+e- pair background overlay in ILC...

gianelle commented 4 years ago

This is the plot of the RSS; the machine has 32 GB of RAM + 4 GB of swap.

[Image: mem]
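The monitoring script itself is not shown in this thread. As a minimal sketch of one way to sample the RSS of a running job on Linux, the `VmRSS` line of `/proc/<pid>/status` can be read periodically; the function names below are hypothetical:

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the VmRSS value in kB from text in /proc/<pid>/status format,
// or -1 if no parsable VmRSS line is found.
long parse_vm_rss_kb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // line starts with "VmRSS:"
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;
            return fields ? kb : -1;
        }
    }
    return -1;
}

// Reads the current RSS of process `pid` (Linux only); -1 if unavailable.
long rss_kb_of(int pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    return parse_vm_rss_kb(status);
}
```

Calling `rss_kb_of` on the Marlin process ID at fixed intervals and logging the result gives the data for a plot like the one above.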

@rete I also suspect that there is an OS limitation on memory allocation.

@gaede I'm trying your option

The problem is that this is only a test. In production I'll have 1 signal event plus about 15k bkg events like the ones I have posted, so I probably need a "more robust" workaround. In any case I'll let you know whether @gaede's suggestion is enough.

thanks

andresailer commented 4 years ago

Is this still a problem? As the SIO package was re-written, maybe try with the latest version of LCIO/SIO?

rete commented 4 years ago

I guess it will give the same error. The implementation is similar in the new SIO.

andresailer commented 4 years ago

But as far as I can tell, malloc is no longer called with an unsigned int (maximum 4 GB, though). I can't even tell if malloc is called at all, but I think size_t types are used now (you can probably confirm?).

The documentation you link above is only for a particular implementation. In principle, allocation should work for larger blocks: https://stackoverflow.com/questions/21132238/allocation-of-more-than-2gb-fails-on-64-bit-binary
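Whether a given machine accepts a single allocation above 2 GB can be checked directly instead of inferred from documentation. A minimal sketch, where `can_malloc` is a hypothetical probe and not part of LCIO:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical probe: true if a single malloc() of `bytes` succeeds.
// Only the first byte is written, so on systems that overcommit memory a
// success shows that the request was accepted, not that every page could
// be backed by RAM.
bool can_malloc(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p == nullptr) {
        return false;
    }
    // Volatile write so the compiler cannot optimize the allocation away.
    static_cast<volatile char*>(p)[0] = 0;
    std::free(p);
    return true;
}
```

On a 64-bit build, comparing `can_malloc(std::size_t{1} << 30)` with `can_malloc(std::size_t{1} << 31)` on the reconstruction machine would show whether the 2 GiB step fails in malloc itself.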

But unsigned int was used before: https://github.com/iLCSoft/LCIO/blob/05e5bb12e9baea08c6932b8652a17f932d96e736/sio/src/SIO_stream.cc#L483 https://github.com/iLCSoft/LCIO/blob/05e5bb12e9baea08c6932b8652a17f932d96e736/sio/include/SIO_stream.h#L130
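To illustrate what the type of the size variable changes, here is a minimal sketch of size doubling in 32-bit and 64-bit unsigned arithmetic. The fixed-width types stand in for `unsigned int` and `size_t`, assuming the usual widths on 64-bit Linux; the function names are hypothetical and this is not the SIO code:

```cpp
#include <cstdint>

// Doubling in a 32-bit unsigned type: the result is reduced modulo 2^32.
std::uint32_t doubled_32(std::uint32_t size) {
    return size * 2u;
}

// Doubling in a 64-bit unsigned type: sizes of 4 GiB and above are representable.
std::uint64_t doubled_64(std::uint64_t size) {
    return size * 2u;
}
```

With 32 bits, doubling the logged 0x40000000 still gives a valid 0x80000000 (2 GiB), and only the following step wraps to 0. With 64 bits, doubling 0x80000000 gives 0x100000000 (4 GiB).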

So I hope we don't have to guess.

gianelle commented 4 years ago

Unfortunately, this does not seem to be enough to fix the problem.