Open gianelle opened 4 years ago
Hi, it looks like you are running out of RAM. Can you:
- run lsmem and post the result here,
- run anajob file.slcio and post the content of the first event,
- watch the memory usage with htop / top.
Hi, here is the first information.
lsmem output:
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x0000000007ffffff 128M online no 0
0x0000000008000000-0x0000000027ffffff 512M online yes 1-4
0x0000000028000000-0x0000000037ffffff 256M online no 5-6
0x0000000038000000-0x000000003fffffff 128M online yes 7
0x0000000040000000-0x0000000047ffffff 128M online no 8
0x0000000048000000-0x0000000057ffffff 256M online yes 9-10
0x0000000058000000-0x000000005fffffff 128M online no 11
0x0000000060000000-0x000000006fffffff 256M online yes 12-13
0x0000000070000000-0x0000000077ffffff 128M online no 14
0x0000000078000000-0x000000007fffffff 128M online yes 15
0x0000000080000000-0x0000000087ffffff 128M online no 16
0x0000000088000000-0x0000000097ffffff 256M online yes 17-18
0x0000000098000000-0x000000009fffffff 128M online no 19
0x00000000a0000000-0x00000000a7ffffff 128M online yes 20
0x00000000a8000000-0x00000000afffffff 128M online no 21
0x00000000b0000000-0x00000000b7ffffff 128M online yes 22
0x00000000b8000000-0x00000000bfffffff 128M online no 23
0x0000000100000000-0x000000010fffffff 256M online no 32-33
0x0000000110000000-0x000000011fffffff 256M online yes 34-35
0x0000000120000000-0x0000000127ffffff 128M online no 36
0x0000000128000000-0x000000013fffffff 384M online yes 37-39
0x0000000140000000-0x0000000157ffffff 384M online no 40-42
0x0000000158000000-0x000000016fffffff 384M online yes 43-45
0x0000000170000000-0x000000017fffffff 256M online no 46-47
0x0000000180000000-0x0000000187ffffff 128M online yes 48
0x0000000188000000-0x000000018fffffff 128M online no 49
0x0000000190000000-0x000000019fffffff 256M online yes 50-51
0x00000001a0000000-0x00000001afffffff 256M online no 52-53
0x00000001b0000000-0x00000001b7ffffff 128M online yes 54
0x00000001b8000000-0x00000001c7ffffff 256M online no 55-56
0x00000001c8000000-0x00000001e7ffffff 512M online yes 57-60
0x00000001e8000000-0x00000001f7ffffff 256M online no 61-62
0x00000001f8000000-0x00000001ffffffff 128M online yes 63
0x0000000200000000-0x0000000217ffffff 384M online no 64-66
0x0000000218000000-0x000000021fffffff 128M online yes 67
0x0000000220000000-0x0000000257ffffff 896M online no 68-74
0x0000000258000000-0x000000025fffffff 128M online yes 75
0x0000000260000000-0x0000000277ffffff 384M online no 76-78
0x0000000278000000-0x000000028fffffff 384M online yes 79-81
0x0000000290000000-0x000000029fffffff 256M online no 82-83
0x00000002a0000000-0x00000002afffffff 256M online yes 84-85
0x00000002b0000000-0x00000002bfffffff 256M online no 86-87
0x00000002c0000000-0x000000031fffffff 1,5G online yes 88-99
0x0000000320000000-0x0000000327ffffff 128M online no 100
0x0000000328000000-0x0000000357ffffff 768M online yes 101-106
0x0000000358000000-0x00000003a7ffffff 1,3G online no 107-116
0x00000003a8000000-0x00000003b7ffffff 256M online yes 117-118
0x00000003b8000000-0x00000003bfffffff 128M online no 119
0x00000003c0000000-0x00000003c7ffffff 128M online yes 120
0x00000003c8000000-0x00000003dfffffff 384M online no 121-123
0x00000003e0000000-0x0000000407ffffff 640M online yes 124-128
0x0000000408000000-0x000000040fffffff 128M online no 129
0x0000000410000000-0x000000041fffffff 256M online yes 130-131
0x0000000420000000-0x0000000427ffffff 128M online no 132
0x0000000428000000-0x000000043fffffff 384M online yes 133-135
0x0000000440000000-0x000000044fffffff 256M online no 136-137
0x0000000450000000-0x0000000457ffffff 128M online yes 138
0x0000000458000000-0x000000045fffffff 128M online no 139
0x0000000460000000-0x0000000467ffffff 128M online yes 140
0x0000000468000000-0x0000000477ffffff 256M online no 141-142
0x0000000478000000-0x0000000487ffffff 256M online yes 143-144
0x0000000488000000-0x000000048fffffff 128M online no 145
0x0000000490000000-0x0000000497ffffff 128M online yes 146
0x0000000498000000-0x00000004b7ffffff 512M online no 147-150
0x00000004b8000000-0x00000004bfffffff 128M online yes 151
0x00000004c0000000-0x00000004c7ffffff 128M online no 152
0x00000004c8000000-0x00000004e7ffffff 512M online yes 153-156
0x00000004e8000000-0x00000004efffffff 128M online no 157
0x00000004f0000000-0x00000004f7ffffff 128M online yes 158
0x00000004f8000000-0x000000052fffffff 896M online no 159-165
0x0000000530000000-0x0000000537ffffff 128M online yes 166
0x0000000538000000-0x000000053fffffff 128M online no 167
0x0000000540000000-0x0000000547ffffff 128M online yes 168
0x0000000548000000-0x0000000587ffffff 1G online no 169-176
0x0000000588000000-0x0000000597ffffff 256M online yes 177-178
0x0000000598000000-0x00000005e7ffffff 1,3G online no 179-188
0x00000005e8000000-0x00000005efffffff 128M online yes 189
0x00000005f0000000-0x00000005f7ffffff 128M online no 190
0x00000005f8000000-0x0000000607ffffff 256M online yes 191-192
0x0000000608000000-0x0000000617ffffff 256M online no 193-194
0x0000000618000000-0x000000061fffffff 128M online yes 195
0x0000000620000000-0x000000062fffffff 256M online no 196-197
0x0000000630000000-0x0000000637ffffff 128M online yes 198
0x0000000638000000-0x000000069fffffff 1,6G online no 199-211
0x00000006a0000000-0x00000006a7ffffff 128M online yes 212
0x00000006a8000000-0x00000006dfffffff 896M online no 213-219
0x00000006e0000000-0x00000006e7ffffff 128M online yes 220
0x00000006e8000000-0x00000006efffffff 128M online no 221
0x00000006f0000000-0x000000071fffffff 768M online yes 222-227
0x0000000720000000-0x0000000767ffffff 1,1G online no 228-236
0x0000000768000000-0x0000000777ffffff 256M online yes 237-238
0x0000000778000000-0x000000077fffffff 128M online no 239
0x0000000780000000-0x000000078fffffff 256M online yes 240-241
0x0000000790000000-0x00000007cfffffff 1G online no 242-249
0x00000007d0000000-0x00000007d7ffffff 128M online yes 250
0x00000007d8000000-0x000000083fffffff 1,6G online no 251-263
Memory block size: 128M
Total online memory: 32G
Total offline memory: 0B
As for anajob, I think you want to see the "size" of the input. I'm reconstructing a single signal event:
---------------------------------------------------------------------------
COLLECTION NAME COLLECTION TYPE NUMBER OF ELEMENTS
===========================================================================
ECalBarrelCollection SimCalorimeterHit 2054
ECalEndcapCollection SimCalorimeterHit 1732
HCalBarrelCollection SimCalorimeterHit 3372
HCalEndcapCollection SimCalorimeterHit 3144
HCalRingCollection SimCalorimeterHit 45
InnerTrackerBarrelCollection SimTrackerHit 71
InnerTrackerEndcapCollection SimTrackerHit 51
MCParticle MCParticle 597
OuterTrackerBarrelCollection SimTrackerHit 91
OuterTrackerEndcapCollection SimTrackerHit 73
VertexBarrelCollection SimTrackerHit 137
VertexEndcapCollection SimTrackerHit 37
YokeBarrelCollection SimCalorimeterHit 0
YokeEndcapCollection SimCalorimeterHit 0
---------------------------------------------------------------------------
with 1000 bkg events like this one:
---------------------------------------------------------------------------
COLLECTION NAME COLLECTION TYPE NUMBER OF ELEMENTS
===========================================================================
ECalBarrelCollection SimCalorimeterHit 4098
ECalEndcapCollection SimCalorimeterHit 1220
HCalBarrelCollection SimCalorimeterHit 677
HCalEndcapCollection SimCalorimeterHit 4342
HCalRingCollection SimCalorimeterHit 62
InnerTrackerBarrelCollection SimTrackerHit 228
InnerTrackerEndcapCollection SimTrackerHit 178
MCParticle MCParticle 25308
OuterTrackerBarrelCollection SimTrackerHit 483
OuterTrackerEndcapCollection SimTrackerHit 330
VertexBarrelCollection SimTrackerHit 386
VertexEndcapCollection SimTrackerHit 144
YokeBarrelCollection SimCalorimeterHit 0
YokeEndcapCollection SimCalorimeterHit 0
---------------------------------------------------------------------------
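To get a feeling for what the overlay adds up to, the per-event counts above can be scaled by the number of overlaid events. This is only a rough sketch: it assumes every background event looks like the one posted, and element counts are not bytes, so it says nothing precise about the size of the output buffer. The names BKG_EVENT and overlay_counts are made up for the illustration.

```python
# Element counts per collection, copied from the anajob dump of one background event.
BKG_EVENT = {
    "ECalBarrelCollection": 4098,
    "ECalEndcapCollection": 1220,
    "HCalBarrelCollection": 677,
    "HCalEndcapCollection": 4342,
    "HCalRingCollection": 62,
    "InnerTrackerBarrelCollection": 228,
    "InnerTrackerEndcapCollection": 178,
    "MCParticle": 25308,
    "OuterTrackerBarrelCollection": 483,
    "OuterTrackerEndcapCollection": 330,
    "VertexBarrelCollection": 386,
    "VertexEndcapCollection": 144,
    "YokeBarrelCollection": 0,
    "YokeEndcapCollection": 0,
}


def overlay_counts(per_event, n_overlay):
    """Scale per-event element counts by the number of overlaid events.

    Assumption: all overlaid events have the same content as `per_event`.
    """
    return {name: count * n_overlay for name, count in per_event.items()}


def mcparticle_fraction(per_event):
    """Fraction of all elements in one event that are MCParticle entries."""
    return per_event["MCParticle"] / sum(per_event.values())


scaled = overlay_counts(BKG_EVENT, 1000)
print("MCParticle entries after overlay:", scaled["MCParticle"])
print("All elements after overlay:      ", sum(scaled.values()))
print("MCParticle share per event:      ", round(mcparticle_fraction(BKG_EVENT), 3))
```

With these numbers MCParticle accounts for roughly two thirds of all elements in a background event.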
For the last bullet I need to wait ten hours, the time required by the reconstruction job. In any case, I have also started a script to monitor the memory usage.
Thanks for your help.
ale
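A monitoring script of this kind could look like the sketch below. It is not the script used in this thread, just a minimal Linux-only illustration that reads the VmRSS field from /proc/<pid>/status at fixed intervals. The function names, the sampling interval, and the number of samples are all assumptions.

```python
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def monitor_rss(pid, interval_s=60.0, samples=10):
    """Sample the resident set size of process `pid` (Linux only).

    Returns a list of (unix_time, rss_kb) tuples and stops early
    if the process no longer exists.
    """
    history = []
    for _ in range(samples):
        try:
            with open(f"/proc/{pid}/status") as status_file:
                rss_kb = parse_vmrss_kb(status_file.read())
        except FileNotFoundError:
            break  # the process has exited
        history.append((time.time(), rss_kb))
        time.sleep(interval_s)
    return history
```

Calling monitor_rss with the PID of the running Marlin job and writing the returned list to a file gives the data for an RSS-versus-time plot.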
Hi,
overlaying 1000 times this amount of background is a lot. I guess you are hitting the OS limits on memory allocation by malloc.
See: http://www.qnx.com/developers/docs/6.5.0SP1.update/com.qnx.doc.neutrino_lib_ref/m/malloc.html
In particular:
Because the malloc() implementation uses signed, 32-bit integers to represent the size internally, you can't allocate more than 2 GB in a single allocation. If the size is greater than 2 GB, malloc() indicates an error of ENOMEM.
If you know which event it is in particular, you can try to start your job at this event and, before writing the event with the LCIOOutputProcessor, you could run a dumpEvent in the code (not dumpEventDetailed, the output will be too long).
Another option might be to drop the MCParticle collection of the overlay events - this is what we do for the e+e- pair background overlay in ILC...
This is the plot of the RSS; the machine has 32 GB of RAM + 4 GB of swap.
@rete I also suspect that there is a limitation due to the OS on memory allocation.
@gaede I'm trying your option.
The problem is that this is only a test: in production I'll have 1 signal event plus about 15k bkg events like the ones I have posted, so I probably need a more robust workaround. In any case, I'll let you know whether @gaede's suggestion is enough.
thanks
Is this still a problem? As the SIO package was re-written, maybe try with the latest version of LCIO/SIO?
I guess it will give the same error. The implementation is similar in the new SIO.
But as far as I can tell, malloc is no longer called with an unsigned int (which would still be limited to 4 GB). I can't even tell whether malloc is called at all, but I think size_t types are used now (you can probably confirm?).
The documentation you link above is only for a particular implementation; in principle, allocation should work for larger blocks: https://stackoverflow.com/questions/21132238/allocation-of-more-than-2gb-fails-on-64-bit-binary
But unsigned int was used before: https://github.com/iLCSoft/LCIO/blob/05e5bb12e9baea08c6932b8652a17f932d96e736/sio/src/SIO_stream.cc#L483 https://github.com/iLCSoft/LCIO/blob/05e5bb12e9baea08c6932b8652a17f932d96e736/sio/include/SIO_stream.h#L130
So I hope we don't have to guess.
Unfortunately, this does not seem to be enough to fix the problem.
Hi, I'm using ILCSoft for muon collider simulation/reconstruction. At the end of a long reconstruction I get this message (with the VERBOSE flag activated in LCIO):
A runtime error occured - (uncaught exception): lcio::IOException: [SIOWriter::writeEvent] couldn't write event record to stream: Output_REC_000_slcio0 Marlin will have to be terminated, sorry.
And in the log file:
[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Allocated a 536870912(0x20000000) byte buffer
[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Allocated a 1073741824(0x40000000) byte buffer
[ VERBOSE "Output_REC"] SIO: [Output_REC_000_slcio0/LCEvent/] Buffer allocation failed
So it seems that it is not able to allocate a 2 GB buffer.
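The two successful allocations in the log are consecutive powers of two. Assuming the buffer is doubled on each growth (which the log suggests but does not prove), the sketch below lists the next sizes and checks whether each one still fits in a signed or unsigned 32-bit integer. It does not reproduce the SIO code; the helper names are invented for the illustration.

```python
INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit integer
UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def doubling_sizes(start, steps):
    """Buffer sizes obtained by doubling `start` repeatedly (assumed growth policy)."""
    return [start << i for i in range(steps)]


def fits(size, limit):
    """True if `size` can be represented by an integer type with maximum `limit`."""
    return size <= limit


# Start from the first size reported in the log, 0x20000000 bytes.
for size in doubling_sizes(0x20000000, 4):
    print(
        f"{size:>11} (0x{size:x})"
        f"  fits int32: {fits(size, INT32_MAX)}"
        f"  fits uint32: {fits(size, UINT32_MAX)}"
    )
```

Under that assumption the request following 0x40000000 would be 0x80000000 bytes, which is one more than a signed 32-bit integer can hold, and the one after that would exceed an unsigned 32-bit integer as well.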
I have tried to play with the ulimit command:
max memory size (kbytes, -m) unlimited
stack size (kbytes, -s) 2202000
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
But it doesn't help.
Do you have any suggestions?
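For reference, the same limits can be read from inside a process, which shows what the job actually inherits from the shell. This is a minimal sketch using Python's standard resource module (Unix only). The labels mirror the ulimit output above, and the values are reported in bytes rather than kbytes.

```python
import resource

# Map the ulimit labels quoted above to the corresponding rlimit constants.
LIMITS = {
    "max memory size (RLIMIT_RSS)": resource.RLIMIT_RSS,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "virtual memory (RLIMIT_AS)": resource.RLIMIT_AS,
}


def format_limit(value):
    """Render a limit the way ulimit does, with 'unlimited' for no limit."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} with the limits in bytes, as seen by this process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={format_limit(soft)} hard={format_limit(hard)}")
```

If all three come back as unlimited, the failed allocation is unlikely to be caused by a per-process limit set in the shell.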