catalog_traces seems have memory leak

weiliu620 commented 8 years ago

When I read a large segy file (~130 GB), I found python memory usage keep increasing (from top command) even at the catalog_traces stage. It once reaches to 30 GB memory usage before I kill the job.

From the code, the catalog_traces read each trace from segy data and then construct a few python data structures for inline/xline numbers etc. I don't have the experience of profiling python memory usage, so I was stuck here.

I can provide more information if it helps.

wassname commented 8 years ago

I had this problem too. If I remember correctly it reads the whole file to build a catalogue so it's slow an memory intensive. One way around it is to assume it has a fixed trace layout (often the case) and then read one trace at a time.

I did this here by extending the segpy classes: https://gist.github.com/wassname/4bd878e4d24e27a6bbaedfff4a4e7b37. This is for segpy commit: 91562fddfd6d8424ee4161f4417982243512c150. To use it call create_read_writer with fast=True. Then it quickly reads the binary header and builds a simple fixed length trace catalogue.

Then you can edit the file in place too (make a copy first) using write_trace_header, write_binary_reel_header, etc from the SegyWriter class.

Another way would be to change the way it catalogues traces, but I never tried that.

rob-smallshire commented 8 years ago

@weiliu620 I'm fairly certain there is no memory leak.

segpy is building various indexes into your file to permit random access to traces. If your data really does have a regular geometry segpy will prove this to its satisfaction and then compress the index down to a tiny representation which it will save on disk for loading next time this file is loaded. What this means is that if you can wait for the file to load at least once, it will load much faster on subsequent attempts.

That said, @wassname is right that it's possible to do better if you know something about the geometry of the data in advance. Either you can use something like @wassname's solution, or trace through create_reader() with a debugger or print statements and provide the required indexes to the SegYReader3D initializer.

Rather than adding yet more arguments to create_reader() I'd prefer to provide alternative factory functions which assume certain SEG Y geometries such, create_reader_3d() specifically for 3D files, or create_reader_3d_complete() for 3D files containing all possible crosslines, inlines and samples.

If you send me the first 10 kB of your 130 GB file (use: head -c 10000 your.segy > your.headers) I will be able to cook something up which works for you, but there's no guarantee when, or even if, I'll find the time. If this enhancement is time critical to you, consider contacting me at rob@sixty-north.com so we can discuss using our contracting services to get this implemented quickly.

weiliu620 commented 8 years ago

Sorry for being late to reply. Thanks for the inputs! Unfortunately I haven't got a chance to further investigate this issue due to project priority change. I'll close it for now and maybe re-open it later when I revisit it.

sixty-north / segpy

catalog_traces seems have memory leak #29