sixty-north / segpy

A Python package for reading and writing SEG Y files.

Initial read and subsequently reading headers is slow #43

Closed pareantoine closed 7 years ago

pareantoine commented 7 years ago

Hi there,

I am fairly new to Python and even newer to Segpy. I am using Segpy on a VSP dataset rather than surface seismic, and I find the initial loading fairly slow (I can't see any reason why VSP rather than surface seismic would make much of a difference): roughly 230 seconds for a 600 MB file, compared to 30 seconds with Obspy.

Once loaded, I find reading one specific header across all traces just as slow. If I wanted, for example, to plot the source X and Y coordinates of all traces in the SEG-Y file, I would pull the X coordinates with the following code: `np.array([segy_reader.trace_header(trace_index).source_x for trace_index in segy_reader.trace_indexes()])`

Again, this takes not far off 230 seconds, when I would have imagined that the headers should already be in memory and we wouldn't need to load them again. (With Obspy it's almost instant.)
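One way to avoid paying the header-read cost once per attribute is to walk the trace headers a single time and cache the fields of interest in NumPy arrays. A minimal sketch, assuming only the `trace_indexes()` and `trace_header()` methods mentioned above; the `FakeReader`/`FakeHeader` classes below are stand-ins for a real segpy reader, used here so the snippet is self-contained:

```python
import numpy as np

def gather_source_coords(reader):
    """Read each trace header once and cache source X/Y as NumPy arrays."""
    headers = [reader.trace_header(i) for i in reader.trace_indexes()]
    xs = np.array([h.source_x for h in headers])
    ys = np.array([h.source_y for h in headers])
    return xs, ys

# Stand-in objects for demonstration only; a real segpy reader
# exposes the same two methods used above.
class FakeHeader:
    def __init__(self, x, y):
        self.source_x = x
        self.source_y = y

class FakeReader:
    def trace_indexes(self):
        return range(3)
    def trace_header(self, i):
        return FakeHeader(100 + i, 200 + i)

xs, ys = gather_source_coords(FakeReader())
print(xs.tolist(), ys.tolist())  # → [100, 101, 102] [200, 201, 202]
```

With the arrays cached, plotting or filtering by coordinate is then a cheap in-memory operation rather than a second pass over the file.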

I like Segpy a lot because it's much faster to read the actual samples of a trace than Obspy and it seems much easier to modify the samples and then save it, although I haven't tried it yet.

I would appreciate any help, as I'm planning to scale this code up from SEG-Y files with roughly 300,000 traces to files with millions.

rob-smallshire commented 7 years ago

Are you able to share your SEG Y file?

On my computer (a MacBook Pro), the initial read of a 700 MB SEG Y file containing >120,000 traces takes about 7 seconds. Subsequent reads, once the index has been cached, take about 1 second.

pareantoine commented 7 years ago

Unfortunately I cannot share the SEG-Y file.

I've tried on a MacBook Pro and on a PC, and both struggle: 300 seconds to load the file, but only 9 seconds to go through all the traces. I am not loading surface seismic data but borehole seismic data (VSP), which has slightly different headers and no inline/crossline numbers or proper shotpoints.

I wonder if it is slow because segpy reads it as a 3D seismic volume, since the dimensionality of the segy_reader is 3.

How can I force create_reader to read my SEG-Y file as a SegYReader and not a SegYReader3D object?

rob-smallshire commented 7 years ago

I've submitted an issue on your own repo with some suggestions which should help a lot: https://github.com/pareantoine/VSP-Processing/issues/1

rob-smallshire commented 7 years ago

"How can I force create_reader to read my SEG-Y file as a SegYReader and not a SegYReader3D object?"

This won't affect performance; it just means you get some extra functionality on the resulting SegYReader object, which is useless but harmless in your case.

If you can submit another issue we can deal with this separately from your performance concerns.

pareantoine commented 7 years ago

Hi Rob, thanks a lot for your answers.

I might run some tests with the toolkit to figure out which step of the catalog_traces function is killing it for my SEG-Y file, as it's clearly the one taking a very long time.
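The standard library's profiler is a quick way to do exactly this kind of test. A generic sketch: `slow_index_step` below is only a placeholder for the suspect call (e.g. invoking segpy's indexing path on the file), kept synthetic here so the snippet runs on its own:

```python
import cProfile
import io
import pstats

def slow_index_step():
    # Placeholder for the suspect call, e.g. the catalog-building pass.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_index_step()
profiler.disable()

# Print the ten functions with the largest cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

Sorting by cumulative time surfaces which step inside the indexing call dominates, without modifying segpy itself.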

I'll come back to you if I find anything.

rob-smallshire commented 7 years ago

That will probably be fruitful. It's likely that in your case segpy is doing lots of unnecessary work when it tries to index the data. A plain SegYReader only needs a trace_offset_catalog and a trace_length_catalog, so a cut-down version of the existing, very generic code path could make a big difference. Segpy has two APIs: the higher-level "reader/writer" API and a lower-level API defined mostly in segpy.toolkit. You may be able to figure out which low-level calls are actually needed.

pareantoine commented 7 years ago

Rob, the solution was fairly simple. While going through the code on my laptop and on GitHub, I noticed minor differences. I removed segpy from my environment and reinstalled it directly from the GitHub repository. It now takes 33 seconds.

I believe that running `pip install segpy` doesn't actually install the latest version available on GitHub.
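For reference, pip can install straight from a Git repository, which sidesteps a stale PyPI release. This is standard pip VCS syntax, using the repository named at the top of this page (a one-off command fragment, not part of the thread):

```shell
pip install --upgrade git+https://github.com/sixty-north/segpy.git
```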

rob-smallshire commented 7 years ago

Great. I'll push a new version to PyPI soon to be installed with pip.

Did you make the other changes I suggested too?

pareantoine commented 7 years ago

No, I haven't had much time to look at the rest of my code, but I'll do it soon and test it. I'll also add an adapted trace header format that fits VSP data better.

rob-smallshire commented 7 years ago

What is the Data Sample Format for your data? If it's IBM Float we have a C++ plugin which makes reading that format about 10x faster.

pareantoine commented 7 years ago

I've got both IEEE and IBM, I'll test the difference in loading time.

pareantoine commented 7 years ago

Well, strangely enough, I've got two datasets with the same number of traces, sample count, and size (slightly different headers and data). 5 seconds to load the IEEE float32 one, 35 seconds for the IBM float32 one...

edit: this seems to vary; I just restarted the kernel and loaded the IBM file in 3 seconds.

rob-smallshire commented 7 years ago

Hmm. You should expect reading IBM trace samples to be substantially slower than reading IEEE samples, but the times for the headers should be the same.
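The reason IBM samples cost more is the format itself: IEEE float32 can be read directly into memory, whereas IBM System/360 hexadecimal floats (base-16 exponent, no implicit leading bit) have to be converted value by value. A minimal pure-Python decoder, shown here only to illustrate the per-sample work involved, not segpy's actual implementation:

```python
def ibm32_to_float(bits):
    """Decode a 32-bit IBM System/360 hexadecimal float.

    Layout: 1 sign bit, 7-bit base-16 exponent (bias 64),
    24-bit fraction with the radix point to its left.
    """
    sign = -1.0 if bits & 0x80000000 else 1.0
    exponent = (bits >> 24) & 0x7F
    fraction = (bits & 0x00FFFFFF) / float(1 << 24)
    return sign * fraction * 16.0 ** (exponent - 64)

# Classic worked example: 0xC276A000 encodes -118.625.
print(ibm32_to_float(0xC276A000))  # → -118.625
print(ibm32_to_float(0x41100000))  # → 1.0
```

Doing this in a Python loop per sample is exactly why a compiled extension for IBM float conversion, like the C++ plugin mentioned above, gives such a large speedup.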