trhallam / segysak

SEGY Swiss Army Knife for Seismic Data
https://trhallam.github.io/segysak/
GNU General Public License v3.0

Direct convert: segy to netcdf #121

Closed etotten93 closed 6 months ago

etotten93 commented 9 months ago

Hi @trhallam, the ***.seisio.to_netcdf command is very useful once a volume has already been loaded with segy_loader; however, I hit a memory error when trying to load above a certain number of traces: "Unable to allocate 3.67 TiB for an array with shape (202, 448, 5001, 2228) and data type float32".

I am wondering if segysak has a command to directly convert the full/original segy file without needing to use segy_loader first (some of my SEGY files are > 50GB in size).

Many thanks in advance!

trhallam commented 9 months ago

The converter function or CLI should do what you want.

https://segysak.readthedocs.io/en/latest/generated/segysak.segy.segy_converter.html#segysak-segy-segy-converter

Note that converting to netcdf might inflate the size of your file.
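For reference, a minimal direct-conversion call looks roughly like the following (the filenames are placeholders and the byte locations are illustrative; match them to your own trace headers, e.g. from segy_header_scan):

from segysak.segy import segy_converter

# hypothetical filenames; byte locations must match your survey's trace headers
segy_converter(
    "big_volume.sgy",        # input SEGY
    "big_volume.seisnc",     # output netcdf (seisnc) file, written trace by trace
    iline=189, xline=193,    # inline/crossline byte locations
    cdpx=181, cdpy=185,      # CDP X/Y byte locations
)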

etotten93 commented 9 months ago

Brilliant - this converted quite quickly. From here, is it still possible to scan and/or scrape headers on the new netcdf file, as I did previously on the segy files with segy_header_scan and segy_header_scrape? Or do I need to call a new set of functions?

Many thanks!

trhallam commented 9 months ago

Any header fields you didn't specify are not converted to data variables in netcdf. If you want additional byte fields you should look at the extra_byte_fields keyword argument for the converter.

Given the size of the data you are working with, this example might be useful for you:

https://segysak.readthedocs.io/en/latest/examples/example_segysak_dask.html

seisnc netcdf files should be opened with the open_seisnc function to properly account for some attributes that are added to the seisnc format. Dask will allow you to perform lazy loading, otherwise Python will be greedy and try to load the volume into memory.
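As a rough sketch of that pattern, following the linked dask example (the chunk size and filename are illustrative):

from segysak import open_seisnc

# open lazily with dask-backed chunks instead of reading the whole cube into memory
seisnc = open_seisnc("big_volume.seisnc", chunks={"iline": 100})
print(seisnc.seis.humanbytes)   # size the full volume would occupy in memory
mean = seisnc.data.mean()       # builds a lazy dask computation, nothing is read yet
print(mean.compute().values)    # data is streamed chunk by chunk only at compute time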

etotten93 commented 9 months ago

Thanks @trhallam, starting to get the hang of this now! I have noticed one small hiccup. I am currently trying to look at the data one iline at a time (~800 traces). When I run segy_converter over all ~800 desired traces with:

segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185, offset=None, vert_domain="TWT",
               data_type="AMP", ix_crop=(233522, 233522, 716661, 723670), cdp_crop=None, xy_crop=None, z_crop=None,
               return_geometry=False, silent=False, extra_byte_fields=None)

which gives ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray. This makes sense, as the offset byte location is set to None. However, if I set the offset byte location (37), I get: TypeError: Simple selection can't process 0.0

This doesn't happen if I choose to plot, say, half the traces in one iline (~400 traces); everything works as expected. Is this something you have encountered (the offset byte location set to None in segy_converter)? Thanks!

trhallam commented 9 months ago

Can you please post the full stack trace you get when the error occurs?

I'd also add that your approach is a fairly inefficient way to scan through the data. Each time you run the converter it has to perform a full header scan. You can either scrape the headers once, store the DataFrame, and pass it back to the converter with the head_df argument, or you can convert the entire seismic cube to netcdf in one go.
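A rough sketch of the first option (the inline numbers, byte locations, and output filenames are illustrative):

from segysak.segy import segy_header_scrape, segy_converter

segy_file = "big_volume.sgy"  # hypothetical input file

# scrape the trace headers once and reuse the DataFrame for every conversion
head_df = segy_header_scrape(segy_file, silent=True)

for il in (233522, 233523):   # illustrative inline numbers
    segy_converter(
        segy_file, f"iline_{il}.nc",
        iline=189, xline=193, cdpx=181, cdpy=185,
        ix_crop=(il, il, 716661, 723677),
        head_df=head_df,      # avoids a fresh header scan on every call
    )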

etotten93 commented 9 months ago

For something that previously worked (over roughly 401 traces) when offset=None, setting offset=37 produces the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-437c26593efa> in <module>
    25 #01:08 in lecture -> use trace header map for this new variable
    26 seisnc_file = './nav_merge_050A069_1.nc'
---> 27 segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185,offset=37,vert_domain="TWT",\
    28 data_type="AMP",ix_crop=(233522,233522,716661,721167),cdp_crop=None,xy_crop=None,z_crop=None,return_geometry=False,\
    29 silent=False,extra_byte_fields=None)

~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in segy_converter(segyfile, ncfile, cdp, iline,     xline, cdpx, cdpy, offset, vert_domain, data_type, ix_crop, cdp_crop, xy_crop, z_crop, return_geometry, silent,          extra_byte_fields, **segyio_kwargs)
  1312
  1313         print("is_3d")
-> 1314         ds = _3dsegy_loader(
  1315             *common_args,
  1316             **common_kwargs,

~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in _3dsegy_loader(segyfile, head_df, head_bin,     head_loc, ncfile, vert_domain, data_type, crop, zcrop, silent, return_geometry, **segyio_kwargs)
   456     # prestack data
   457     if head_loc.offset is not None and not isinstance(ncfile, xr.Dataset):
--> 458         ds = _segy3dps_ncdf(
   459             segyfile,
   460             ncfile,

 ~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in _segy3dps_ncdf(segyfile, ncfile, segyio_kwargs, n0, ns, head_df, head_loc, vert_domain, silent)
   203         for contig, grp in head_df.groupby(contig_dir):
   204             for trc, val in grp.iterrows():
--> 205                 seisnc_data[val.il_index, val.xl_index, :, val.off_index] = segyf.trace[
    206                     trc
   207                 ][n0 : ns + 1]

~/anaconda3/PATH/TO/h5netcdf/core.py in __setitem__(self, key, value)
   356             # resize on write only for legacy API
   357             self._maybe_resize_dimensions(key, value)
--> 358         self._h5ds[key] = value
   359
   360     @property

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/anaconda3/PATH/TO/h5py/_hl/dataset.py in __setitem__(self, args, val)
   945
   946         # Perform the dataspace selection
--> 947         selection = sel.select(self.shape, args, dataset=self)
   948
   949         if selection.nselect == 0:

~/anaconda3/PATH/TO/h5py/_hl/selections.py in select(shape, args, dataset)
    80         selector = _selector.Selector(space)
    81
---> 82     return selector.make_selection(args)
    83
    84

h5py/_selector.pyx in h5py._selector.Selector.make_selection()

h5py/_selector.pyx in h5py._selector.Selector.apply_args()

TypeError: Simple selection can't process 0.0

etotten93 commented 9 months ago

And the error (when trying to plot all traces in one iline):

ValueError                                Traceback (most recent call last)
<ipython-input-1-76cd4d1cfd92> in <module>
    25 #01:08 in lecture -> use trace header map for this new variable
    26 seisnc_file = './XXXXX.nc'
---> 27 segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185,offset=None,vert_domain="TWT",\
    28 data_type="AMP",ix_crop=(233522,233522,716661,723677),cdp_crop=None,xy_crop=None,z_crop=None,return_geometry=False,\
    29 silent=False,extra_byte_fields=None)

~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in segy_converter(segyfile, ncfile, cdp, iline, xline, cdpx, cdpy,     offset, vert_domain, data_type, ix_crop, cdp_crop, xy_crop, z_crop, return_geometry, silent, extra_byte_fields, **segyio_kwargs)
  1345     indexer = indexer + ["off_index"] if offset is not None else indexer
  1346
-> 1347     ds = _loader_converter_write_headers(
  1348         ds, head_df, indexer, dims, extra_byte_fields, is3d2d=is3d2d
  1349     )

~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in _loader_converter_write_headers(ds, head_df, indexer, dims, extra_byte_fields, is3d2d)
   889     # we have some some geometry to assign headers to
   890     if is3d2d:
--> 891         head_ds = head_df.set_index(indexer).to_xarray()
   892         for key, field in extra_byte_fields.items():
   893             ds[key] = (dims, head_ds[field].values)

~/anaconda3/PATH/TO/pandas/core/generic.py in to_xarray(self)
  3107             return xarray.DataArray.from_series(self)
  3108         else:
-> 3109             return xarray.Dataset.from_dataframe(self)
  3110
  3111     @final

~/anaconda3/PATH/TO/xarray/core/dataset.py in from_dataframe(cls, dataframe, sparse)
  5508
  5509         if isinstance(idx, pd.MultiIndex) and not idx.is_unique:
-> 5510             raise ValueError(
  5511                 "cannot convert a DataFrame with a non-unique MultiIndex into xarray"
  5512             )

ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray

My code for the above looks something like:

from segysak.segy import segy_header_scan, segy_header_scrape, get_segy_texthead, segy_writer, segy_loader, segy_converter
from segysak import open_seisnc
import pandas as pd
import matplotlib.pyplot as plt

segy_file = './XXXXX.sgy'
get_segy_texthead(segy_file)

scan = segy_header_scan(segy_file, max_traces_scan=10000)
with pd.option_context("display.max_rows", 100):
    display(scan)

trace_headers = segy_header_scrape(segy_file, partial_scan=10000)
trace_headers.columns
with pd.option_context("display.max_rows", 5, "display.max_columns", 200):
    display(trace_headers)
with pd.option_context("display.max_rows", 10000):
    display(trace_headers[["INLINE_3D", "CROSSLINE_3D", "offset"]])

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 10), sharex=True)
for ax, prop in zip(axs.ravel(), ["CDP_X", "CDP_Y", "INLINE_3D", "CROSSLINE_3D"]):
    ax.plot(trace_headers[prop])
    ax.set_title(prop, fontdict={"fontsize": 18})

# Write other variables
#seisnc_vol["xy"] = seisnc_vol["cdp_x"] * seisnc_vol["cdp_y"] / 1E10
#01:08 in lecture -> use trace header map for this new variable
seisnc_file = './XXXXX.nc'
segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185, offset=None, vert_domain="TWT",
               data_type="AMP", ix_crop=(233522, 233522, 716661, 723677), cdp_crop=None, xy_crop=None, z_crop=None,
               return_geometry=False, silent=False, extra_byte_fields=None)
from dask.distributed import Client
client = Client()
client
client.cluster.scale(2, memory="0.5gb")
client
seisnc = open_seisnc(seisnc_file)
seisnc.seis.humanbytes
mean = seisnc.data.mean()
mean
mean.compute().values
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))
iline = seisnc.sel(iline=233522).transpose("twt", "xline").data
iline.plot(robust=True, yincrease=False)

And if I try to repeat this with a non-None offset byte location, I get the same error as in the previous message (TypeError: Simple selection can't process 0.0).

trhallam commented 9 months ago

Are you trying to overwrite an existing file?

etotten93 commented 9 months ago

> Are you trying to overwrite an existing file?

I have removed any .nc files from the run directory to check this, and I still get TypeError: Simple selection can't process 0.0

trhallam commented 9 months ago

If you remove the cropping information and set the return_geometry=True flag do you get the same error?

etotten93 commented 9 months ago

Sorry for the delay - I'm on a train. Setting ix_crop=None and return_geometry=True yields:

162k/162k [00:06<00:00, 23.9k traces/s] header_loaded is_3d

KeyError                                  Traceback (most recent call last)
<ipython-input-5-8306e15c9b17> in <module>
     25 #01:08 in lecture -> use trace header map for this new variable
     26 seisnc_file = './XXXXX.nc'
---> 27 segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185,offset=37,vert_domain="TWT",\
     28 data_type="AMP",ix_crop=None,cdp_crop=None,xy_crop=None,z_crop=None,return_geometry=True,\
     29 silent=False,extra_byte_fields=None)

~/anaconda3/PATH/TO/segysak/segy/_segy_loader.py in segy_converter(segyfile, ncfile, cdp, iline, xline, cdpx, cdpy, offset, vert_domain, data_type, ix_crop, cdp_crop, xy_crop, z_crop, return_geometry, silent, extra_byte_fields, **segyio_kwargs)
   1354     with h5netcdf.File(ncfile, "a") as seisnc:
   1355         for var, darray in new_vars.items():
-> 1356             seisnc_var = seisnc.create_variable(var, darray.dims, darray.dtype)
   1357             seisnc_var[...] = darray[...]
   1358             seisnc.flush()

~/anaconda3/PATH/TO/h5netcdf/core.py in create_variable(self, name, dimensions, dtype, data, fillvalue, chunks,     chunking_heuristic, **kwargs)
    867         for k in keys[:-1]:
    868             group = group._require_child_group(k)
--> 869         return group._create_child_variable(
    870             keys[-1],
    871             dimensions,

~/anaconda3/lib/PATH/TO/h5netcdf/core.py in _create_child_variable(self, name, dimensions, dtype, data, fillvalue, chunks, chunking_heuristic, **kwargs)
    692 
    693         # get shape from all dimensions
--> 694         shape = tuple(self._all_dimensions[d].size for d in dimensions)
    695         maxshape = tuple(self._all_dimensions[d]._maxsize for d in dimensions if d)
    696 

~/anaconda3/PATH/TO/h5netcdf/core.py in <genexpr>(.0)
    692 
    693         # get shape from all dimensions
--> 694         shape = tuple(self._all_dimensions[d].size for d in dimensions)
    695         maxshape = tuple(self._all_dimensions[d]._maxsize for d in dimensions if d)
    696 

~/anaconda3/PATH/TO/collections/__init__.py in __getitem__(self, key)
    939             except KeyError:
    940                 pass
--> 941         return self.__missing__(key)            # support subclasses that define __missing__
    942 
    943     def get(self, key, default=None):

~/anaconda3/PATH/TO/collections/__init__.py in __missing__(self, key)
    931 
   932     def __missing__(self, key):
--> 933         raise KeyError(key)
    934 
    935     def __getitem__(self, key):

KeyError: 'iline'

which appears to be a different error. Note that offset=37 here (setting it to None instead yields ValueError: cannot convert a DataFrame with a non-unique MultiIndex into xarray, with return_geometry=True and ix_crop=None). Thanks for your patience!

trhallam commented 9 months ago

Yeah, I'm not 100% sure what's going on here; often it's data related. Did you say the data was publicly available? I might have some time to look at it over the next few days.

Otherwise, I would say that the way you are using segysak is a bit counterintuitive. I'll try to rework your code to be more idiomatic tomorrow.

etotten93 commented 9 months ago

Thanks for your help, I'll get in touch tomorrow about this.

etotten93 commented 9 months ago

Having made good progress, I am wondering why segy_converter does not like being supplied with extra_byte_fields={"SourceX":73,"SourceY":77} (or other header attributes), erroring out with the following.

In [20]: segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185,offset=None,vert_domain="TWT",\
    ...: data_type="AMP",ix_crop=None,cdp_crop=None,xy_crop=None,z_crop=None,return_geometry=False,\
    ...: silent=False,extra_byte_fields={"SourceX":73,"SourceY":77})

ValueError                                Traceback (most recent call last)
in
----> 1 segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdpx=181, cdpy=185,offset=None,vert_domain="TWT",\
      2 data_type="AMP",ix_crop=None,cdp_crop=None,xy_crop=None,z_crop=None,return_geometry=False,\
      3 silent=False,extra_byte_fields={"SourceX":73,"SourceY":77})

~/anaconda3/XXXX/_segy_loader.py in segy_converter(segyfile, ncfile, cdp, iline, xline, cdpx, cdpy, offset, vert_domain, data_type, ix_crop, cdp_crop, xy_crop, z_crop, return_geometry, silent, extra_byte_fields, **segyio_kwargs)
   1276     check_tracefield(byte_loc)
   1277
-> 1278     head_df, head_bin, head_loc = _loader_converter_header_handling(
   1279         segyfile,
   1280         cdp=cdp,

~/anaconda3/XXXX/_segy_loader.py in _loader_converter_header_handling(segyfile, cdp, iline, xline, cdpx, cdpy, offset, vert_domain, data_type, ix_crop, cdp_crop, xy_crop, z_crop, return_geometry, silent, extra_byte_fields, head_df, optimised_load, **segyio_kwargs)
    788     if head_df is None:
    789         # Start by scraping the headers.
--> 790         head_df = segy_header_scrape(
    791             segyfile, silent=silent, bytes_filter=scrape_bytes, **segyio_kwargs
    792         )

~/anaconda3/XXXX/_segy_headers.py in segy_header_scrape(segyfile, partial_scan, silent, bytes_filter, chunk, **segyio_kwargs)
     82         pandas.DataFrame: Raw header information in table for scanned traces.
     83     """
---> 84     check_tracefield(bytes_filter)
     85
     86     assert (chunk > 0) and isinstance(chunk, int)

~/anaconda3/XXXX/_segy_core.py in check_tracefield(byte_list)
     43     ]
     44     if failed:
---> 45         raise ValueError(f"Invalid byte locations: {failed}")
     46
     47     return True

ValueError: Invalid byte locations: ['SourceX', 'SourceY']

Thanks!
trhallam commented 9 months ago

Please use keys that match values from segyio.tracefield.keys
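If you are unsure which names are accepted, a quick check is to print the mapping that segyio provides (segysak is built on segyio, so it should already be installed):

import segyio

# list the trace header names segyio knows about, together with their byte positions;
# the keys of extra_byte_fields should be chosen from these names
for name, byte in segyio.tracefield.keys.items():
    print(byte, name)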

teoeq commented 6 months ago

Has anyone tried to convert SEGY to netcdf that could be used as an input to ArcGIS Pro to create a voxel layer? Thanks

trhallam commented 6 months ago

Issue resolved by user.