jpjones76 / SeisIO.jl

Julia language support for geophysical time series data
http://seisio.readthedocs.org

Data request file naming #24

Closed tclements closed 4 years ago

tclements commented 4 years ago

When get_data(.., w=true) runs in write mode, the output file name depends on the source. If the source is "FDSN", the file is named YYYY.JJJ.HH.MM.SS.000000.FDSNWS.IRIS.mseed, whereas if the source is "IRIS", it is named YYYY.JJJ.HH.MM.SS.NET.STA.LOC.CHAN.R.mseed. For example:

S = get_data("FDSN", "UW.LON..BHZ", src="IRIS",s="2019-01-01",t=600,w=true) writes a file named 2019.1.00.00.00.000000.FDSNWS.IRIS.mseed

and

S = get_data("IRIS", "UW.LON..BHZ",s="2019-01-01",t=600,w=true) writes a file named 2019.1.00.00.00.UW.LON..BHZ.R.mseed.

This should be a simple fix to FDSNget!.
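
For reference, the IRIS-style name above can be rebuilt from the start time and the channel string. This is only a sketch: iris_style_name is a hypothetical helper, not a SeisIO function, and it assumes the day of year is left unpadded because that is what the reported file name shows.

```julia
using Dates

# Hypothetical helper, not part of SeisIO: rebuilds the IRIS-style file name
# reported above from a start time and a NET.STA.LOC.CHAN string.
function iris_style_name(t::DateTime, id::AbstractString)
    # Day of year is left unpadded to match the file name in the report.
    stamp = join((year(t), dayofyear(t),
                  lpad(hour(t), 2, '0'),
                  lpad(minute(t), 2, '0'),
                  lpad(second(t), 2, '0')), ".")
    return string(stamp, ".", id, ".R.mseed")
end

name = iris_style_name(DateTime(2019, 1, 1), "UW.LON..BHZ")
println(name)
```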

jpjones76 commented 4 years ago

I can change this very quickly for a single-channel request. Is that what you're looking for?

tclements commented 4 years ago

Yes, would be good to have consistent file naming, with network, station, location, and channel included.

jpjones76 commented 4 years ago

It's not an inconsistency on my part. It's a fundamental difference between IRIS timeseries and FDSN dataselect.

IRIS' timeseries service allows one channel per request and no wildcards. That's why I'm able to stream to file with an automatic channel name: the channel is known, and unique, when the request begins.

Not so for FDSN.

FDSN dataselect allows multiple channels and wildcard requests. If a request contains multiple channel strings, it's impossible to define an accurate naming convention without processing the download.

I've thought about changing this before, but FDSN filenames will only be accurate with that naming style when one requests a single channel at a time with no wildcards.

That defeats the purpose of FDSN.
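
To make the constraint concrete, here is a rough sketch of the check a request string would have to pass before a channel-based file name could be chosen up front. names_one_channel is a hypothetical helper, not part of SeisIO, and it assumes requests are written as NET.STA.LOC.CHAN with commas between channel strings and * or ? as wildcards.

```julia
# Hypothetical helper, not part of SeisIO: true only when the request string
# names exactly one fully specified channel.
function names_one_channel(req::AbstractString)
    occursin(',', req) && return false              # several channel strings
    any(c -> c in ('*', '?'), req) && return false  # wildcard characters
    return length(split(req, '.')) == 4             # NET.STA.LOC.CHAN
end

println(names_one_channel("UW.LON..BHZ"))              # one channel
println(names_one_channel("UW.LON..BH?"))              # wildcard
println(names_one_channel("UW.LON..BHZ,UW.LON..BHN"))  # two channel strings
```

Only the first request passes; for the other two, the set of channels in the download is not known until the data have been processed.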

A single-channel FDSN request adds overhead associated with opening and closing multiple HTTP requests. I'm not really sure why your group uses them, honestly. Unless requests are very long (e.g., Kurama's example of a 1-year request), the overhead is significant. HTTP requests are already slow relative to other data transfer protocols, and both FDSN and SeisIO are built around the concept of accessing multiple channels at once.

Moreover, if I automatically name files with FDSN, my choices are either to introduce a true internal inconsistency, to create file names so turgid that humans can't read them, or to create file names that contain no identifying information.

I can demonstrate this using your example:

So, I'm willing to modify how file naming works in FDSNget, but the options aren't great:

tclements commented 4 years ago

Thank you, I did not know about the FDSN overhead for a single channel. Could FDSN throw a warning if a user tries to get data for a single channel? For users, I think it should be explicit to use IRIS for single channel requests and FDSN for multi-station requests.

jpjones76 commented 4 years ago

I'll add that to the documentation as an explicit recommendation, but that choice only exists for data archived at IRIS. IRIS staff tell me that many channels listed in their metadata aren't actually there -- the issue your group was having with SCSN stations was caused by that.

jpjones76 commented 4 years ago

Oh, the overhead isn't from FDSN itself; I mean that there's overhead associated with requesting N channels one at a time when you only need one request. For a short request, concatenating requests with chans = join(all_channels, ",") then passing chans to get_data will be faster than N total get_data calls to N strings in all_channels.
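
A minimal sketch of the concatenation step, with the get_data calls left as comments because they need SeisIO and a network connection. The channel list is illustrative only; whether those channels exist at IRIS is an assumption.

```julia
# Illustrative channel list; these channel names are an assumption.
all_channels = ["UW.LON..BHZ", "UW.LON..BHN", "UW.LON..BHE"]

# One comma-separated request string instead of N separate strings.
chans = join(all_channels, ",")
println(chans)

# One request for everything (keywords as in the example at the top of this issue):
#   S = get_data("FDSN", chans, src="IRIS", s="2019-01-01", t=600)
# versus N requests, each opening and closing its own HTTP connection:
#   for c in all_channels
#       get_data("FDSN", c, src="IRIS", s="2019-01-01", t=600)
#   end
```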

Parallelization changes that, though. Requests like Kurama's benchmark are almost certainly faster with N parallel get_data calls because the overhead from making N calls is more than compensated for by the speed improvement of doing them in parallel. So that's not why he gets a low exponent in Fig. 4. When I incorporate his code into SeisIO core, I'll fork each string in the data request to one request on one CPU within get_data itself; reducing the number of get_data calls to one will eliminate much of the overhead and increase his exponent.
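
The general shape of "one request per channel string, run in parallel" might look like the sketch below. It is not the planned SeisIO implementation: it uses Threads.@spawn from Base rather than one process per CPU, and fake_request is a hypothetical stand-in for a real single-channel download.

```julia
# Hypothetical stand-in for a single-channel download; a real version would
# call get_data for the one channel string it receives.
function fake_request(ch::AbstractString)
    sleep(0.1)                 # pretend network latency
    return "data for $ch"
end

# Illustrative channel list; these channel names are an assumption.
all_channels = ["UW.LON..BHZ", "UW.LON..BHN", "UW.LON..BHE"]

# Start one task per channel string, then wait for all of them.
tasks = map(ch -> Threads.@spawn(fake_request(ch)), all_channels)
results = fetch.(tasks)

println(length(results))
```

The tasks spread across threads only when Julia is started with more than one thread (for example with julia -t 4).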

jpjones76 commented 4 years ago

The solution I recommended above is now live. Pass keyword autoname=true to generate IRIS-style file names in single-channel FDSN requests.