Data request file naming

tclements commented 4 years ago

When get_data(.., w=true) is in write mode, the output file name depends on the source. If the source is "FDSN", the output file name is YYYY.JJJ.HH.MM.SS.000000.FDSNWS.IRIS.mseed, whereas if the source is "IRIS", the output file name is YYYY.JJJ.HH.MM.SS.(NET).STA.LOC.CHAN.R.mseed. For example:

S = get_data("FDSN", "UW.LON..BHZ", src="IRIS",s="2019-01-01",t=600,w=true) writes a file named 2019.1.00.00.00.000000.FDSNWS.IRIS.mseed

and

S = get_data("IRIS", "UW.LON..BHZ",s="2019-01-01",t=600,w=true) writes a file named 2019.1.00.00.00.UW.LON..BHZ.R.mseed.
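The two patterns can be reproduced with a minimal sketch. The helpers `time_prefix`, `iris_style_name`, and `fdsn_style_name` are hypothetical names invented for illustration; they are not SeisIO functions, and the formats are inferred only from the two file names reported above.

```julia
using Dates

# Hypothetical helpers for illustration only; these are not part of SeisIO.

# Time prefix as it appears in the reported names: year, unpadded day of year,
# then zero-padded hour, minute, and second, separated by dots.
function time_prefix(t::DateTime)
    return join((string(year(t)), string(dayofyear(t)),
                 lpad(hour(t), 2, '0'), lpad(minute(t), 2, '0'),
                 lpad(second(t), 2, '0')), ".")
end

# IRIS-style name: the channel ID is known when the request begins.
iris_style_name(t::DateTime, id::AbstractString) =
    string(time_prefix(t), ".", id, ".R.mseed")

# FDSN-style name: only the service and data source are known up front.
fdsn_style_name(t::DateTime, src::AbstractString) =
    string(time_prefix(t), ".000000.FDSNWS.", src, ".mseed")

t0 = DateTime(2019, 1, 1)
println(iris_style_name(t0, "UW.LON..BHZ"))
println(fdsn_style_name(t0, "IRIS"))
```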

This should be a simple fix to FDSNget!.

jpjones76 commented 4 years ago

I can change this very quickly for a single-channel request. Is that what you're looking for?

tclements commented 4 years ago

Yes, would be good to have consistent file naming, with network, station, location, and channel included.

jpjones76 commented 4 years ago

It's not an inconsistency on my part. It's a fundamental difference between IRIS timeseries and FDSN dataselect.

IRIS' timeseries service allows one channel per request and no wildcards. That's why I'm able to stream to file with an automatic channel name: the channel is known, and unique, when the request begins.

Not so for FDSN.

FDSN dataselect allows multiple channels and wildcard requests. If a request contains multiple channel strings, it's impossible to define an accurate naming convention without processing the download.

I've thought about changing this before, but FDSN filenames will only be accurate with that naming style when one requests a single channel at a time with no wildcards.

That defeats the purpose of FDSN.

A single-channel FDSN request adds overhead associated with opening and closing multiple HTTP requests. I'm not really sure why your group uses them, honestly. Unless requests are very long (e.g., Kurama's example of a 1-year request), the overhead is significant. HTTP requests are already slow relative to other data transfer protocols, and both FDSN and SeisIO are built around the concept of accessing multiple channels at once...

Moreover, if I automatically name files with FDSN, my choices are either to introduce a true internal inconsistency, to create file names so turgid that humans can't read them, or to create file names that contain no identifying information.

I can demonstrate this using your example:

Suppose your channel string is "UW.LON..BHZ". That's easily mapped to a file named "2019.1.00.00.00.UW.LON..BHZ.R.mseed".
Wildcards can map to a "safe" character like the underscore, e.g. "UW.LON..B*" => "2019.1.00.00.00.UW.LON..B__.R.mseed".
"UW.LON..BH?, CC.VALT..*" becomes "2019.1.00.00.00.UW.LON..BH_.R-CC.VALT..___.R.mseed". This is getting turgid.
"UW.LON..BHZ, UW.LON..BHE, UW.LON..BHN" becomes "2019.1.00.00.00.UW.LON..BHZ.R-UW.LON..BHE.R-UW.LON..BHN.R.mseed". Ow...
This naming convention absolutely breaks at ~16 channel strings: most Linux file systems only allow file name lengths < 256 characters.
"UW.LON..BHZ, UW.LON..BHE, UW.LON..BHN" creates a new inconsistency.
- The file name will be "2019.1.00.00.00.UW.LON..BHZ.R-UW.LON..BHE.R-UW.LON..BHN.R.mseed".
- The same three channels requested by "UW.LON..B*" yield filename "2019.1.00.00.00.UW.LON..B__.R.mseed".
  - So the same data would no longer map to a unique filename.
This can't be resolved: if I replace non-unique character slots in the filename string with underscores, that breaks request strings like "UW.LON..BH?, CC.VALT..*" -- the filename is "2019.1.00.00.00.__.____..___.mseed". No one can tell what's in it.
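
The examples above can be checked with a short sketch. `name_token` and `joined_name` are hypothetical helpers written for this illustration, not SeisIO code. They assume a 3-character channel code, a trailing `*` that fills the rest of that code, and the `.R` quality suffix seen in the IRIS-style names.

```julia
# Hypothetical helper: map one channel string (NET.STA.LOC.CHA) to a name token.
function name_token(id::AbstractString)
    net, sta, loc, cha = split(strip(id), '.')
    # Assumption: a trailing '*' stands for the rest of a 3-character channel code.
    if endswith(cha, "*")
        cha = rpad(chop(cha), 3, '*')
    end
    # Mask every wildcard with the "safe" underscore.
    cha = replace(cha, r"[*?]" => "_")
    return string(net, ".", sta, ".", loc, ".", cha, ".R")
end

# Hypothetical helper: join the tokens of a comma-separated request into one name.
function joined_name(prefix::AbstractString, chans::AbstractString)
    tokens = [name_token(c) for c in split(chans, ',')]
    return string(prefix, ".", join(tokens, "-"), ".mseed")
end

prefix = "2019.1.00.00.00"
println(joined_name(prefix, "UW.LON..BH?, CC.VALT..*"))

# The same three channels, requested two ways, give two different names.
explicit = joined_name(prefix, "UW.LON..BHZ, UW.LON..BHE, UW.LON..BHN")
wildcard = joined_name(prefix, "UW.LON..B*")
println(explicit == wildcard)

# Name length grows linearly with the number of channel strings.
for n in (16, 17)
    name = joined_name(prefix, join(fill("UW.LON..BHZ", n), ","))
    println(n, " channel strings: ", ncodeunits(name), " bytes")
end
```

With these particular 13-character tokens the name passes 255 bytes at 17 channel strings; longer station or location codes would reach the limit sooner.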

So, I'm willing to modify how file naming works in FDSNget, but the options aren't great:

If I change the code to auto-generate names, I need a new Boolean KW. I think this is the best option.
- I can add that KW very quickly (it might take two hours with testing), but documentation of its behavior will be needed, because that introduces an internal inconsistency in FDSN file naming.
- Documentation will lag the push that adds it.
If I leave the code as is, then IRIS and FDSN filenames remain inconsistent.
- This is the current convention because their behavior is fundamentally different.
If I change FDSNget! to automatically name files without an additional KW, that introduces more problems than it solves, as the above example shows. I don't like this option.
- FDSN filenames become internally inconsistent by design.
- Requests can throw low-level errors if strings are too long; it imposes an unwritten maximum on the number of request strings, and that maximum is variable (depends on string length).

tclements commented 4 years ago

Thank you, I did not know about the FDSN overhead for a single channel. Could FDSN throw a warning if a user tries to get data for a single channel? For users, I think it should be made explicit that IRIS is for single-channel requests and FDSN is for multi-station requests.

jpjones76 commented 4 years ago

I'll add that to the documentation as an explicit recommendation, but that choice only exists for data archived at IRIS. IRIS staff tell me that many channels listed in their metadata aren't actually there -- the issue your group was having with SCSN stations was caused by that.

jpjones76 commented 4 years ago

Oh, the overhead isn't from FDSN itself; I mean that there's overhead associated with requesting N channels one at a time when you only need one request. For a short request, joining the channel strings with chans = join(all_channels, ",") and passing chans to a single get_data call will be faster than N separate get_data calls, one for each string in all_channels.
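
A minimal sketch of the joining step, with the network calls left as comments so the snippet runs without SeisIO or a connection. The channel list is an example, and the start time and duration are borrowed from the request at the top of this thread.

```julia
# Example channel strings; any list of NET.STA.LOC.CHA IDs would do.
all_channels = ["UW.LON..BHZ", "UW.LON..BHE", "UW.LON..BHN"]

# One comma-separated request string for a single get_data call.
chans = join(all_channels, ",")
println(chans)

# One HTTP request for all channels (requires SeisIO and network access):
# S = get_data("FDSN", chans, src="IRIS", s="2019-01-01", t=600)

# Versus N requests, one per channel, each paying the per-request overhead:
# for id in all_channels
#     get_data("FDSN", id, src="IRIS", s="2019-01-01", t=600)
# end
```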

Parallelization changes that, though. Requests like Kurama's benchmark are almost certainly faster with N parallel get_data calls because the overhead from making N calls is more than compensated for by the speed improvement of doing them in parallel. So that's not why he gets a low exponent in Fig. 4. When I incorporate his code into SeisIO core, I'll fork each string in the data request to one request on one CPU within get_data itself; reducing the number of get_data calls to one will eliminate much of the overhead and increase his exponent.
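
As a rough illustration of issuing one request per channel string concurrently, here is a sketch using Base's `asyncmap`. This is task-based concurrency on a single process, not the one-request-per-CPU design described above, and `fetch_one` is a hypothetical stand-in that only simulates latency.

```julia
# Hypothetical stand-in for a single-channel download. A real version might call
# get_data("FDSN", id, src="IRIS", s="2019-01-01", t=600) instead of sleeping.
function fetch_one(id::AbstractString)
    sleep(0.1)   # simulate network latency
    return id
end

all_channels = ["UW.LON..BHZ", "UW.LON..BHE", "UW.LON..BHN"]

# asyncmap runs the calls as concurrent tasks and returns results in input order,
# so the simulated waits overlap instead of adding up.
results = asyncmap(fetch_one, all_channels)
println(results)
```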

jpjones76 commented 4 years ago

The solution I recommended above is now live. Pass keyword autoname=true to generate IRIS-style file names in single-channel FDSN requests.
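
A small guard, as a sketch, for deciding when an IRIS-style name is meaningful. `is_single_channel` is a hypothetical helper and not part of SeisIO; only the `autoname` keyword comes from this thread, and the commented call reuses the request from the top of the issue.

```julia
# Hypothetical helper: true only for one channel string with no wildcards,
# which is the case where an IRIS-style name identifies the file contents.
is_single_channel(chans::AbstractString) =
    !occursin(',', chans) && !occursin(r"[*?]", chans)

println(is_single_channel("UW.LON..BHZ"))
println(is_single_channel("UW.LON..BH?, CC.VALT..*"))

# With SeisIO loaded and network access, a single-channel request could then be:
# S = get_data("FDSN", "UW.LON..BHZ", src="IRIS", s="2019-01-01", t=600,
#              w=true, autoname=true)
```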

jpjones76 / SeisIO.jl

Data request file naming #24