Future ASDF expansion

jpjones76 commented 4 years ago

Hi everyone,

I'd like to make this the new home of our (previous email) discussions on ASDF support.

Current status

SeisIO.SeisHDF has ASDF read/write support
Read support exists for Waveforms (including StationXML) and QuakeML via read_hdf5
Write support exists for Waveforms (including StationXML) and QuakeML via write_hdf5
Miscellaneous functions in SeisHDF allow finer control over read, write, and scan

In progress

Method extension to SeisNoise.jl
AuxiliaryData
- Planned for SeisNoise.jl
- General extension is not possible in SeisIO except with a thin wrapper to test for existence and prepend "AuxiliaryData/" to a path string
Provenance
- I've had a rudimentary reader working for some time but can't push to GitHub without test data. No one -- even Lion Krischer, who created ASDF -- has been able to send me a test file that associates Provenance with Waveforms as described in the file format.
- I tried testing Provenance with Lion's ASDF validator but the validator won't run. I emailed him about this last week. He seems very busy so a reply might take time.
- Writer NYI
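
As a rough illustration of the thin wrapper described under AuxiliaryData above, here is a minimal sketch. It assumes a nested dict standing in for the HDF5 group tree, and it is written in Python only to keep it self-contained. `aux_path` and `aux_exists` are hypothetical names, not SeisIO or pyasdf functions.

```python
AUX_ROOT = "AuxiliaryData"  # top-level group name used by ASDF


def aux_path(path: str) -> str:
    """Return `path` rooted under AuxiliaryData/ without doubling the prefix."""
    parts = [p for p in path.split("/") if p]  # drop empty pieces from stray slashes
    if parts and parts[0] == AUX_ROOT:
        parts = parts[1:]
    return "/".join([AUX_ROOT, *parts])


def aux_exists(tree: dict, path: str) -> bool:
    """Walk a nested dict (hypothetical stand-in for HDF5 groups) and test for the path."""
    node = tree
    for key in aux_path(path).split("/"):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


# Toy file layout, for illustration only
tree = {"AuxiliaryData": {"CrossCorrelation": {"XC1": [0.0]}}, "Waveforms": {}}
print(aux_path("CrossCorrelation/XC1"))
print(aux_exists(tree, "CrossCorrelation/XC2"))
```

Anything beyond the existence test and the prefix (what the data under that path means, how it maps to a Type) would still be left to the package that owns the data.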

Proposed extensions

Cross-correlations in AuxiliaryData: @mdenolle has sent me specifications for implementing read/write
@tclements notes that support for AuxiliaryData can expand to other communities (SPECFEM, receiver function packages, geodesy).
@tclements has suggested making ASDF a separate module.

jpjones76 commented 4 years ago

I'll state here that I don't like the idea of making ASDF a separate package at all. Too much work for too little payoff.

For example, if we separate ASDF from SeisIO:

What are the outputs from the ASDF reader?
What are the inputs to the ASDF writer?
What are the outputs from the XML readers?
What are the inputs to the XML writers?

If the answers are SeisData structures, or similar subtypes of GphysData, then why create a separate package?

If no answers involve SeisIO Types, then where do we store the information? Someone will need to create and test new Types for each. At best those might work like SeisIO structures with less flexibility. So the advantage is perhaps being able to rename a few fields. Is that worth hundreds of hours of our time?

(As a gentle reminder, I put >1000 hours into coding and testing SeisIO before we started working together. Types were the most time-consuming task. Working on new Types full-time would take >2 months if nothing more urgent came up in SeisIO itself -- a risky assumption.)

I don't see how separating ASDF is necessary or useful to make it appealing to other communities. SeisIO was created as a general package for using univariate geophysical data in Julia; it's not a dedicated back-end layer for ambient noise. ASDF, on the other hand, was created as a dedicated seismic data format; there's no guarantee that geodesists will even want to use it.

tclements commented 4 years ago

> If the answers are SeisData structures, or similar subtypes of GphysData, then why create a separate package?

To clarify, I am not in favor of separating SeisIO from ASDF. ASDF should depend on SeisIO for input/output (as will all future Julia Seismology packages). My reason for suggesting a new package is that I think what Josh has done with ASDF is substantial enough to warrant its own package (similar to how pyasdf and obspy interact and yet are separate Python modules).

> I don't see how separating ASDF is necessary or useful to make it appealing to other communities.

An ASDF workflow is different from a workflow using mseed/sac and stationxml, especially with parallel processing. My idea for the new package is to make ASDF documentation and examples more visible to potential users. I personally found it easier to use pyasdf as a standalone package, separate from obspy. Just my two cents.

Either way, whether we keep ASDF in SeisIO or make a new package, I think that as we get more use cases the onus should be on future package authors who use ASDF (such as myself with SeisNoise) to extend the auxiliarydata method for their own particular needs, rather than on Josh (less work for him).

jpjones76 commented 4 years ago

Oh, I understand now. Thank you for clarifying. The issue of the workflow I/O being different is an excellent point, but there's an underlying philosophical question: do we want SeisIO core to only teach users obsolete file I/O?

The geophysics of the future will use file formats like ASDF, not clumsy antiques like SEG Y or SEED. Even if ASDF gets replaced, the next file format will need to accommodate large volumes of long data segments, because that's what researchers use now. If ASDF remains in SeisIO core, we force new users to think about large volumes. This has other benefits: e.g., workflows that write each segment to one file vomit ~10^4 files all over the hard drive; ASDF won't. (This can be a major nuisance; e.g., on large SEED volumes, mseed2sac outputs enough files to break GNU ls.)

I agree that enabling HDF parallel read is different from anything I've done so far, but parallelization is already becoming part of SeisIO core. Much like large volumes, it seems like that's where research needs are going. Maybe we should talk separately about how and where parallelization gets incorporated?

tclements commented 4 years ago

That sounds good - I agree, ASDF needs to be heavily promoted over legacy data formats. Probably the best way to do this is with well-documented examples.

Let's discuss parallel processing in another thread.

jpjones76 commented 4 years ago

I've updated the tutorial to include a detailed section describing how to use ASDF read/write. The changes are live on dev and will go to master tomorrow. I've changed the wording of the tutorials so that ASDF is talked about much more than legacy data formats.

Question for @tclements: who should extend write_hdf5 to CorrelationData? Do you want to take that on, or should I do this one as a pull request? You know SeisNoise.jl better, but I have the most experience with ASDF; I can easily have it done this week.

tclements commented 4 years ago

If it would take you less than a few hours, go for it. Otherwise, I'd be happy to try it out and learn how HDF5 works under the hood.

jpjones76 / SeisIO.jl

Future ASDF expansion #26