smousavi05 / EQTransformer

EQTransformer, a python package for earthquake signal detection and phase picking using AI.
https://rebrand.ly/EQT-documentations
MIT License

STEAD data formatting and station component recognition #48

Closed Bryanrt-geophys closed 3 years ago

Bryanrt-geophys commented 3 years ago

STEAD

I am attempting to wrangle our data outputs into the STEAD format so we can train our own model. Are all variables in the format needed, or can some columns be left as NA?

Station Component Recognition

Also, when I downloaded the station list via makeStationList(), the list shows that a number of the stations in my list are 3 component stations. I made sure to select my desired channels to match the components listed in the station_list when I run downloadMseeds(). However, when I view the plot that color-codes each station's data over time as red for 3 component and purple for 1 component, all of my stations appear to be 1 component. I am a bit confused, as the station_list said many of the stations were 3 component and I selected those channels in the function as instructed. Any advice?

smousavi05 commented 3 years ago

@Bryanrt-geophys you don't need all columns, just those that are used by the EqT trainer module, like P and S arrival times, SNR, etc. Those plots show what has actually been downloaded. Your stations of interest might be 3C, but sometimes due to network traffic, API issues, etc., only some of the channels get downloaded. I don't know the actual reason for this or how to fix it; it is an IRIS/ObsPy issue. Sometimes trying at different times results in different sets of channels. I actually designed that plot for checking this.
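
If it helps, one way to double-check what actually landed on disk is to read the downloaded MiniSEED files back with ObsPy and list the channel codes found for each station. A minimal sketch, assuming the default `downloads_mseeds` output folder from the tutorial (adjust the glob pattern to your own output directory):

```python
# Sketch: list which channel codes were actually downloaded per station.
# Assumes the default "downloads_mseeds" output folder from downloadMseeds();
# adjust the glob pattern if you used a different output directory.
import glob
from collections import defaultdict

from obspy import read

channels_per_station = defaultdict(set)
for path in glob.glob("downloads_mseeds/*/*.mseed"):
    # headonly=True reads only the headers, which is enough for the channel codes
    for tr in read(path, headonly=True):
        channels_per_station[tr.stats.station].add(tr.stats.channel)

for station, channels in sorted(channels_per_station.items()):
    print(station, sorted(channels))
```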

Bryanrt-geophys commented 3 years ago

Thank you very much for letting me know to try some other date ranges to troubleshoot the component detection issue. I am still struggling with building a training model. I have gone through the trainer() function and have not been able to determine which columns from the 100samples.csv file are being used in the training. I have attached a spreadsheet with each column from the 100samples.csv file as a row, plus two additional columns documenting which files from our system's output contain relevant data to fill the respective variable, and whether those variables are necessary for the trainer() function. Would you mind taking a quick look at it and letting me know which ones I must have?

Also, the trainer() function shows that I need an hdf5 file to pair with my provided csv of sample events. I am unfamiliar with what information would be in that hdf5 file and how I go about getting that data. Is this a file that just describes the P, S, and coda? Is the hdf5 file something unique to the events I provide for training, or can the hdf5 file used in the tutorial be recycled here? If not, could you give me some guidance on how to go about getting this data? Is this what the source_id is for? In your paper "STanford EArthquake Dataset (STEAD): A Global Data Set of Seismic Signals for AI", you mention using the source_id to procure metadata from IRIS. Is that what you were referring to?

matched_data.xlsx

smousavi05 commented 3 years ago

@Bryanrt-geophys these are my suggestions: 1) keep the same number of columns and column names as in the 100samples.csv file; 2) the columns that you need for the training are: p_arrival_sample, s_arrival_sample, snr_db, coda_end_sample, and trace_name. You can fill the other columns with NaN. But don't forget that you need to stick to the exact format for the required columns. For instance, the trace names should be in the form stationName.networkName_time_EV (for earthquake examples) and _NO (for noise examples). The DataGenerator class in EqT_utils.py is where this info is used.
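
For illustration only, a minimal sketch of building one metadata row this way with pandas: the column layout is copied from 100samples.csv, only the columns named above are filled, and the station, network, and pick values are made-up placeholders.

```python
# Sketch: build one metadata row with the same column layout as 100samples.csv,
# filling only the columns the trainer needs and leaving everything else as NaN.
# The station, network, and pick values below are made-up placeholders.
import numpy as np
import pandas as pd

columns = pd.read_csv("100samples.csv").columns  # reuse the exact column names and order

row = {col: np.nan for col in columns}
row.update({
    "trace_name": "CA06.GS_20190902_EV",  # stationName.networkName_time_EV (use _NO for noise)
    "p_arrival_sample": 400,
    "s_arrival_sample": 1210,
    "coda_end_sample": 2800,
    "snr_db": 25.3,
})

pd.DataFrame([row], columns=columns).to_csv("my_metadata.csv", index=False)
```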

The hdf5 file contains the main waveform data used for the training. You basically put all of your waveforms in this file, plus all the labels in the csv file. You need to put all the data that you want to use for the training into the hdf5 file, so it can be only your data or a combination of yours and STEAD's. No, these are your own earthquake and/or non-earthquake signals that you want to train EqT on; the waveforms in STEAD were already used to train the current model.

Bryanrt-geophys commented 3 years ago

Okay, so within the hdf5 files, I should store the trace for each event at the respective station that the csv file will label. Is this correct? I often use R for data wrangling, and I am starting a tutorial on hdf5 files in R. Would you recommend managing hdf5 file creation through dedicated software, or is an open-source language fine?

Bryanrt-geophys commented 3 years ago

Okay, so I was able to read the hdf5 into R and saw that there were matrices for each source_id. I plotted each row of data within a matrix and saw that it is indeed the signal being passed to the function; the hdf5 files and the csv seem to be merged through the source_id. I am not sure, though, why each source_id matrix has three different rows that appear to be unique events. The source_id is unique to an event at a single station, right? I don't believe the matrix is the signal for each of your stations for a single event, correct?

Bryanrt-geophys commented 3 years ago

Oh, does this have something to do with the sensors being 3 component? Sorry to be thinking out loud here. As I said, I am not a seismologist, so most of this information is foreign to me and I'm catching on as I go. Thanks for being patient and so helpful.

smousavi05 commented 3 years ago

@Bryanrt-geophys right, traces are stored in the hdf5. This is basically a way to bundle all the tiny files. You need to create a dataset in the hdf5 file for each event waveform (3 component traces at each station) and add all the labels - which you would also have in your csv file - as attributes to that dataset. Below is an example of creating both the hdf5 and csv files from STEAD in Python:

```python
import csv
import h5py
import numpy as np

# this is the hdf5 file we want to create
HDF0 = h5py.File("dataset7/waveforms_04_24_20.hdf5", 'a')

# creating two groups for earthquake and noise signals
HDF0.create_group("earthquake/local")
HDF0.create_group("non_earthquake/noise")

# this writes the first row (columns' labels) into the csv file
csvfile = open('dataset7/metadata_04_24_20.csv', 'w')
output_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
output_writer.writerow(['network_code', 'receiver_code', 'receiver_type', 'receiver_latitude',
                        'receiver_longitude', 'receiver_elevation_m', 'p_arrival_sample',
                        'p_status', 'p_weight', 'p_travel_sec', 's_arrival_sample', 's_status',
                        's_weight', 'source_id', 'source_origin_time',
                        'source_origin_uncertainty_sec', 'source_latitude', 'source_longitude',
                        'source_error_sec', 'source_gap_deg', 'source_horizontal_uncertainty_km',
                        'source_depth_km', 'source_depth_uncertainty_km', 'source_magnitude',
                        'source_magnitude_type', 'source_magnitude_author',
                        'source_mechanism_strike_dip_rake', 'source_distance_deg',
                        'source_distance_km', 'back_azimuth_deg', 'snr_db', 'coda_end_sample',
                        'trace_start_time', 'trace_category', 'trace_name'])
csvfile.flush()

# here I read in the noise waveforms that I want to write into the new hdf5 file.
# These are already in hdf5 format and are part of STEAD.
HDF2 = h5py.File("new_noise.hdf5", "r")
nois2 = HDF2['noise']

# here I write each waveform (a numpy array - a matrix) into the hdf5 file
for ns2 in nois2:
    x = nois2[ns2]
    print(ns2)
    dat = np.array(x)

    dsF = HDF0.create_dataset("non_earthquake/noise/" + ns2, dat.shape, data=dat, dtype=np.float32)

    # after we write the 3c traces into the hdf5 as a dataset, we add the labels as attributes to it
    dsF.attrs['network_code'] = x.attrs['network_code']
    dsF.attrs['receiver_code'] = x.attrs['receiver_code']
    dsF.attrs['receiver_type'] = x.attrs['receiver_type']
    dsF.attrs['receiver_latitude'] = x.attrs['receiver_latitude']
    dsF.attrs['receiver_longitude'] = x.attrs['receiver_longitude']
    dsF.attrs['receiver_elevation_m'] = x.attrs['receiver_elevation_m']
    dsF.attrs['trace_category'] = x.attrs['trace_category']
    dsF.attrs['trace_name'] = x.attrs['trace_name']
    HDF0.flush()

    # here we also write the labels into the csv file
    output_writer.writerow([x.attrs['network_code'],
                            x.attrs['receiver_code'],
                            x.attrs['receiver_type'],
                            x.attrs['receiver_latitude'],
                            x.attrs['receiver_longitude'],
                            x.attrs['receiver_elevation_m'],
                            None, None, None, None, None, None, None,
                            None, None, None, None, None, None,
                            None, None, None, None, None, None, None,
                            None, None, None, None, None, None, None,
                            x.attrs['trace_category'],
                            x.attrs['trace_name']])
    csvfile.flush()
```

Bryanrt-geophys commented 3 years ago

Thank you so much for the code. I have a few last questions. In the sample data set, the 100samples.hdf5 only had one group for _EV, so I am assuming the algorithm's ability to find the data within the hdf5 is pretty robust. However, in the _EV objects, the signals listed for a single event's three components at a station are uniformly 3x6000 in size. Are these traces cut into segments of 6,000 samples (60 s at 100 samples per second), and is that a required sizing? Also, is the ordering of the channels in each object's trace rows a rigid format (e.g. row 1 NHZ, row 2 HHZ, row 3 EHZ), and do the rows and columns of the object require names for the algorithm to detect them? I tried using the names() function on the objects in R, but it returned the general designation R provides for unnamed rows and columns, and I wasn't sure how to check this in Python. Besides these questions, I have successfully built a function that creates hdf5 files with a group for _EV's and one for _NO's, with the corresponding attributes populated from my data. Thank you for all your help! As soon as I know the constraints on the structure of the trace data, I should be able to test it out.

smousavi05 commented 3 years ago

Yes, the 60 s trace size is required. The ordering of the components needs to be the same as well. The columns do not need names.
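
If it is useful, here is a small h5py sketch for checking this from Python: it walks an hdf5 file and prints each dataset's shape and attribute names, so you can compare your own file against the tutorial's 100samples.hdf5. The file name is just an example.

```python
# Sketch: walk an EQTransformer-style hdf5 file and print each dataset's
# shape and attribute names, to compare your own file against the sample one.
import h5py

def describe(name, obj):
    # visititems calls this for every group and dataset in the file
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, sorted(obj.attrs.keys()))

with h5py.File("100samples.hdf5", "r") as f:
    f.visititems(describe)
```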

Bryanrt-geophys commented 3 years ago

Hello @smousavi05 ,

I have run into 3 more questions:

  1. After looking at the data we have used historically, we have only saved SAC files containing traces from the vertical components of the stations in our area of study. My intent was to pass the SAC data for these traces to the hdf5 format for model training. However, in your training data set, the data were all 3C sets. As I said, I am not a seismologist, so I have a question: could I download the mseed data from IRIS for the timespan I am using to train our model, use the mseed2sac program IRIS provides to get the 3C SAC files, and pass those to our hdf5 file? I am unsure if this will work because the manual picks that I would be assigning as attributes in the hdf5 file were performed only on the vertical component of the trace, and I don't know if those picks can be assigned to all three components or if that amounts to making up data. Should I just stick with the vertical-component data I have?

  2. I do not fully understand the values in the p_arrival_sample, s_arrival_sample, and coda_end_sample columns. Are these a measurement of when the phase arrives at the sensor, counted in samples from the start of the trimmed trace? All of your values for p_arrival_sample are even multiples of 100, while the s and coda_end samples are not. Is this further explained anywhere?

  3. The SAC files our network has used are cut to be 3 minutes long, but it looked like yours are 60 s. Do I need to re-cut all of our traces to 60 seconds? If so, to catch the coda end on all traces, I imagine I need to cut the traces as close to the P arrival as possible, but if I cut too close to the P arrival I expect this to cause trouble in the training. Any recommendations? (Roughly what I have in mind for the re-cutting is sketched below.)
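
To make question 3 concrete, here is roughly the re-cutting I have in mind: a minimal ObsPy sketch, assuming 100 Hz data and a known P pick (the file name and pick time are made-up placeholders), that trims a 3-minute vertical trace to a 60 s, 6,000-sample window and works out the pick position in samples relative to the window start.

```python
# Sketch of the re-cutting I have in mind: trim a 3-minute vertical trace to a
# 60 s (6000-sample) window starting shortly before the P pick, and compute the
# pick position in samples relative to the window start.
# "example.sac" and the pick time are made-up placeholders.
from obspy import read, UTCDateTime

p_pick = UTCDateTime("2015-06-01T12:00:05.00")
pre_p = 4.0                                   # seconds of pre-P noise to keep

tr = read("example.sac")[0]
tr.detrend("demean")
if tr.stats.sampling_rate != 100.0:
    tr.resample(100.0)                        # the training traces are 100 Hz

window_start = p_pick - pre_p
tr.trim(window_start, window_start + 60.0)
data = tr.data[:6000]                         # keep exactly 6000 samples

p_arrival_sample = int(round((p_pick - window_start) * 100.0))
print(data.shape, p_arrival_sample)
```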