Model Retraining Data Sampling Rate

Bryanrt-geophys commented 3 years ago

Hello Mustafa,

I have written a code that will take the downloaded mseed files from EQTransformer, convert them to sac, splice them into 60-second intervals with a consistent 7-second buffer before a documented p-wave, pull the attributes out of legacy data from our by-hand system (SEISMOS), and write the hdf5 and csv file for retraining. However, while looking through your paper on STEAD format, it looks like you changed the sample rate for each station to 100 Hz for an exact 6000 samples per _EV object. Can EQTransformer work with data with other sampling rates (12000 samples) or do I need to make this conversion?

Also, the preprocessor function EQTransformer also appears to split the mseed data into 60-second intervals but it does not call on the training model the user provides to the trainer function. Could this introduce an issue with the index values used to assign the phase labels starting from different points, the point that I cut my SAC files at, and the point that the preprocessor cut the mseed values from? If so, should I skip the preprocessor step in the EQTransformer workflow?

I believe SEISMOS is a pretty archaic system that is not used much anymore, but I am planning on publishing my code as a package in R for converting SEISMOS data sets into EQTransformer training models. Hopefully, this could make the transition between systems more streamlined in the future for others. If you have any comments on the matter I would love t0o have your input.

Best, Bryan

smousavi05 commented 3 years ago

@Bryanrt-geophys Hi Bryan, You need to band-pass filter the data first and then resample it to 100Hz. You can check the preprocessor module of EqT as an example.

What the preprocessor function of EqT does is to make the downloaded data with the same requirement of the training data. The only difference is that the output Hdfs files of the preprocessor do not have labels (P , S arrivals , etc). The way you cut your data is fine and should not make a problem. The point that you cut your data is okay, you need to make sure that P and S waves exist in each 60 s window. The way preprocessor cut the data is different because in the test data we do not know where the earthquakes and P and S waves are. So we cut different windows with some overlapping in a hope that some windows contains all of the necessary info that triggers the model. You need to just apply the band-pass filtering, tapering, downsampling part of the preprocessing (the signal processing parts).

Yeah that would be helpful for others.

Regards, Mostafa

Bryanrt-geophys commented 3 years ago

Hello Mostafa,

Thank you so much for the explanation. You will have to be patient with my ignorance here, but if I have a sampling rate of 200 Hz for some stations, can I delete every other data point to simulate a 100 Hz sampling rate? I will dig into the preprocessing function to try to get at how to perform the processing steps.

I have also hit a wall on performing the snr_db calculation. I am looking at the equation in your STEAD paper but I am not fully understanding the explanation. It sounds like I need to isolate the data between an s-wave and the next p-wave. I then need to look at the distribution of amplitudes over that data set and take the 95th percentile and use that for either my S or N value? Does this sound like I'm heading in the right direction? I am not sure how to know the difference between S and N though. If this is all explained in the paper and I missed it, feel free to tell me to just go read it again. I just thought I would ask.

Thanks again for your help! Bryan

smousavi05 commented 3 years ago

Hi @Bryanrt-geophys you can do that as well but Obspy has multiple functions for resampling. That should be a better option as it handel other cases as well.

It might be jut easier to explain if by a coder. This is a helper function from EqT that calculates the SNR:

def _get_snr(data, pat, window = 200):

""" 

Estimates SNR.

Parameters
----------
data: NumPy array
    3 component data.     
pat: positive integer
    Sample point where a specific phase arrives.  
window: positive integer
    The length of the window for calculating the SNR (in the sample).         

Returns
-------   
snr : {float, None}
   Estimated SNR in db.   

"""      

snr = None
if pat:
    try:
        if int(pat) >= window and (int(pat)+window) < len(data):
            nw1 = data[int(pat)-window : int(pat)];
            sw1 = data[int(pat) : int(pat)+window];
            snr = round(10*math.log10((np.percentile(sw1,95)/np.percentile(nw1,95))**2), 1)           
        elif int(pat) < window and (int(pat)+window) < len(data):
            window = int(pat)
            nw1 = data[int(pat)-window : int(pat)];
            sw1 = data[int(pat) : int(pat)+window];
            snr = round(10*math.log10((np.percentile(sw1,95)/np.percentile(nw1,95))**2), 1)
        elif (int(pat)+window) > len(data):
            window = len(data)-int(pat)
            nw1 = data[int(pat)-window : int(pat)];
            sw1 = data[int(pat) : int(pat)+window];
            snr = round(10*math.log10((np.percentile(sw1,95)/np.percentile(nw1,95))**2), 1)    
    except Exception:
        pass
return snr

smousavi05 / EQTransformer

Model Retraining Data Sampling Rate #55