aringler-usgs closed this issue 5 years ago
I made a few modifications to Adam's code and re-ran it on the same 101 days of ANMO data I used in the original timing test. Observations
1) I am still seeing the pickles come out ~1.5x the size of the flat files: hourly pickle PSDs are ~350K while the text files are ~240K. I still get roughly 25 TB for text files and now 37 TB for pickles to do all of TA (assuming 2000 stations and 2 years of data).
2) I can confirm that the data request is now the bottleneck; however, I only see it being a factor of 3-6 slower than the PSD calculation.
3) The revised code is slower and had data completeness problems during my run. The original code returned 100% of PSDs and this one only returned 59.6%. The issue seems to be with the repeated requests to the DMC (lots of exceptions thrown in a row). The original code returned all PSDs in 39 minutes and this code took 54 minutes.
Hey, I am getting the following for 1 hour webservice requests:
It took 0.02124500274658203 s to write PSDs
It took 1.92600679397583 s to load in all the data
It took 0.03880476951599121 s to calculate the PSDs
Here is the txt file size:322369
Here is the pickle size:131224
It took 0.02248692512512207 s to write PSDs
It took 1.4095520973205566 s to load in all the data
It took 0.04191398620605469 s to calculate the PSDs
Here is the txt file size:322429
Here is the pickle size:131224
It took 0.02286505699157715 s to write PSDs
It took 1.3334143161773682 s to load in all the data
It took 0.04757404327392578 s to calculate the PSDs
Here is the txt file size:322331
Here is the pickle size:131224
The entire time for a 1 hour webservice request:
It took 73.65430402755737 s to do the entire computation
For a 1 day request I get the following:
It took 26.282557010650635 s to do the entire computation
For 2 day requests I start to get timeout issues.
You might make sure you are testing on the same windows that I am.
Here is a second look at them:
-rw-r--r-- 1 aringler staff 128K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00208.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00208.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00209.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00209.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00210.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00210.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00211.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00211.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00212.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:40 PSD_IU_ANMO_BHZ_2019_00212.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:41 PSD_IU_ANMO_BHZ_2019_00213.pckl
-rw-r--r-- 1 aringler staff 315K Aug 13 08:41 PSD_IU_ANMO_BHZ_2019_00213.txt
-rw-r--r-- 1 aringler staff 128K Aug 13 08:41 PSD_IU_ANMO_BHZ_2019_00214.pckl
So we agree on 2. However, 1 and 3 still show disagreement. Perhaps you can identify the source of error on these two and then assuming the timing is okay we can move forward.
We are done with 1) and left with 3).
*with python3.
3) Adam's new changes to the code crush it. I re-ran 2017 on N4 station T42B and it ran in 46 minutes. It took my original code 51.3 hours. Assuming all stations go like this, all of TA should take ~12k-15k hours of computation time for all components. If we use 20 cores on our server, that is a little under a month of processing time.
The code also returns ~93% of the PSDs that we would expect if the station had 100% uptime (it probably doesn't). One little issue to sort out will be ensuring that each PSD estimate begins exactly on the hour. I will take a stab at this.
We probably will also want to set up some database/file directory to store the outputs. And perhaps an additional script to query this database.
I think I have fixed the issue; every PSD now starts exactly on the hour.
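For reference, one way to force each PSD window onto the hour is to round the trace start time up to the next whole hour before windowing. A minimal sketch in plain epoch seconds (ObsPy's `UTCDateTime.timestamp` gives exactly this representation; the function name is my own):

```python
import math

def ceil_to_hour(timestamp):
    """Round an epoch timestamp (seconds) up to the next whole hour."""
    return math.ceil(timestamp / 3600.0) * 3600

# e.g. trim the trace so PSD windows begin exactly on the hour:
#   start = UTCDateTime(ceil_to_hour(tr.stats.starttime.timestamp))
#   tr.trim(starttime=start)
```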
Need to scrub location codes, as right now the code wildcards them from the inventory. The location codes are sometimes flipped in the stream, so the PSD database is a mix of location codes. We want a consistent location code.
I would just do something like this, but switch to location codes.
def scrub_inventory(inv):
    for net in inv:
        for sta in net:
            chans_remove = []
            for chan in sta:
                # drop strong-motion and low-rate channels
                if chan.code in ['HN1', 'HN2', 'HNZ', 'HNE', 'HNN',
                                 'LH1', 'LH2', 'LHZ', 'LHN', 'LHE',
                                 'VH1', 'VH2', 'VHZ', 'VHN', 'VHE',
                                 'UH1', 'UH2', 'UHZ', 'UHN', 'UHE']:
                    chans_remove.append(chan)
            for chan in chans_remove:
                try:
                    sta.channels.remove(chan)
                except ValueError:
                    # channel already removed
                    pass
    return inv
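Adapted to location codes, the same pattern might look like the sketch below. The `keep_loc` parameter and the stand-in inventory are my own for illustration; a real ObsPy `Inventory` nests networks, stations, and channels the same way:

```python
from types import SimpleNamespace

def scrub_locations(inv, keep_loc='00'):
    """Keep only channels whose location code matches keep_loc,
    so the PSD database uses one consistent location code."""
    for net in inv:
        for sta in net:
            sta.channels = [c for c in sta.channels
                            if c.location_code == keep_loc]
    return inv

# tiny stand-in inventory for illustration
def make_chan(loc):
    return SimpleNamespace(location_code=loc)

sta = SimpleNamespace(channels=[make_chan('00'), make_chan('10'), make_chan('00')])
inv = [[sta]]
scrub_locations(inv)
codes = [c.location_code for c in sta.channels]  # only '00' channels remain
```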
Could you run using the following:
screen
python PSD*.py
Ctrl-a d (detaches the screen session)
Ran the first test run on Friday evening (8/23/2019). Issues found so far:
1) When running a test case on a year, do not try to run over all of TA. Data does not exist for many of these stations due to the ~2 year emplacement of each TA site. Suggest "test case" flag on a small subset of Alaska TA stations.
2) PSD directories (line 39) do not appear to have been created for stations which did have data. The only files which appear to have been generated on this test run were log files
Both of these things are done.
Check out additional test runs (Find more bugs) @reanthony-usgs
When I switch to LH* I get the following ValueError: noverlap must be less than nperseg.
@reanthony-usgs
Need to add rerequest mechanism. @aringler-usgs
Need to add new mapping using cartopy @aringler-usgs
Need to add parsing/summary code methods.
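For the re-request mechanism, a generic retry wrapper is probably enough. This is only a sketch (the names and backoff value are my own); the real `client.get_waveforms` call would be passed in as `request`:

```python
import time

def fetch_with_retries(request, attempts=3, wait=10.0):
    """Call request(); on failure, wait and re-request up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return request()
        except Exception as exc:
            err = exc
            if attempt < attempts - 1:
                time.sleep(wait)  # back off before re-requesting
    raise err

# illustration: a call that fails twice before succeeding
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('timeout')
    return 'stream'

result = fetch_with_retries(flaky, wait=0.0)  # 'stream' after 3 tries
```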
4 is done - nfft must be shortened when using LH* data
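The LH* fix can be made automatic: shrink nfft until it fits in the window, which keeps noverlap (at 50%) strictly below nperseg and avoids the ValueError. A sketch with my own default values:

```python
def welch_params(fs, window_seconds=3600, max_nfft=2**16):
    """Pick an nfft that fits the window; low-rate data (e.g. 1 sps LH*)
    forces a shorter nfft so that noverlap < nperseg."""
    nsamples = int(fs * window_seconds)
    nfft = max_nfft
    while nfft > nsamples:
        nfft //= 2
    noverlap = nfft // 2  # 50% overlap, always less than nfft
    return nfft, noverlap

# 40 sps BH* keeps the full nfft; 1 sps LH* gets a shortened one
print(welch_params(40.0))  # (65536, 32768)
print(welch_params(1.0))   # (2048, 1024)
```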
Crashing on TA ALASKA 2018 run:
/usr/lib/python2.7/site-packages/obspy-1.1.1-py2.7-linux-x86_64.egg/obspy/clients/fdsn/client.py:867: UserWarning: Unknown Error (BadStatusLine): ''
warnings.warn(str(e))
/usr/lib/python2.7/site-packages/obspy-1.1.1-py2.7-linux-x86_64.egg/obspy/core/stream.py:3032: UserWarning: No matching response information found.
warnings.warn(str(e))
Traceback (most recent call last):
File "./PSD_Compression_Test.py", line 150, in
I think I fixed this bug. It was an issue with grabbing bad metadata. Try it out again.
Another bug running all of TA LHZ. This was during the day with a pool of 30 processors. The load was in the low 30s.
Traceback (most recent call last):
File "./PSD_Compression_Test.py", line 150, in
I have the following bug when I run your code:
File "noise_map.py", line 9, in <module>
per, psd = utils.get_psd_stats(path, sta, stime, etime, 10.)
File "/home/aringler/data_stuff/US_array/US_Noise_PSDs/utils.py", line 42, in get_psd_stats
a = pickle.load(pickle_file)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)
I am testing this on J20K
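That UnicodeDecodeError is the classic symptom of reading a Python 2 pickle under Python 3; `pickle.load` takes an `encoding` argument for exactly this case. A sketch of a compatible loader (the function name is my own):

```python
import io
import pickle

def load_psd_pickle(fileobj):
    """Load a pickle, falling back to latin1 decoding for
    pickles that were written under Python 2."""
    try:
        return pickle.load(fileobj)
    except UnicodeDecodeError:
        fileobj.seek(0)
        return pickle.load(fileobj, encoding='latin1')

# round-trip illustration with an in-memory file
buf = io.BytesIO()
pickle.dump({'per': [1.0], 'psd': [-140.0]}, buf, protocol=2)
buf.seek(0)
data = load_psd_pickle(buf)
```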
I also ran into the following issue:
directs = path.split('/')
f = directs[-1]
info = f.split('_')
if debug:
    print(info)
year, time = info[5], info[6]
day, hour = time[0:3], time[3:5]
This is coded for your path, not a general path.
Also, please move the collection of PSDs to /TEST_ARCHIVE_OLD
We need to add a flexible path to the PSD computation code.
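Parsing the basename instead of fixed path components would make this path-independent. A sketch that assumes the filename pattern from the listing earlier in the thread (the day/hour slicing can then be applied to the returned time code as before):

```python
import os

def parse_psd_filename(path):
    """Pull net/sta/chan/year/time-code out of a PSD filename,
    regardless of how many directories precede it."""
    name = os.path.splitext(os.path.basename(path))[0]
    _, net, sta, chan, year, tcode = name.split('_')
    return net, sta, chan, year, tcode

print(parse_psd_filename('/any/dir/PSD_IU_ANMO_BHZ_2019_00208.pckl'))
# ('IU', 'ANMO', 'BHZ', '2019', '00208')
```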
All these issues have been dealt with. I am closing this.
We need to check the following things. I will do 4, 5, and 6, once Rob checks 1, 2, and 3 again.
1) I am seeing the pckls come out at less than half the size of the txt files, around 128K while the text files are 302K. Can you verify this is what you are finding and in the right ballpark? What would the new size calculation be?
2) I am finding the data request to be about a factor of 100 slower than the PSD calculation. Can you confirm this?
3) We should check the timing with reorganized code. Do my changes speed things up?
4) We need to deal with multiple location codes (rare, but could throw off calculations).
5) We should parallelize this (I can do this once you verify 1, 2, and 3).
6) We should include a log file for problematic pulls from IRIS (I can do this).
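Once 1, 2, and 3 check out, items 5 and 6 can share one pattern: a worker that logs failed pulls instead of raising, mapped over stations by a pool. A sketch using the thread-backed `multiprocessing.dummy.Pool` for illustration (swap in `multiprocessing.Pool` for real CPU parallelism; the log filename and station list are placeholders):

```python
import logging
from multiprocessing.dummy import Pool  # thread pool; same API as multiprocessing.Pool

logging.basicConfig(filename='psd_pull.log', level=logging.INFO)  # placeholder log name

def compute_station_psds(station):
    """Compute all PSDs for one station, logging failed IRIS pulls
    instead of crashing the whole pool."""
    try:
        # placeholder for the real data request + PSD computation
        return station, 'ok'
    except Exception as exc:
        logging.error('%s: pull failed (%s)', station, exc)
        return station, None

stations = ['ANMO', 'T42B']  # placeholder subset
with Pool(2) as pool:
    results = pool.map(compute_station_psds, stations)
```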