Closed tclements closed 3 years ago
If all you want is the number of gaps in a file, then you don't need to read the entire file into memory. I can come up with a workaround that will scan each record and simply skip the blockettes. The pseudocode would be this, more or less:

1. Initialize `ngaps`, an `Array{Int,1}`; `end_times`, an `Array{Int64,1}`; `ids`, an `Array{Array{UInt8,1},1}`.
2. For each record, read the channel ID into an `Array{UInt8,1}`, call it `id`. Compare to each array in `ids`; if new, append to `ids`, then append a 0 to `ngaps`. Store index as `j`.
3. Compare the record start time to `end_times[j]`. If there's a gap, increment `ngaps[j]`. Update `end_times[j]`.
4. `seek` past all blockette data.

This is much faster than `read_data` because the latter spends most of its time decompressing blockette data. It also avoids string comparisons.
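The loop above can be sketched as below. This is not SeisIO's scanner, just a minimal illustration of the idea, with hypothetical simplifications: a fixed 512-byte record length, big-endian headers, a positive sample-rate factor and multiplier, and all records within one calendar year. A real scanner would take the record length and byte order from Blockette 1000 instead of assuming them.

```julia
# Minimal sketch of the record-scanning loop: read only each record's
# 48-byte fixed header, track per-channel expected end times, and seek
# past everything else. Assumes reclen-byte records and big-endian fields.
function count_gaps(fname::String; reclen::Int=512)
    ngaps     = Int[]               # gaps counted per channel
    end_times = Float64[]           # expected start of next record, seconds
    ids       = Array{UInt8,1}[]    # raw channel-ID bytes (no Strings)
    hdr = Array{UInt8,1}(undef, 48) # fixed data header is 48 bytes
    open(fname, "r") do io
        while !eof(io)
            rec_start = position(io)
            read!(io, hdr)
            # bytes 9:20 hold sta(5) + loc(2) + cha(3) + net(2)
            id = hdr[9:20]
            j = findfirst(isequal(id), ids)
            if j === nothing
                push!(ids, id); push!(ngaps, 0); push!(end_times, -1.0)
                j = length(ids)
            end
            # BTIME fields (bytes 21:30), reduced here to seconds-of-year
            doy = ntoh(reinterpret(UInt16, hdr[23:24])[1])
            t0  = 86400.0*(doy - 1) + 3600.0*hdr[25] + 60.0*hdr[26] +
                  Float64(hdr[27]) +
                  1.0e-4*ntoh(reinterpret(UInt16, hdr[29:30])[1])
            nsamp = ntoh(reinterpret(UInt16, hdr[31:32])[1])
            fac   = ntoh(reinterpret(Int16, hdr[33:34])[1])
            mul   = ntoh(reinterpret(Int16, hdr[35:36])[1])
            fs    = Float64(fac) * Float64(mul)   # valid only if both > 0
            # a gap = record start deviates from expected time by > ½ sample
            if end_times[j] >= 0.0 && abs(t0 - end_times[j]) > 0.5/fs
                ngaps[j] += 1
            end
            end_times[j] = t0 + nsamp/fs
            seek(io, rec_start + reclen)  # skip blockettes and data
        end
    end
    return ngaps, ids
end
```

The ID comparison stays on raw bytes the whole way through, which is what avoids the string allocations mentioned above.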
Were you thinking to parallelize the read and use an async/`wait()` approach? I suspect that passing data between CPUs will be slower than looping over records, unless you can think of a way to divide files into arbitrary-length chunks. The latter is hard, though, because we don't know where new records begin in the file. You'd need a reliable test to identify the start of a new record; I know of none.
Your solution seems the most efficient. My solution was to do an async read and an async `sleep(t)`, then to kill the read task if the sleep task completed. I didn't get very far because it seems killing tasks in Julia is still under discussion and not recommended: https://github.com/JuliaLang/julia/issues/6283
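For what it's worth, the timeout idea can be sketched without killing anything: race the task against a timer and just stop waiting for it. This is a minimal sketch (`with_timeout` is a hypothetical helper, not part of SeisIO), and the cost of not being able to kill tasks is that the orphaned read keeps running in the background.

```julia
# Run f() asynchronously and wait at most `timeout` seconds for it.
# Returns f()'s result on success, `nothing` on timeout; the task itself
# is never killed, only abandoned.
function with_timeout(f::Function, timeout::Real)
    task = @async f()
    status = timedwait(() -> istaskdone(task), Float64(timeout))
    return status == :ok ? fetch(task) : nothing
end
```

Note that `@async` schedules the task on the caller's thread, so this only bounds the wait; it doesn't offload the read to another CPU.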
Here is an example file with ~60,000 gaps that takes about 20 seconds to read on my desktop:
I'm working on this at present. I'm adding a `scan_seed` wrapper to the SEED submodule, which includes this. A wrapper will leave room for other scan types in the future; e.g., I can imagine users wanting a list of blockettes in each record.
Depending on how fast the scanner runs, I can try to code it to run in "background" (on a different CPU) with a keyword; I don't want to make that default behavior, though.
Does that sound ok?
This is awesome!
Trial run on my laptop. All I can say is "...lol". I knew Steim wasn't quick, but I wasn't expecting this much improvement.
```julia
julia> @benchmark read_data("mseed", fname)
BenchmarkTools.Trial:
  memory estimate:  80.66 GiB
  allocs estimate:  537515
  --------------
  minimum time:     25.790 s (3.12% GC)
  median time:      25.790 s (3.12% GC)
  mean time:        25.790 s (3.12% GC)
  maximum time:     25.790 s (3.12% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> @benchmark scan_seed(fname)
BenchmarkTools.Trial:
  memory estimate:  1.41 KiB
  allocs estimate:  21
  --------------
  minimum time:     17.568 ms (0.00% GC)
  median time:      17.732 ms (0.00% GC)
  mean time:        17.820 ms (0.00% GC)
  maximum time:     25.180 ms (0.00% GC)
  --------------
  samples:          281
  evals/sample:     1
```
All right, so I can add code to background this to another CPU, but I'm not sure we need to. At worst there will be another few ms to parse IDs and generate nicely-formatted String outputs as I finish up the wrapper.
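If backgrounding ever does become worthwhile, one minimal way to sketch it (hypothetical helper, not SeisIO code) is to hand the scan to a worker thread and return the `Task`, letting the caller `fetch()` the result whenever it's ready. Actual parallelism requires starting Julia with more than one thread (e.g. `julia -t 2`).

```julia
# Run any scanner function on a worker thread; caller fetches later.
# `scanner` is a stand-in for whatever scan function is used.
scan_in_background(scanner::Function, fname::String) = Threads.@spawn scanner(fname)
```

Usage would look like `t = scan_in_background(scan_seed, fname)` followed by `fetch(t)` once the result is needed.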
Aside: my memory allocation for the mseed reader is **f*king terrible** on this file. I don't understand what happened to my scaling. Gaps alone can't explain it. I suspect I need to turn off garbage collection. I'll investigate that more tomorrow.
Oh, and the output from my "alpha" version of `scan_seed`:

```julia
julia> fname = "/data/Downloads/CIGATR_HHZ___2017085.ms"

julia> (ngaps, ids) = scan_seed(fname)
([60048], Array{UInt8,1}[[0x47, 0x41, 0x54, 0x52, 0x20, 0x20, 0x20, 0x48, 0x48, 0x5a, 0x43, 0x49]])

julia> String(copy(ids[1]))
"GATR   HHZCI"

julia> S = read_data("mseed", fname)
SeisData with 1 channels (1 shown)
    ID: CI.GATR..HHZ
  NAME: CI.GATR..HHZ
   LOC: 0.0 N, 0.0 E, 0.0 m
    FS: 100.0
  GAIN: 1.0
  RESP: a0 1.0, f0 1.0, 0z, 0p
 UNITS:
   SRC: /data/Downloads/CIGATR_HHZ___2017…
  MISC: 0 entries
 NOTES: 1 entries
     T: 2017-03-26T00:00:00 (60048 gaps)
     X: -3.800e+01
        -1.090e+02
            ...
        -3.140e+02
        (nx = 17280000)
     C: 0 open, 0 total
```
Wow, great work! This is going to make a huge difference for getting consistent read times on big datasets.
Last night I pushed `scan_seed` to master as an addition to SeisIO.SEED. I wasn't sure whether or not you wanted output printed to stdout, so there are two options: tabulated results in stdout by default, or pass flag `quiet=true` to only return a String array with one String per channel.
I don't yet have an option to "background" this, though I'm sure I could with `@async` if need be. It runs so quickly that I'm not sure you'll need/want to. Here's a comparison from my laptop, running Julia 1.5.1 on Ubuntu 20.04:
```julia
julia> fname = "/data/Downloads/CIGATR_HHZ___2017085.ms"
"/data/Downloads/CIGATR_HHZ___2017085.ms"

julia> @benchmark scan_seed(fname, quiet=true)
BenchmarkTools.Trial:
  memory estimate:  2.19 KiB
  allocs estimate:  35
  --------------
  minimum time:     6.599 ms (0.00% GC)
  median time:      6.916 ms (0.00% GC)
  mean time:        6.963 ms (0.00% GC)
  maximum time:     9.203 ms (0.00% GC)
  --------------
  samples:          718
  evals/sample:     1

julia> @benchmark read_data(fname)
BenchmarkTools.Trial:
  memory estimate:  80.66 GiB
  allocs estimate:  537728
  --------------
  minimum time:     30.064 s (5.49% GC)
  median time:      30.064 s (5.49% GC)
  mean time:        30.064 s (5.49% GC)
  maximum time:     30.064 s (5.49% GC)
  --------------
  samples:          1
  evals/sample:     1
```
I'm still looking into why read_data performs so poorly on that file. I've never seen it over-allocate memory to that degree before; that's 2-3 orders of magnitude worse than any other file we've tested, including your group's other samples.
Trying out `scan_seed`, I came across an odd problem: `scan_seed` gets the number of gaps wrong for any file before `read_data` is called, but correct for any file after `read_data` is called:
```julia
julia> using SeisIO, SeisIO.SEED

julia> file = "/home/ubuntu/data/continuous_waveforms/2008/2008_001/CIADO__HHZ___2008001.ms"
"/home/ubuntu/data/continuous_waveforms/2008/2008_001/CIADO__HHZ___2008001.ms"

julia> scan_seed(file, quiet=true)[1]
"CI.ADO..HHZ, nx = 1079497, ngaps = 4752, nfs = 1"

julia> S = read_data("mseed", file)
SeisData with 1 channels (1 shown)
    ID: CI.ADO..HHZ
  NAME: CI.ADO..HHZ
   LOC: 0.0 N, 0.0 E, 0.0 m
    FS: 100.0
  GAIN: 1.0
  RESP: a0 1.0, f0 1.0, 0z, 0p
 UNITS:
   SRC: /home/ubuntu/data/continuous_wave…
  MISC: 0 entries
 NOTES: 1 entries
     T: 2008-01-01T00:00:00 (4 gaps)
     X: +4.160e+02
        +4.130e+02
            ...
        +2.770e+02
        (nx = 8639564)
     C: 0 open, 0 total

julia> scan_seed(file, quiet=true)[1]
"CI.ADO..HHZ, nx = 8639564, ngaps = 4, nfs = 1"
```
I can confirm that `scan_seed` works on other files after `read_data` is called. I'm guessing this has something to do with `SeisIO.BUF` but haven't investigated any further.
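A toy illustration of the suspected bug class (this is not SeisIO's actual code): when two functions share a global buffer and only one of them resets it, results depend on call order, which matches the "wrong before `read_data`, right after" symptom.

```julia
# Shared mutable global, as a stand-in for a module-level buffer like BUF.
mutable struct Buf
    nx::Int64
end
const BUF = Buf(0)

# Forgets to reset shared state: result depends on what ran before.
scan_stale(n::Int) = (BUF.nx += n)

# Resets shared state on entry: result is the same on every call.
scan_reset(n::Int) = (BUF.nx = 0; BUF.nx += n)
```

The fix for this class of bug is always the same: clear or reinitialize the shared buffer at the start of every entry point that uses it.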
Hmmm, I'm clearly failing to reset something in `:BUF`... maybe multiple things. I'll check this soon.
Found it. Testing a fix now.
I split the bug you found into a new issue; cause has been found, fixed, and pushed to dev.
I'm going to do a minor version release of SeisIO soon, so that `scan_seed` is included. This was a great idea, and it's proving extremely useful. Thank you for the great suggestion.
This is more of a discussion on mseed than an issue.

I have some mseed files with > 10,000 gaps in a dataset of ~3 million files. `read_data` takes about 20 seconds on the files with 10,000+ gaps each. This is expected and totally fine, but I'd like to avoid reading files such as this in the future when I process the rest of the dataset. I can't determine which files are bad a priori.

I'm looking for a way to get a rough estimate of the number of gaps in a file. Does this still require reading every blockette? Looking through `parserec!`, I'm not sure where to begin.

My current solution is to implement something like discussed here https://github.com/JuliaLang/julia/issues/36217 on top of `read_data`.