jpjones76 / SeisIO.jl

Julia language support for geophysical time series data
http://seisio.readthedocs.org

Short circuit for reading mseed with many gaps #62

Closed tclements closed 3 years ago

tclements commented 4 years ago

This is more of a discussion on mseed than an issue.

I have some mseed files with > 10,000 gaps in a dataset of ~3 million files. read_data takes about 20 seconds on each file with 10,000+ gaps. This is expected and totally fine, but I'd like to avoid reading files like this when I process the rest of the dataset; I can't determine which files are bad a priori.

I'm looking for a way to get a rough estimate of the number of gaps in a file. Does this still require reading every blockette? Looking through parserec!, I'm not sure where to begin.

My current solution is to implement something like the approach discussed in https://github.com/JuliaLang/julia/issues/36217 on top of read_data.

jpjones76 commented 3 years ago

If all you want is the number of gaps in a file, then you don't need to read the entire file into memory. I can come up with a workaround that will scan each record and simply skip the blockettes. The pseudocode would be this, more or less:

  1. Initialize ngaps, an Array{Int,1}; end_times, an Array{Int64,1}; ids, an Array{Array{UInt8,1},1}.
  2. Parse the (48-byte) "Fixed Section of Data Header".
  3. Read bytes 9-20 into an Array{UInt8,1}, call it id. Compare to each array in ids; if new, append to ids, then append a 0 to ngaps. Store index as j.
  4. Get record start time, number of samples, and fs.
  5. Compare to end_times[j]. If there's a gap, increment ngaps[j].
  6. Determine blockette size in bytes and seek past all blockette data.
  7. Repeat until EOF.
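The steps above can be sketched in Julia. This is a minimal toy, not real SEED parsing: the 8-byte skip, the placement of the 12-byte id, and the three Int64 time fields are stand-in layout choices for illustration only.

```julia
# Toy gap scanner following the pseudocode above. The record layout here is
# invented for illustration: real SEED headers store time as BTime fields and
# the payload is typically Steim-compressed, none of which matters for counting.
function count_gaps(io::IO, reclen::Int)
    ngaps     = Int[]              # per-channel gap counts
    end_times = Int64[]            # expected start time of each channel's next record
    ids       = Vector{UInt8}[]    # raw 12-byte channel ids
    while !eof(io)
        rec_start = position(io)
        skip(io, 8)                # toy: sequence number / record type
        id = read(io, 12)          # toy stand-in for header bytes 9-20
        t0 = read(io, Int64)       # toy: record start time
        n  = read(io, Int64)       # toy: number of samples
        dt = read(io, Int64)       # toy: sample interval (1/fs)
        j = findfirst(==(id), ids)
        if j === nothing           # new channel; no gap possible yet
            push!(ids, id); push!(ngaps, 0); push!(end_times, t0)
            j = length(ids)
        elseif t0 > end_times[j]   # record starts later than expected
            ngaps[j] += 1
        end
        end_times[j] = t0 + n * dt
        seek(io, rec_start + reclen)   # skip the data payload without decoding
    end
    return ngaps, ids
end

# Build three toy records for one channel; the third starts late (one gap).
function write_rec(buf, id, t0, n, dt, reclen)
    start = position(buf)
    write(buf, zeros(UInt8, 8))
    write(buf, id)
    write(buf, Int64(t0), Int64(n), Int64(dt))
    write(buf, zeros(UInt8, reclen - (position(buf) - start)))   # pad payload
end
buf = IOBuffer()
chan = Vector{UInt8}(codeunits("GATR   HHZCI"))
write_rec(buf, chan, 0,   100, 1, 64)
write_rec(buf, chan, 100, 100, 1, 64)   # contiguous with the first record
write_rec(buf, chan, 500, 100, 1, 64)   # gap before this record
seekstart(buf)
ngaps, ids = count_gaps(buf, 64)
```

The key point is step 6: seeking past the payload means the cost per record is a few header reads, independent of how the data are encoded.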

This is much faster than read_data because the latter spends most of its time decompressing blockette data. It also avoids string comparisons.

Were you thinking to parallelize the read and use an async/wait() approach? I suspect that passing data between CPUs will be slower than looping over records, unless you can think of a way to divide files into arbitrary-length chunks. The latter is hard, though, because we don't know where new records begin in the file. You'd need a reliable test to identify the start of a new record; I know of none.

tclements commented 3 years ago

Your solution seems the most efficient. My solution was to do an async read and an async sleep(t), then to kill the read task if the sleep task completed. I didn't get very far, because it seems killing tasks in Julia is still under discussion and not recommended (https://github.com/JuliaLang/julia/issues/6283).
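For reference, the pattern I was attempting can be sketched with timedwait. The names here (with_timeout, f, t) are just illustrative:

```julia
# Hedged sketch of the abandoned approach: run the read as a task, wait up
# to `t` seconds, and give up if it hasn't finished. The task itself cannot
# be killed (see JuliaLang/julia#6283); on timeout it keeps running in the
# background, which is the limitation that made me drop this.
function with_timeout(f, t::Real)
    task = @async f()
    if timedwait(() -> istaskdone(task), t) === :ok
        return fetch(task)     # finished in time: return its result
    end
    return nothing             # timed out: abandon (not kill) the task
end
```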

tclements commented 3 years ago

Here is an example file with ~60,000 gaps that takes about 20 seconds to read on my desktop:

CIGATR.zip

jpjones76 commented 3 years ago

I'm working on this at present. I'm adding a scan_seed wrapper to the SEED submodule that includes this functionality. A wrapper will leave room for other scan types in the future; e.g., I can imagine users wanting a list of the blockettes in each record.

Depending on how fast the scanner runs, I can try to code it to run in "background" (on a different CPU) with a keyword; I don't want to make that default behavior, though.

Does that sound ok?

tclements commented 3 years ago

This is awesome!

jpjones76 commented 3 years ago

Trial run on my laptop. All I can say is "...lol". I knew Steim wasn't quick, but I wasn't expecting this much improvement.

julia> @benchmark read_data("mseed", fname)
BenchmarkTools.Trial: 
  memory estimate:  80.66 GiB
  allocs estimate:  537515
  --------------
  minimum time:     25.790 s (3.12% GC)
  median time:      25.790 s (3.12% GC)
  mean time:        25.790 s (3.12% GC)
  maximum time:     25.790 s (3.12% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> @benchmark scan_seed(fname)
BenchmarkTools.Trial: 
  memory estimate:  1.41 KiB
  allocs estimate:  21
  --------------
  minimum time:     17.568 ms (0.00% GC)
  median time:      17.732 ms (0.00% GC)
  mean time:        17.820 ms (0.00% GC)
  maximum time:     25.180 ms (0.00% GC)
  --------------
  samples:          281
  evals/sample:     1

All right, so I can add code to background this to another CPU, but I'm not sure we need to. At worst there will be another few ms to parse IDs and generate nicely-formatted String outputs as I finish up the wrapper.

Aside: my memory allocation for the mseed reader is terrible on this file. I don't understand what happened to my scaling; gaps alone can't explain it. I suspect I need to turn off garbage collection. I'll investigate more tomorrow.

jpjones76 commented 3 years ago

Oh, and the output from my "alpha" version of scan_seed:

julia> fname = "/data/Downloads/CIGATR_HHZ___2017085.ms"

julia> (ngaps, ids) = scan_seed(fname)
([60048], Array{UInt8,1}[[0x47, 0x41, 0x54, 0x52, 0x20, 0x20, 0x20, 0x48, 0x48, 0x5a, 0x43, 0x49]])

julia> String(copy(ids[1]))
"GATR   HHZCI"

julia> S = read_data("mseed", fname)
SeisData with 1 channels (1 shown)
    ID: CI.GATR..HHZ                       
  NAME: CI.GATR..HHZ                       
   LOC: 0.0 N, 0.0 E, 0.0 m                
    FS: 100.0                              
  GAIN: 1.0                                
  RESP: a0 1.0, f0 1.0, 0z, 0p             
 UNITS:                                    
   SRC: /data/Downloads/CIGATR_HHZ___2017… 
  MISC: 0 entries                          
 NOTES: 1 entries                          
     T: 2017-03-26T00:00:00 (60048 gaps)   
     X: -3.800e+01                         
        -1.090e+02                         
            ...                            
        -3.140e+02                         
        (nx = 17280000)                    
     C: 0 open, 0 total
tclements commented 3 years ago

Wow, great work! This is going to make a huge difference for getting consistent read times on big datasets.

jpjones76 commented 3 years ago

Last night I pushed scan_seed to master as an addition to SeisIO.SEED. I wasn't sure whether or not you wanted output printed to stdout, so there are two options: by default, tabulated results are printed to stdout; pass the flag quiet=true to instead return only a String array with one String per channel.

I don't yet have an option to "background" this, though I'm sure I could with @async if need be. It runs so quickly that I'm not sure you'll need/want to. Here's a comparison from my laptop, running Julia 1.5.1 on Ubuntu 20.04:

julia> fname = "/data/Downloads/CIGATR_HHZ___2017085.ms"
"/data/Downloads/CIGATR_HHZ___2017085.ms"

julia> @benchmark scan_seed(fname, quiet=true)
BenchmarkTools.Trial: 
  memory estimate:  2.19 KiB
  allocs estimate:  35
  --------------
  minimum time:     6.599 ms (0.00% GC)
  median time:      6.916 ms (0.00% GC)
  mean time:        6.963 ms (0.00% GC)
  maximum time:     9.203 ms (0.00% GC)
  --------------
  samples:          718
  evals/sample:     1

julia> @benchmark read_data(fname)
BenchmarkTools.Trial: 
  memory estimate:  80.66 GiB
  allocs estimate:  537728
  --------------
  minimum time:     30.064 s (5.49% GC)
  median time:      30.064 s (5.49% GC)
  mean time:        30.064 s (5.49% GC)
  maximum time:     30.064 s (5.49% GC)
  --------------
  samples:          1
  evals/sample:     1

I'm still looking into why read_data performs so poorly on that file. I've never seen it over-allocate memory to that degree before; that's 2-3 orders of magnitude worse than any other file we've tested, including your group's other samples.

tclements commented 3 years ago

Trying out scan_seed, I came across an odd problem: scan_seed gets the number of gaps wrong for any file scanned before read_data is called, but correct for any file scanned after read_data is called:

julia> using SeisIO, SeisIO.SEED

julia> file = "/home/ubuntu/data/continuous_waveforms/2008/2008_001/CIADO__HHZ___2008001.ms"
"/home/ubuntu/data/continuous_waveforms/2008/2008_001/CIADO__HHZ___2008001.ms"

julia> scan_seed(file,quiet=true)[1]
"CI.ADO..HHZ, nx = 1079497, ngaps = 4752, nfs = 1"

julia> S = read_data("mseed",file)
SeisData with 1 channels (1 shown)
    ID: CI.ADO..HHZ                        
  NAME: CI.ADO..HHZ                        
   LOC: 0.0 N, 0.0 E, 0.0 m                
    FS: 100.0                              
  GAIN: 1.0                                
  RESP: a0 1.0, f0 1.0, 0z, 0p             
 UNITS:                                    
   SRC: /home/ubuntu/data/continuous_wave… 
  MISC: 0 entries                          
 NOTES: 1 entries                          
     T: 2008-01-01T00:00:00 (4 gaps)       
     X: +4.160e+02                         
        +4.130e+02                         
            ...                            
        +2.770e+02                         
        (nx = 8639564)                     
     C: 0 open, 0 total

julia> scan_seed(file,quiet=true)[1]
"CI.ADO..HHZ, nx = 8639564, ngaps = 4, nfs = 1"

I can confirm that scan_seed works on other files after read_data is called. I'm guessing this has something to do with SeisIO.BUF but haven't investigated any further.
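For anyone hitting something similar, the symptom matches a familiar bug class: a module-level scratch buffer whose fields persist between calls, so one function's results depend on what another function last left there. A hypothetical minimal reproduction (not SeisIO's actual BUF layout; all names invented) looks like:

```julia
# Hypothetical illustration of the stale-shared-state bug class.
mutable struct Scratch
    reclen::Int        # e.g. a cached record length
end
const SCRATCH = Scratch(0)   # module-level scratch, like SeisIO.BUF

# Buggy scanner: trusts whatever value is left in SCRATCH.
scan_buggy() = SCRATCH.reclen

# Full reader: sets the field as a side effect, "fixing" later scans.
function full_read()
    SCRATCH.reclen = 512
    return SCRATCH.reclen
end

# Fixed scanner: (re)derives shared state on entry instead of trusting it.
function scan_fixed()
    SCRATCH.reclen = 512
    return SCRATCH.reclen
end
```

Here scan_buggy returns 0 until full_read has run once, mirroring scan_seed being wrong before read_data and right after it.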

jpjones76 commented 3 years ago

Hmmm, I'm clearly failing to reset something in :BUF... maybe multiple things. I'll check this soon.

jpjones76 commented 3 years ago

Found it. Testing a fix now.

jpjones76 commented 3 years ago

I split the bug you found into a new issue; the cause has been found and fixed, and the fix pushed to dev.

jpjones76 commented 3 years ago

I'm going to do a minor version release of SeisIO soon, so that scan_seed is included. This was a great idea, and it's proving extremely useful. Thank you for the great suggestion.