krischer / instaseis

Instant high-frequency seismograms from an AxiSEM database
http://instaseis.net

Optimizations #35

Closed krischer closed 8 years ago

krischer commented 8 years ago

Hi all,

@sstaehler @martinvandriel @tnissen @alexhutko @CTrabant - pinging you all as it's somehow relevant to all of us.

I spent some time last week and today trying to make Instaseis (and by extension Syngine) faster. So far I have only looked at I/O performance as that is the dominant part; everything else can be made faster by just throwing more CPUs at it.

To extract a single 3-component seismogram we have to access data from 5 arrays across two netCDF files. From each array we have to get data for all 25 GLL points. In the worst case this results in 125 read operations for each 3C seismogram.

Currently this data is stored in chunked netCDF files (more about chunks) with one chunk per GLL point time series (that's what makes it fairly performant even though the data is accessed with a loop over the slow index). Nonetheless, 125 chunks have to be found (HDF5 traverses a B-tree for that) and read.

This PR does a couple of things in an attempt to speed things up.

  1. Replace the Python netcdf4 package with h5py, a Python package to read HDF5 files. Newer netCDF versions are based on HDF5, so we essentially skip one layer of software, which is always nice.
  2. Batch I/O operations when using the current database structure (a small sketch follows this list). This is a bit faster, likely due to less function call overhead and maybe some internal h5py/HDF5 optimizations.
  3. Try different HDF5/netCDF structures/settings. I implemented a couple but only intend to keep one - otherwise the maintenance overhead is too much.
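
As an illustration of point 2, a minimal sketch of what batching the per-GLL-point reads can look like with h5py - the dataset name, the index values, and the (points x time) orientation are assumptions for the example, not the actual Instaseis code:

    import h5py
    import numpy as np

    with h5py.File("ordered_output.nc4", "r") as f:
        dset = f["Snapshots/disp_z"]              # assumed layout: (gll_point, time)
        gll_indices = np.arange(1000, 1025)       # 25 illustrative GLL point indices

        # Naive variant: one h5py call (and one chunk/B-tree lookup) per point.
        naive = np.empty((len(gll_indices), dset.shape[1]), dtype=dset.dtype)
        for i, idx in enumerate(gll_indices):
            naive[i] = dset[idx, :]

        # Batched variant: a single call using h5py fancy indexing
        # (the indices must be given in increasing order).
        batched = dset[sorted(int(i) for i in gll_indices), :]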

The most promising database restructurings are:

Two things that might still be worthwhile to try are (I'll be pretty busy the next two weeks - please feel free to try it):

What trade-off between file size and extraction speed should we make, also with the Syngine use case in mind? Is doubling the file size OK? Instaseis will always support the current database structure, but I will also add support for one more layout. Which one is best?

The following measurements have to be taken with a large grain of salt. It is really hard to make these measurements as multiple layers of caches are active at any point in time (OS caches/HDF5 caches/Instaseis caches/...). If you try this, please clear the operating system disc buffers before making the measurements. Also use a database at least a couple of times the size of your system's memory - the buffers are surprisingly effective and small databases skew the results; we essentially only care about performance for the large databases.

Furthermore, the results are highly dependent on the system, the I/O architecture, and the current load.
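
For reference, on Linux the OS page cache can be dropped (as root, ideally after a sync) with the command also quoted further down in this thread:

    echo 3 > /proc/sys/vm/drop_caches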

I made the measurements with

python -m instaseis.benchmark --pattern UnbufferedFullyRandom --time=20 database_name

This branch also contains a tool to convert databases from the current layout to another layout:

python /../instaseis/scripts/repack_instaseis_database.py --method=unrolled_merge DB DB_unrolled

This can take a very long time!

Benchmarks

2s database over a pretty slow network NAS

Current DB Structure (1.1 TB) and current Instaseis Master:
    1.3531 sec/seismogram
    0.739046 seismograms/sec
Current DB Structure (1.1 TB) and h5py Branch:
    1.12009 sec/seismogram
    0.892785 seismograms/sec
Transposed and non-chunked DB (1.4 TB) and h5py Branch:
    0.355843 sec/seismogram
    2.81023 seismograms/sec
Unrolled and unchunked DB (2.2 TB) and h5py Branch:
    0.239713 sec/seismogram
    4.17166 seismograms/sec

5s mars database on an external USB hard disc

Current DB Structure (37 GB) and current Instaseis Master:
    0.446772 sec/seismogram
    2.23828 seismograms/sec
Current DB Structure (37 GB) and h5py Branch:
    0.331634 sec/seismogram
    3.01537 seismograms/sec
Transposed and non-chunked DB (54 GB) and h5py Branch:
    0.208579 sec/seismogram
    4.79434 seismograms/sec
Unrolled and unchunked DB (83 GB) and h5py Branch:
    0.0630697 sec/seismogram
    15.8555 seismograms/sec
sstaehler commented 8 years ago

Hi Lion,

thanks for these tests, which are highly relevant for kernel calculation as well. In HPC environments, the crucial point would be limiting the number of file accesses, which is served by your "unrolling" as well.

I would vote for your complete unrolling solution:

Take all 5 arrays from the two files and restructure the file so the data for all GLL points from a single element can be acquired with just a single read operation. Thus the data is stored in a single large 5D array (element id, time_axis, gll_id_i, gll_id_j, direction). This means that duplicate GLL points are no longer discarded - the file is also no longer chunked and thus not compressed. This results in much bigger files but also considerably faster access.

For the kernel code it could mean replacing 250 file accesses by one (in the best case), which is obviously highly beneficial.
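
To make the access pattern concrete, a sketch of what reading everything for one element from such an unrolled 5D array could look like - the file and dataset names are made up for the example:

    import h5py

    with h5py.File("db_unrolled.h5", "r") as f:
        # Assumed 5D layout: (element_id, time, gll_i, gll_j, direction).
        merged = f["merged_snapshots"]
        elem_id = 123456
        # One contiguous read returns all time steps, GLL points and components
        # for this element, shape (n_time, 5, 5, n_directions).
        data = merged[elem_id]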

I am still not quite convinced about the "transpose" question. As I understand it, NetCDF/HDF transposes Fortran arrays before writing them anyway.

sstaehler commented 8 years ago

Maybe one usability hint: The transformation script should probably take the original axisem_output.nc4 files as input, so that we save one level of intermediate files.

Also, the script could be moved to the AxiSEM repository.

krischer commented 8 years ago

I am still not quite convinced about the "transpose" question. As I understand it, NetCDF/HDF transposes Fortran arrays before writing them anyway.

I'm not sure what the Fortran APIs are doing, but within HDF5 data is always stored in row-major order. So yes - the conversion has to happen somewhere.

The first thing I tried was to use the h5repack tool and just turn off chunking and compression. Access was incredibly slow: even though the data had been transposed before it was written to the file, the fast index of the data did not correspond to the fast index we require for the data extraction. The only reason this worked reasonably well with the existing databases is that the chunking forced the data to be written to disc in a different pattern.
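
The effect is easy to reproduce with any contiguous 2D dataset - reading along the fast (last) axis is one contiguous read, while reading along the slow axis touches the whole file (names and shapes below are illustrative):

    import h5py

    with h5py.File("example.h5", "r") as f:
        d = f["disp_z"]       # contiguous, shape (n_time, n_points), stored row-major
        fast = d[0, :]        # one row: a single contiguous read
        slow = d[:, 0]        # one column: n_time widely scattered small reads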

Maybe one usability hint: The transformation script should probably take the original axisem_output.nc4 files as input, so that we save one level of intermediate files.

Also, the script could be moved to the AxiSEM repository.

Both are good points. I'll clean up the script once we've decided what to do.

sstaehler commented 8 years ago

Okay, another critical issue I noticed is that the resulting file is not a valid NetCDF4 file anymore.

I could not pinpoint the reason yet, but I assume it is due to missing dimensions in the output file. One of the two big differences between NetCDF4 and HDF5 is that the former uses named and shared dimensions (lat, lon, x, y, z, time, depth, etc.), while the latter just uses array shapes. See https://www.unidata.ucar.edu/blogs/developer/en/entry/dimensions_scales and https://www.unidata.ucar.edu/software/netcdf/docs/interoperability_hdf5.html#reading_with_hdf5

Since the kerner will use NetCDF4 for the foreseeable future, and performance-wise it is just as critical, I would like to have valid NetCDF4 files as the output of the script. I see two possibilities:

  1. Use the NetCDF4 module for the conversion script (we do not really care about the performance of the conversion script, do we?)
  2. Attach a dimension scale to each output variable. That seems rather tedious.
martinvandriel commented 8 years ago

First of all, thanks for this initiative. My feeling is we might want to move in two directions with different applications in mind:

My suggestion would be to have at most two storage patterns, optimized for these two directions.

Anyway, both can be done in postprocessing (currently called 'fieldtransform') in Python and don't require changes in the Fortran code, so this keeps things relatively simple.

sstaehler commented 8 years ago

@martinvandriel: agreed; we should probably move the functionality into fieldtransform.py. This would also solve the NetCDF4 problem.

The IRIS guys could then decide which storage pattern they prefer.

krischer commented 8 years ago

Okay, another critical issue I noticed is that the resulting file is not a valid NetCDF4 file anymore.

That should be easy to fix. I think I just have to copy the dimension scales of the datasets - they also exist in HDF5. I'll make sure the files stay valid netCDF files when I polish the script.
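
For the record, a hedged sketch of what attaching a dimension scale with h5py looks like - the dataset names are illustrative, make_scale() needs a reasonably recent h5py, and whether ncdump then accepts the file still has to be verified:

    import h5py
    import numpy as np

    with h5py.File("repacked.nc4", "a") as f:
        # assumes a data variable "disp_z" already exists with time as its first axis
        time = f.create_dataset("time", data=np.arange(1000, dtype="f8"))
        time.make_scale("time")                   # mark the dataset as a dimension scale
        f["disp_z"].dims[0].attach_scale(time)    # attach it to the data variable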

My suggestion would be to have at maximum two storage patterns, that are optimized for these two directions.

We will always support the current storage pattern as quite a few people already have databases, and then a new, fast one. Is that ok? We could probably optimize the current layout to be faster without requiring more space, but then we would have to support three patterns, which I would like to avoid.

Here, storage is the crucial part and even some additional CPU overhead with respect to the current master would be acceptable, if that allows additional compression.

You could always use a higher gzip compression level or szip compression. Also, larger chunks should compress better as compression is applied per chunk. As long as the logical HDF5 data layout does not change, Instaseis will continue to work fine.
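
A sketch of what such a rewrite could look like for a single array - the names, the chunk shape, and the (many points x time) orientation are assumptions, and szip is only usable if the local HDF5 build includes it:

    import h5py

    with h5py.File("db_in.h5", "r") as fin, h5py.File("db_out.h5", "w") as fout:
        src = fin["Snapshots/disp_z"]             # assumed shape: (n_points, n_time)
        dst = fout.create_dataset(
            "disp_z", shape=src.shape, dtype=src.dtype,
            chunks=(8, src.shape[1]),             # 8 point time series per chunk
            compression="gzip", compression_opts=9)
        for start in range(0, src.shape[0], 8):   # copy block-wise to limit memory use
            dst[start:start + 8] = src[start:start + 8]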

sstaehler commented 8 years ago

Okay, another critical issue I noticed is that the resulting file is not a valid NetCDF4 file anymore. That should be easy to fix. I think I just have to copy the dimension scales of the datasets - they also exist in HDF5. I'll make sure the files stay valid netCDF files when I polish the script.

If it's easy to do in h5py, then I don't care which library is used. But it's also needed for all the mesh variables, which are now just copied in a loop. For testing: If

ncdump -hs file.nc4 

can read it, it should be fine.

martinvandriel commented 8 years ago

You could always use a higher gzip compression level or szip compression. Also, larger chunks should compress better as compression is applied per chunk.

These things don't really help, because floats look very much like random numbers, hence have high entropy and don't compress well (only the zeroes before the P arrival compress well). So the only promising strategy I see here would be to use e.g. wavelets or predictor/corrector methods and convert to integers to make things compressible.
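
A toy numpy sketch of the predictor-plus-integer idea - quantise to a user-chosen step (this is lossy) and store first differences, which contain far less entropy than the raw floats:

    import numpy as np

    def encode(trace, step=1e-12):
        """Quantise to integer multiples of `step` and store first differences."""
        q = np.round(trace / step).astype(np.int32)
        d = np.empty_like(q)
        d[0] = q[0]
        d[1:] = np.diff(q)
        return d

    def decode(deltas, step=1e-12):
        """Invert encode(); the result is exact up to the quantisation step."""
        return np.cumsum(deltas, dtype=np.int64).astype(np.float64) * step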

sstaehler commented 8 years ago

Dear @alexhutko @CTrabant,

since another code on my side would depend on these changes, I'd like to reach a decision soon. What is your opinion on the trade-off between performance and disk space?

chad-earthscope commented 8 years ago

Well, based solely on Lion's tests I would have chosen a middle solution (Transposed and non-chunked) to get a performance gain without doubling the file size. Syngine doesn't really fit into either of Martin's target use cases in that we know the I/O is heavy but storage size must be considered.

Let us get farther into testing the variations on our storage system and we'll have more information to provide. I expect to have some benchmarks in the next couple of days.

krischer commented 8 years ago

If you are willing to actually test on your storage system use this script: https://github.com/krischer/instaseis/blob/h5py/instaseis/scripts/repack_instaseis_database.py

It needs h5py, click, and numpy (all in conda and pip I think) and maybe Python 3.

Convert to both with:

python /path/to/repack_instaseis_database.py --method=continuous INPUT_DB OUTPUT_DB

and

python /path/to/repack_instaseis_database.py --method=unrolled_merge INPUT_DB OUTPUT_DB

The first one is fairly fast, the second one a bit slower. The second one as of now also only works with databases that have horizontal and vertical components.

And benchmark with the call I posted above or, even better, benchmark the Syngine service if you have a benchmarking utility set up.

chad-earthscope commented 8 years ago

Status: we're converting a test database (ak135f_2s, du 1.5 TiB). The unrolled_merge is taking a very long time; "Copying group 'snapshots'" is at 16% after ~21 hours.

krischer commented 8 years ago

Hmm... that should not be the case. The snapshots groups should be copied almost immediately, and copying the rest should take around 20-30 hours - at least on our fairly slow network NAS.

It should spend almost all time in "Copying group 'gllpoints_all'".

How big is the file currently? If it's less than a couple of hundred GB, please abort and restart. If it does not reach the 'gllpoints_all' step within a minute or so, I might have to look at the script again. In that case: could you somehow give me access to your system? That would vastly simplify debugging. I could send you my public ssh key, for example.

chad-earthscope commented 8 years ago

Current output for unrolled_merge is 372 GB. It is already past the gllpoints_all:

.../repack_instaseis_database.py --method=unrolled_merge ../MODELS/ak135f_2s ./ak135f_2s_unrolled_merge
    Copying group 'Surface'...
    Copying group 'Mesh'...
    Copying group 'gllpoints_all'...
    Copying group 'snapshots'...
        [######------------------------------]   16%  11:11:26

The continuous version took 11.3 hours to convert.

I'm reluctant to restart without changing anything as I seriously doubt it will be different (unless this is running in Windows-mode where restarting magically fixes things?). It is definitely using CPU and increasing the output volume. I can start another conversion for testing if needed.

krischer commented 8 years ago

Sorry, my comment was mistaken - I confused things. I just looked at the code and it copies all other groups before it starts to copy the actual data. The order of the other groups is non-deterministic, so the output might vary from run to run...

As soon as the progress bar is visible it starts copying actual data values. The number at the end should be an ETA - maybe it is not meant to measure things running longer than a day and it is missing the day specifier? I'll look into that when I polish the converter script into its final version once we choose which way to go.

I think all is good with your conversion - I'm just surprised it takes so long.

Yes, the continuous version is much faster as it does fairly optimal chunked I/O.

I converted a 1.1 TB file with the unrolled_merge version and unfortunately did not time it, but it was done over one weekend. I'll try to speed this up if we end up using that version, and I think it should be possible - right now it jumps around in the file, reading bits and pieces as it sees fit...

alexhutko commented 8 years ago

Ciao Lion,

I'm not sure what's going on, but maybe it's a bug worth reporting here... I tried repacking with --method=continuous on the ak135f_5s database and it barfs in the same place in two separate runs with py2 & py3. I had no problems repacking a smaller database.

Alex

[product1@dpstage REPACK_TEST]$ time /products/axisem/instaseis.beta/anaconda3/bin/python /products/axisem/instaseis.beta/instaseis-repo/instaseis/scripts/repack_instaseis_database.py --method=continuous ak135f_5s ak135f_5s_continuous_chad
--> Processing file 1 of 2: ak135f_5s/PX/Data/ordered_output.nc4
Copying group 'snapshot_times'...
Copying group 'components'...
Copying group 'Seismograms'...
Copying group 'snapshots'...
Copying group 'sim_timesteps'...
Copying group 'Mesh'...
Copying group 'seis_timesteps'...
Copying group 'Surface'...
Copying group 'gllpoints_all'...
Copying 'Snapshots/disp_s' (1 of 5)... [####################################] 100%
Copying 'Snapshots/disp_z' (2 of 5)... [####################################] 100%
Copying 'Snapshots/stf_dump' (3 of 5)...
Traceback (most recent call last):
  File "/products/axisem/instaseis.beta/instaseis-repo/instaseis/scripts/repack_instaseis_database.py", line 289, in <module>
    repack_database()
  File "/products/axisem/instaseis.beta/anaconda3/lib/python3.4/site-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "/products/axisem/instaseis.beta/anaconda3/lib/python3.4/site-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "/products/axisem/instaseis.beta/anaconda3/lib/python3.4/site-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/products/axisem/instaseis.beta/anaconda3/lib/python3.4/site-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "/products/axisem/instaseis.beta/instaseis-repo/instaseis/scripts/repack_instaseis_database.py", line 285, in repack_database
    process_file(filename, output_filename, method=method)
  File "/products/axisem/instaseis.beta/instaseis-repo/instaseis/scripts/repack_instaseis_database.py", line 60, in process_file
    group, shape=(ds_i.shape[1], ds_i.shape[0]),
IndexError: tuple index out of range
672.548u 93.885s 18:01.34 70.8% 0+0k 60528944+102307024io 99pf+0w

krischer commented 8 years ago

Hi Alex,

this appears to be an older version of the database files :-( I pushed an untested fix to this branch. Can you try again? If it does not work I'll try to get my hands on such a file.

The unrolled_merge variant currently also does not work for vertical only databases. I'll add that functionality if we choose to go down that road.

alexhutko commented 8 years ago

Hi Lion,

Your fix works for converting the older databases. However, the instaseis.benchmark fails on their untouched and "continuous" versions. It works fine for "unrolled" and "unrolled_merge".

Is there a possible fix for this?

thanks, Alex

[product1@dpstage REPACK_TEST]$ /home/product1/anaconda/bin/python -m instaseis.benchmark --pattern UnbufferedFullyRandom --time=30 prem_a_10s
/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/__init__.py:100: UserWarning: Please don't install from a tarball. Use the proper pypi release or install from git.
  "release or install from git.", UserWarning)

Instaseis Benchmark Suite

It enables to gauge the speed of Instaseis for a certain DB.

It does not deal with OS level caches! So interpret the results accordingly!

Traceback (most recent call last):
  File "/home/product1/anaconda/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/product1/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/benchmark/__main__.py", line 391, in <module>
    db = open_db(path, read_on_demand=True, buffer_size_in_mb=0)
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/__init__.py", line 92, in open_db
    return instaseis_db.InstaseisDB(path, *args, **kwargs)
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/instaseis_db.py", line 66, in __init__
    self._find_and_open_files()
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/instaseis_db.py", line 117, in _find_and_open_files
    self._parse_fs_meshes(netcdf_files)
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/instaseis_db.py", line 161, in _parse_fs_meshes
    read_on_demand=self.read_on_demand)
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/mesh.py", line 115, in __init__
    self._create_mesh_dict()
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/mesh.py", line 175, in _create_mesh_dict
    time_axis = get_time_axis(value)
  File "/products/SYNGINE_ALEX/INSTASEIS_Jan2016/instaseis-h5py-2/instaseis/mesh.py", line 161, in get_time_axis
    if ds.shape[0] == ds.shape[1]:
IndexError: tuple index out of range

chad-earthscope commented 8 years ago

Testing results with the ak135f_2s database on the IRIS beta system (same storage system as production). OS cache flushed prior to each run to avoid testing the buffer capability.

Test command: .../python -m instaseis.benchmark --pattern UnbufferedFullyRandom --time=20 <MODEL>

Production Instaseis: Instaseis 0.1.1, Anaconda 2.3.0, Python 3.4.3, numpy 1.9.3, netcdf4 1.1.9

New Instaseis: Instaseis PR 35, Anaconda 2.4.1, Python 3.4.4, numpy 1.10.1, h5py 2.5.0

ak135f_2s, 1.5 TiB, Production Instaseis with original DB:

    10.158 sec/seismogram
    0.098445 seismograms/sec

Not quite sure how to interpret the result; it certainly does not take 10 seconds to produce a seismogram in Syngine.

ak135f_2s, 1.5 TiB, New Instaseis with original DB:

    4.5054 sec/seismogram
    0.221956 seismograms/sec

ak135f_2s_continuous, 1.5 TiB, New Instaseis:

    0.660914 sec/seismogram
    1.51306 seismograms/sec

ak135f_2s_unrolled_merge, 2.3 TiB, New Instaseis:

    0.239049 sec/seismogram
    4.18324 seismograms/sec

In summary, the new software alone makes a nice difference. The migration to the continuous layout makes another nice performance jump without needing any more storage (for this DB at least). If this can be believed, we can get an order of magnitude increase without an increase in volume. The final test adds a large increase in volume with further increased performance.

I wish we could get support for both the continuous and unrolled_merge layouts; then we could use the latter for popular and high-priority models and the former for the rest. If I were forced to choose one, I would go with continuous.

krischer commented 8 years ago

Before I comment any further on this: can you run the benchmarks for a longer duration? You might just have been unlucky, run into some OS or I/O hiccup, and only gotten two samples. The --time parameter is the total approximate runtime in seconds. This particular benchmark tests the worst case with random accesses all over the database and it should give you the time for each 3-component seismogram - the server component would later add a bit more overhead.

Do the benchmarks now work for the continuous version?

chad-earthscope commented 8 years ago

Regarding the continuous version: I am using a different database than @alexhutko did. Apparently not all of the "old" databases are equal.

I ran the tests multiple times each. But to be sure, below are the tests for 300 seconds, i.e.: .../python -m instaseis.benchmark --pattern UnbufferedFullyRandom --time=300 <MODEL>

Same definitions of Production and New Instaseis as above.

ak135f_2s, 1.5 TiB, Production Instaseis with original DB:

    10.2389 sec/seismogram
    0.0976667 seismograms/sec

ak135f_2s, 1.5 TiB, New Instaseis with original DB:

    5.17566 sec/seismogram
    0.193212 seismograms/sec

ak135f_2s_continuous, 1.5 TiB, New Instaseis:

    0.647945 sec/seismogram
    1.54334 seismograms/sec

ak135f_2s_unrolled_merge, 2.3 TiB, New Instaseis:

    0.245925 sec/seismogram
    4.06629 seismograms/sec

Virtually the same as before.

As before, OS cache was cleared before each run using echo 3 > /proc/sys/vm/drop_caches.

alexhutko commented 8 years ago

Here’s my testing using the same test command and my own version of python and the Feb 3 version of instaseis PR35. I ran these on a slower disk storage system.

Unlike Chad's test with the TB-scale database, I find the continuous versions of the databases are larger, e.g. 95 vs. 122 GB for ak135f_5s. Clean OS cache vs. repeated tests only seemed to affect the very small 20s database.

Python 2.7.11 |Anaconda 2.1.0 (64-bit)| (default, Dec 6 2015, 18:08:32) , numpy 1.10.2, h5py 2.5.0

model                         size (GB)   seis/s (run 1)   seis/s (run 2)   seis/s (run 3)
prem_a_20s                    0.6         4.1              14               28
prem_a_20s_continuous         1.2         45               115              133
prem_a_20s_unrolled           1.8         23               68               92
prem_a_20s_unrolled_merge     1.8         36               62               76
prem_a_10s (5hr duration)     70          xx
prem_a_10s_continuous         80          xx
prem_a_10s_unrolled           123         3.2              3.4              3.6
prem_a_10s_unrolled_merge     123         3.6              4.1              3.9
ak135f_5s (1hr duration)      95          xx
ak135f_5s_continuous          122         xx
ak135f_5s_unrolled            189         5.1              5.7              5.1
ak135f_5s_unrolled_merge      189         7.2              5.6              6.8

Here are the models for which I’m able to run a benchmark test (and presumably convert to a new database format):

ak135f_2s    1.5 TB   generated by AxiSEM version SVN_VERSION at 2015-11-14T19:27:05.000000Z
prem_a_20s   0.6 GB   generated by AxiSEM version 60945ec at 2014-10-23T15:35:34.000000Z
prem_a_2s    1.5 TB   generated by AxiSEM version SVN_VERSION at 2015-12-10T18:42:23.000000Z
prem_i_2s    1.4 TB   generated by AxiSEM version SVN_VERSION at 2015-12-01T10:12:24.000000Z

Models that I can not benchmark, convert to continuous, or even instaseis.open_db using instaseis PR35. These can convert to unrolled & unrolled_merge.

ak135f_1s    1.4 TB   generated by AxiSEM version SVN_VERSION at 2015-11-19T09:34:56.000000Z
ak135f_5s    95 GB    generated by AxiSEM version v1.2-31-g69d4-dirty at 2015-11-18T12:09:17.000000Z
iasp91_2s    1.4 TB   generated by AxiSEM version NYR1.2 at 2015-03-18T10:03:05.000000Z
prem_a_10s   70 GB    generated by AxiSEM version v1.2-31-g69d4-dirty at 2015-11-12T09:36:02.000000Z
prem_a_5s    200 GB   generated by AxiSEM version v1.2-31-g69d4-dirty at 2015-11-16T22:36:06.000000Z

Is this a significant issue with respect to moving to one type of database, i.e. will the 'old' style databases need to be recomputed?

As things look right now, I think it may be in our best interest to go with lower-volume options and accept smaller speed increases rather than going whole hog. This will allow us to have more models. At this time, slow performance and too many users don't seem to be an issue. I'm happy to do further testing if there's a patch of some sort.

chad-earthscope commented 8 years ago

I reran my tests with ak135f_2s with a 1200 second duration with the same results. In short, those are the numbers; if something unwanted is biasing them, it is very, very consistent.

krischer commented 8 years ago

Hi guys,

thanks for the tests!

Not quite sure how to interpret the result, it certainly does not take 10 seconds to produce a seismogram in Syngine.

I'm fairly certain that the benchmarks measure the time for a single 3-component seismogram extraction, so I would be surprised if it is really wrong. Are you sure that Syngine is much faster? To actually test, you have to completely randomize the source and receiver geometry all the time, as the internal Instaseis buffers are really efficient in some scenarios.

I wrote a small script that tests the performance of syngine. You can find it here: https://gist.github.com/krischer/9d9a4bd078a8e64675f2

Results are measured from Munich but the script accounts for ping. Also, the results are almost identical if run from some server in New York. The times are TTFB, i.e. the time until the first byte of the response is received (with and without subtracting the ping). They seem quite consistent with your measurements. Do you have other measurements, and is there an obvious error in my script?
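
For reference, the core of such a TTFB measurement can be sketched as below (this is not the linked gist; the query parameters are written from memory and should be checked against the syngine documentation):

    import time
    import requests

    URL = ("http://service.iris.edu/irisws/syngine/1/query"
           "?model=ak135f_2s&format=miniseed"
           "&sourcelatitude=10&sourcelongitude=20&sourcedepthinmeters=20000"
           "&sourcemomenttensor=1e19,1e19,1e19,0,0,0"
           "&receiverlatitude=-30&receiverlongitude=50")

    t0 = time.time()
    r = requests.get(URL, stream=True)
    next(r.iter_content(chunk_size=1))   # block until the first response byte arrives
    print("TTFB: %.2f sec" % (time.time() - t0))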

Model  ak135f_2s         :  min/avg/max/stddev =  7.84/11.04/14.02/ 1.84 sec
  --> accounting for ping:  min/avg/max/stddev =  7.64/10.85/13.82/ 1.84 sec

Model  ak135f_5s         :  min/avg/max/stddev =  1.40/ 1.92/ 2.97/ 0.45 sec
  --> accounting for ping:  min/avg/max/stddev =  1.20/ 1.72/ 2.77/ 0.45 sec

Model  iasp91_2s         :  min/avg/max/stddev =  2.80/ 3.23/ 3.88/ 0.30 sec
  --> accounting for ping:  min/avg/max/stddev =  2.61/ 3.03/ 3.68/ 0.30 sec

Model prem_a_10s         :  min/avg/max/stddev =  1.71/ 1.89/ 2.11/ 0.12 sec
  --> accounting for ping:  min/avg/max/stddev =  1.51/ 1.69/ 1.91/ 0.12 sec

Model prem_a_20s         :  min/avg/max/stddev =  0.76/ 0.93/ 1.09/ 0.10 sec
  --> accounting for ping:  min/avg/max/stddev =  0.56/ 0.73/ 0.90/ 0.10 sec

Model  prem_a_2s         :  min/avg/max/stddev = 10.55/12.28/13.39/ 0.91 sec
  --> accounting for ping:  min/avg/max/stddev = 10.35/12.08/13.20/ 0.91 sec

Model  prem_a_5s         :  min/avg/max/stddev =  1.56/ 2.30/ 4.11/ 0.79 sec
  --> accounting for ping:  min/avg/max/stddev =  1.36/ 2.11/ 3.91/ 0.79 sec

Model  prem_i_2s         :  min/avg/max/stddev =  8.10/10.73/14.55/ 1.94 sec
  --> accounting for ping:  min/avg/max/stddev =  7.90/10.54/14.35/ 1.94 sec

Unlike Chad's test with the TB-scale database, I find the continuous versions of the databases are larger, e.g. 95 vs. 122 GB for ak135f_5s.

That should also happen for the larger databases. The difference in size between the current and the continuous layout is that the latter has no compression. I suspect that Chad's database is just not compressed in the first place.

i.e. will the ‘old’ style databases need to be recomputed?

No...Instaseis will definitely keep supporting the 'old' style databases.

As things look right now, I think it may be in our best interest to go with lower-volume options and accept smaller speed increases rather than going whole hog. This will allow us to have more models. At this time, slow performance and too many users don't seem to be an issue. I'm happy to do further testing if there's a patch of some sort. I wish we could get support for both the continuous and unrolled_merge layouts; then we could use the latter for popular and high-priority models and the former for the rest. If I were forced to choose one, I would go with continuous.

As written above, I think you just got an oddball of a database. The relative sizes should be about: old_style (100% size), continuous (125-140% size), and unrolled_merge (~200% size).

I personally would like to keep the old_style variant (we have to) and the unrolled_merge variant. We would thus have a small and slow version and a large and fast one. Also, for me there is not a big practical difference between 1 TB and 2 TB, but I understand that your requirements are different.

@sstaehler also needs the unrolled_merge variant.

I hope to have some time next week and will try to find a sane way to incorporate multiple database layouts in a maintainable way. If possible, I guess we can support both new variants if really necessary, but I would have to rely a bit on your testing. If not, we will have to discuss again.

alexhutko commented 8 years ago

@krischer, I ran some tests and the results are about the same as with your performance script, just a tad faster across the board, as expected for an internal user. I think the confusion arises from us saying Syngine is ~2 s/trace for the full-scale models. That's only one component and under our typical usage case where either the station or the source is repeated and thus cached/faster.

Here is a speed test using finite-fault models (FFMs), minus the overhead of source processing, and if you agree that these make sense, it's a strong case for supporting the continuous version of the databases. I'm guessing this is because for FFM traces, Instaseis is extracting from very similar locations in the database? Perhaps the totally randomized tests we've been running aren't the best route for thinking about which database format is superior, since most uses (at least for IRIS users) will be single source, multiple receivers, where it looks like continuous > unrolled_merge? I suppose the conversion to continuous being faster than unrolled_merge and the smaller volume are just icing on the cake...

Thoughts?

http://earthquake.usgs.gov/earthquakes/eventpage/usb000jyiv#scientific_finite-fault (about 200 subfaults)

model                        sec/trace (fresh run)   sec/trace (repeated)
ak135f_2s                    165                     145
ak135f_2s_continuous         10                      8
ak135f_2s_unrolled_merge     40                      29
alexhutko commented 8 years ago

I wasn't sure if the totally random or FFM testing was the best way to determine what would suit IRIS's needs, so I ran some tests that are event-based, using 15 random sources, each with 50 random receivers. I think this is a more typical use case than purely random src-rec pairs...

model                        time per 3 (ZNE) traces (s)   with resampling to 0.1 Hz (s)
ak135f_2s                    10                            12
ak135f_2s_continuous         1.1                           1.7
ak135f_2s_unrolled_merge     0.4                           0.9
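
For reference, a sketch of what such an event-based loop can look like with the documented Instaseis Python API - the model path, coordinate ranges, and moment tensor are illustrative, and this is not the exact script behind the numbers above:

    import numpy as np
    import instaseis

    db = instaseis.open_db("ak135f_2s")
    rng = np.random.RandomState(1234)

    for _ in range(15):                             # 15 random sources ...
        src = instaseis.Source(
            latitude=rng.uniform(-90, 90), longitude=rng.uniform(-180, 180),
            depth_in_m=rng.uniform(0, 600e3),
            m_rr=1e19, m_tt=1e19, m_pp=1e19, m_rt=0.0, m_rp=0.0, m_tp=0.0)
        for _ in range(50):                         # ... each with 50 random receivers
            rec = instaseis.Receiver(
                latitude=rng.uniform(-90, 90), longitude=rng.uniform(-180, 180),
                network="XX", station="TEST")
            st = db.get_seismograms(source=src, receiver=rec)  # ZNE by default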

I think from syngine's perspective, if both continuous and unrolled_merge were supported in instaseis, we'd go with the continuous option for now since the lower volume will allow us to hold more models.

Whichever route you choose, do you have a timeline estimate for when you'll feel comfortable releasing an update to Instaseis?

The increased performances are quite impressive.

krischer commented 8 years ago

I was planning on doing this tomorrow and, depending on how it goes, the day after. The current plan is to find some abstraction layer to support different database layouts. If that works (which I think it should) we can easily support a large number of different database layouts. I personally think that having fewer but faster databases has more merit than a larger number of slower ones, but if we support a number of different layouts you can mix and match as you please and adapt to possibly changing demands :-)

I wasn't sure if the totally random or FFM testing was the best way to determine what would suit IRIS's needs, so I ran some tests that are event-based, using 15 random sources, each with 50 random receivers. I think this is a more typical use case than purely random src-rec pairs...

I agree that a random test is a much better proxy. I'm also not sure why the FFM test results in a faster continuous layout - the unrolled_merge one should always be strictly faster. I suspect that maybe the buffers don't work with the unrolled_merge one - I'll check tomorrow.

In general I think the best test for performance is completely random access. As soon as a couple of users are accessing the service in parallel, that is the only metric that really counts.

BTW - this might come in handy for performance testing: the instaseis benchmarking tool is quite comprehensive and can simulate a number of different loads/usage patterns. That should probably be documented a bit better.

Emulate a Finite Source
python -m instaseis.benchmark --pattern FiniteSourceEmulation /path/to/db
Emulating a Source Inversion
python -m instaseis.benchmark --pattern BufferedHalfDegreeLatLngDepthScatter /path/to/db
Fully Random Access (Highest I/O Load)
python -m instaseis.benchmark --pattern UnbufferedFullyRandom /path/to/db
Perfectly Utilized Buffers (Least I/O Load)
python -m instaseis.benchmark --pattern BufferedFixedSrcRecRoDOffSeismogramGeneration /path/to/db

Essentially every class here is a benchmark: https://github.com/krischer/instaseis/blob/master/instaseis/benchmark/__main__.py

You could easily add your own.

krischer commented 8 years ago

Hi guys,

I'm very sorry for the delay - other more pressing issues demanded my attention. I'll do this ASAP (might take till the week after next week because we are organizing a workshop next week).

If somebody wants to give this a shot in the meantime:

We currently have a Mesh class that does not do a lot more than offering shortcuts to things in the netcdf files: https://github.com/krischer/instaseis/blob/master/instaseis/mesh.py

The core instaseis routines still directly access the netcdf files by using pointers/references on the mesh object: https://github.com/krischer/instaseis/blob/master/instaseis/instaseis_db.py

The idea is now to add methods to the Mesh class so that the core functions just call these. Thus, instead of, for example, directly accessing the corner points of the elements, they would call a method get_corner_points(). This allows us to create various specializations of the Mesh class, one for each database format. We could even have non-netCDF databases, but I guess there is no practical use for that as of now. I'm confident this will work and it grants us support for any number of current and future database layouts with reasonable development effort.
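
A rough sketch of the proposed abstraction (not actual Instaseis code; only get_corner_points() is named above, everything else is invented for illustration):

    class BaseMesh(object):
        """Interface the core routines talk to; one subclass per database layout."""

        def get_corner_points(self, element_id):
            raise NotImplementedError

        def get_element_data(self, element_id):   # name invented for illustration
            raise NotImplementedError

    class ChunkedNetCDFMesh(BaseMesh):
        """Current layout: data gathered with many small per-GLL-point reads."""

        def get_element_data(self, element_id):
            pass  # loop over the individual GLL point time series as done today

    class UnrolledMergedMesh(BaseMesh):
        """Unrolled layout: one contiguous read per element."""

        def get_element_data(self, element_id):
            pass  # a single slice into the merged 5D array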

krischer commented 8 years ago

Continued in #42.