
Increase in file size after instrument correction. #353

Closed trac2github closed 12 years ago

trac2github commented 12 years ago

Hi, I have used ObsPy to correct some data. I get the data as a MiniSEED file through ArcLink with a filesize of 1.8M. After merging the data and doing the correction, then decimating by a factor of 2 and filtering above 1 Hz, the filesize goes up to about 7.5M. Does anyone know why this is? Thanks, David
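
(For orientation, the described chain corresponds to roughly the following ObsPy calls. This is a sketch only: the file names are hypothetical and the instrument-correction step is left as a placeholder, since the actual script is attached further down.)

{{{
#!python
from obspy.core import read

st = read("raw_from_arclink.mseed")  # hypothetical name; ~1.8M MiniSEED from ArcLink
st.merge(method=1)                   # merge the gappy traces, one per channel
# st.simulate(paz_remove=paz)        # instrument correction (response "paz" not shown)
st.decimate(factor=2)                # decimate by a factor of 2
st.filter("highpass", freq=1.0)      # keep frequencies above 1 Hz
st.write("corrected.mseed", format="MSEED")  # this comes out around 7.5M
}}}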

trac2github commented 12 years ago

[megies] During instrument correction the data type of the numpy array is converted to float. You can check these things e.g. like this:

{{{
#!python
from obspy.core import read

st = read()
tr = st[0]
print(tr.data.dtype)     # show data type
print(tr.data.itemsize)  # show number of bytes taken in memory by a single sample
}}}

trac2github commented 12 years ago

[dcdavemail@gmail.com] Hi Megies, both the raw data and the corrected data are of type float64 and take 8 bytes for a single sample.

trac2github commented 12 years ago

[megies] I cannot think of any other reason. Sorry, I cannot help you without a reproducible example.

trac2github commented 12 years ago

[anonymous] I attached the script I used; if you could try it, I'd appreciate it.

trac2github commented 12 years ago

[megies] I had to adapt some filenames/paths and am running into IOErrors with non-existing files. Please submit a script with one specific case without any loops; I do not have the time to play around with complicated programs right now.

trac2github commented 12 years ago

[anonymous] Hi again, seems I was mistaken: the raw data is float32 and the corrected data is float64. The increase in file size still seems a bit excessive, though. If you're too busy it can wait, as there is not much I can do about it today anyway. Thanks, David

trac2github commented 12 years ago

[megies] Ok, I have had a look at it. However, I cannot see any problem here (actually, the raw data is int32). Correct me if I'm wrong, but MiniSEED uses totally different encoding/compression algorithms for integer and float data types, so I guess it is not at all surprising to see varying compression efficiency.

I think you just have to live with this.

best, Tobias

trac2github commented 12 years ago

[anonymous] ok thanks

trac2github commented 12 years ago

[anonymous] What happens if you just write it back to int32 by appending this to your script:

{{{
#!python
import numpy as np

for tr in st:
    tr.data = np.require(tr.data, np.int32)
st.write('int32.mseed', 'MSEED')
}}}

If the size of the file is still significantly larger, then it's not just the conversion of int32 to float64 ...

trac2github commented 12 years ago

[krischer] Hello.

You do not really want to do that, because any filtered/corrected/... data will most likely not be whole integers anymore, so you would corrupt your data.

I did not check, but should the incoming data not be int32?

And yes, MiniSEED does not pack float data at all and just writes the raw binary numbers to the file. Integer numbers, on the other hand, are (for "quiet" data at least) packed quite efficiently, with a best-case compression ratio of, I believe, almost 1:7 for STEIM2 compression. So a filesize increase of 200% (you have 400% because you store the data as float64) is still quite within bounds.

The filesize also depends, although to a lesser degree, on the record length. Larger record lengths will result in a smaller file because the header is written less often.
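
Both effects are easy to check with ObsPy's bundled example data; a rough sketch (file names arbitrary, exact sizes will vary):

{{{
#!python
import os
from obspy.core import read

st = read()  # bundled three-channel example stream
for tr in st:
    tr.data = tr.data.astype("int32")
st.write("steim2.mseed", format="MSEED", encoding="STEIM2")    # packed integers

for tr in st:
    tr.data = tr.data.astype("float64")
st.write("float64.mseed", format="MSEED", encoding="FLOAT64")  # raw 8-byte samples
st.write("float64_4096.mseed", format="MSEED", encoding="FLOAT64",
         reclen=4096)                                          # fewer headers

for name in ("steim2.mseed", "float64.mseed", "float64_4096.mseed"):
    print("%s: %d bytes" % (name, os.path.getsize(name)))
}}}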

In your case I would just convert the data to float32 and store it with encoding 4 (float32).
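
In code, the suggestion could look like this (a sketch; assumes `st` holds the corrected Stream):

{{{
#!python
import numpy as np

for tr in st:
    tr.data = np.require(tr.data, np.float32)  # halve the bytes per sample
st.write("corrected_float32.mseed", format="MSEED", encoding=4)  # encoding 4 = FLOAT32
}}}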

Best wishes,

Lion

trac2github commented 12 years ago

[anonymous] Lion,

I know ;) I just wanted to figure out whether storing the data again in int32 (which was the original data type) results in significantly larger files. If it does, then it's not the int/float conversion that increases the file size; instead, something different is going on, e.g. a change of sampling rate etc.

Robert

trac2github commented 12 years ago

[krischer] Heyhey,

ah ok. I didn't think of that possibility.

There are also some other things that could happen, for example if there is a significant gap between the traces: merging them with the interpolate option would actually create new data.
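
That effect can be demonstrated with an artificial gap (a sketch based on the bundled example data):

{{{
#!python
from obspy.core import read, Stream

tr = read()[0]
t = tr.stats.starttime
# two pieces of the same trace with a 5 s gap in between
st = Stream([tr.slice(t, t + 5), tr.slice(t + 10, t + 20)])
print("%d samples before merging" % sum(len(x) for x in st))
st.merge(method=1, fill_value="interpolate")  # the gap is filled with new samples
print("%d samples after merging" % sum(len(x) for x in st))
}}}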

I just tried the included example and it all seems fine to me; I don't believe we have an issue here. The data is actually decimated by a factor of two before being corrected, but it is still stored as uncompressed float64.

So the filesize increase is due to two/three factors:

* the conversion from int32 (4 bytes per sample) to float64 (8 bytes per sample),
* float data being written uncompressed, so the STEIM2 compression of the original integer data is lost, and
* to a lesser degree, the record length and any new samples created by gap filling during merging.

Best wishes,

Lion

trac2github commented 12 years ago

[megies] I think we can close this.

trac2github commented 12 years ago

[anonymous] Hi again, sorry I didn't get back sooner, but it's been a busy week. The data gaps are very small, and storing the data as int32 reduces the filesize to 600K, so I think Lion's summary is correct. Thanks to everyone who commented :) D

trac2github commented 12 years ago

[megies] No problem. We're happy if ObsPy keeps being useful for you. Keep in mind, if you have any processing routines that could be useful to others, drop us a line.

best, Tobias