niosh-mining / obsplus

A Pandas-Centric ObsPy Expansion Pack

WaveBank duplicates entries in index if file mtime is updated #268

Open shawnboltz opened 1 year ago

shawnboltz commented 1 year ago

**Description**
This is pretty low priority, but I can see it having some obscure effects on, e.g., identifying gaps/availability in a bank.

When a waveform file gets updated (in this specific case, everything about the files was identical; only the modified time changed), a duplicate entry gets put into the WaveBank index.

**To Reproduce**
I admittedly haven't deliberately reproduced this, but I'm fairly certain this is what happened:

1. A bank got created for a directory of files:

       import obsplus

       bank = obsplus.WaveBank("path/to/bank")
       bank.update_index()

       df = bank.read_index(station="sta1", channel="ELZ", location="02")
       assert not df.duplicated().any()

2. A set of files got copied into the bank, some of which were already in the bank (the originals were overwritten).

3. The index was updated and duplicate entries got created:

       bank.update_index()

       df = bank.read_index(station="sta1", channel="ELZ", location="02")
       assert df.duplicated().any()



**Expected behavior**
The duplicate entries shouldn't get created if all else is equal about the files. (This raises a bigger issue about what happens if the file's contents change substantially, but that's a problem for another day.)

I'm not entirely sure whether this would work with the structure of update_index, but simply dropping duplicate rows from the index should resolve the issue.
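For what it's worth, here's a rough sketch of the kind of deduplication I have in mind, applied to the DataFrame returned by read_index rather than inside update_index itself. The column names in the subset are assumptions about what the index contains, so adjust as needed:

    import obsplus

    bank = obsplus.WaveBank("path/to/bank")
    df = bank.read_index(station="sta1", channel="ELZ", location="02")

    # Treat rows as duplicates when everything identifying the data is equal,
    # ignoring file metadata such as mtime. These column names are assumptions.
    id_cols = ["network", "station", "location", "channel", "starttime", "endtime"]
    subset = [c for c in id_cols if c in df.columns]

    deduped = df.drop_duplicates(subset=subset)
    assert not deduped.duplicated(subset=subset).any()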

**Versions (please complete the following information):**
 - OS: Ubuntu 22.04
 - ObsPlus Version: 0.2.3.dev10+g8e231f4
 - Python Version: 3.8.12
d-chambers commented 1 year ago

Ya, I have seen this before. I guess it raises the question: why is the mtime changing in the first place, and is that easier to avoid than implementing a solution?

One advantage of the bank is that we can update/query the index without having to load all the contents into memory. If we did check for duplicates, we would have to load the entire index into memory before each update (which could be kinda costly if the index is large). I am not sure how much slower that would make file indexing, but at least a bit.
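If avoiding the mtime change is an option, something like this sketch might be all that's needed: copy the files into the bank while preserving their timestamps so update_index doesn't see them as modified. The paths and file pattern here are made up for illustration:

    import shutil
    from pathlib import Path

    src_dir = Path("path/to/incoming")    # hypothetical source directory
    bank_dir = Path("path/to/bank")

    for src in src_dir.rglob("*.mseed"):  # adjust the pattern to your files
        dest = bank_dir / src.relative_to(src_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 preserves mtime; plain copy does not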

shawnboltz commented 1 year ago

In this case I wanted to make sure the existing files were overwritten: there was a change in how obspy reads/writes Kinemetrics EVT files, and I wanted any data in the bank that originated from an EVT file to be consistent (the number of samples and time window stayed the same; it was a matter of storing the data with or without the instrument response removed).

An alternative could have been to just delete and recreate the index, but the index is huge and that would have taken 24+ hours (though I will probably end up doing that now anyway).
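If I do go that route, I'm assuming it amounts to something like the sketch below. The index_path attribute and the idea that the index lives in a single file are my assumptions about how the bank stores things, so worth double-checking before deleting anything:

    import obsplus

    bank = obsplus.WaveBank("path/to/bank")

    # Assumption: the bank keeps its index in one file exposed via bank.index_path;
    # removing it should force update_index() to re-scan every file.
    index_file = getattr(bank, "index_path", None)
    if index_file is not None and index_file.exists():
        index_file.unlink()

    bank.update_index()  # full re-index; slow for a huge bank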

I know you don't really like adding additional kwargs, but maybe it could be an optional step for cases where we know it will be necessary?
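Something along these lines is what I'm picturing; the kwarg name is purely hypothetical and doesn't exist in obsplus today:

    # Hypothetical kwarg, shown only to illustrate the proposal.
    bank.update_index()
    df = bank.read_index(station="sta1", channel="ELZ", location="02",
                         drop_duplicates=True)  # opt-in deduplication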