Open shawnboltz opened 1 year ago
Yeah, I have seen this before. It raises the question: why is the mtime changing in the first place, and would avoiding that be easier than implementing a fix?
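One way to sidestep mtime churn entirely (a sketch, not anything obsplus currently does) would be to key re-indexing on a content hash rather than mtime alone. The demo below bumps a temp file's mtime without touching its bytes, which is exactly the situation described in this issue:

```python
import hashlib
import os
import tempfile

def content_digest(path):
    """Hash a file's bytes so identical content can be recognized even
    after the mtime changes (e.g. from a copy, touch, or re-save)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: bump the mtime without changing the bytes.
fd, path = tempfile.mkstemp(suffix=".mseed")
with os.fdopen(fd, "wb") as f:
    f.write(b"identical waveform bytes")
before_mtime = os.path.getmtime(path)
before_digest = content_digest(path)
os.utime(path, (before_mtime + 10, before_mtime + 10))  # mtime only
mtime_changed = os.path.getmtime(path) != before_mtime
content_changed = content_digest(path) != before_digest
os.remove(path)
```

With an mtime-only check the file above would be re-indexed; a digest check would correctly skip it. The cost is reading every file's bytes, which may not be acceptable for a large archive.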
One advantage of the bank's design is that we can update/query the index without having to load all of its contents into memory. Checking for duplicates, on the other hand, would require loading the entire index into memory before each update, which could be fairly costly for a large index. I'm not sure how much it would slow down file indexing, but at least somewhat.
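The tradeoff above can be illustrated with a toy on-disk index (sqlite here purely for illustration; it is not the bank's actual backend or schema). A targeted query only materializes matching rows, but a duplicate check has to consider every row in the table:

```python
import sqlite3

# Toy stand-in for an on-disk index; columns are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE idx (station TEXT, channel TEXT, starttime REAL)")
rows = [("sta1", "ELZ", 0.0), ("sta1", "ELZ", 0.0), ("sta2", "ELZ", 0.0)]
con.executemany("INSERT INTO idx VALUES (?, ?, ?)", rows)

# Targeted query: only matching rows are pulled back.
sub = con.execute("SELECT * FROM idx WHERE station = 'sta1'").fetchall()

# Duplicate check: every row must be scanned and compared.
(dupes,) = con.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT station || channel || starttime)"
    " FROM idx"
).fetchone()
```

Here `dupes` is 1 (the repeated `sta1`/`ELZ` row), and finding it required touching the whole table, which is the cost being weighed against the current fast update path.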
In this case I wanted the existing files to be overwritten because there was a change in how obspy reads/writes Kinemetrics EVT files, and I wanted to make sure that any data in the bank originating from an EVT file was consistent (same number of samples/time window; the difference was whether the data were stored with or without the instrument response removed).
An alternative would have been to just delete and recreate the index, but the index is huge and rebuilding it would have taken 24+ hours (though I will probably end up doing that now anyway).
I know you don't really like adding extra kwargs, but maybe the duplicate check could be an optional step for cases where we know it will be necessary?
Description

This is pretty low priority, but I can see it having some obscure effects on, e.g., identifying gaps/availability in a bank.
When a waveform file gets updated (in this specific case, everything about the file was identical; only the modified time changed), a duplicate entry gets added to the WaveBank index.
To Reproduce

Admittedly I haven't deliberately reproduced this, but I'm fairly certain this is what happened:
```python
df = bank.read_index(station="sta1", channel="ELZ", location="02")
assert not df.duplicated().any()

# ... the mtime of an already-indexed file changes, then ...
bank.update_index()

df = bank.read_index(station="sta1", channel="ELZ", location="02")
assert df.duplicated().any()
```
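Until the index itself dedupes, the duplicates can be dropped client-side with pandas after reading. A minimal sketch (the frame below just mimics a duplicated index result; the columns are illustrative, not the exact WaveBank schema):

```python
import pandas as pd

# Stand-in for a read_index() result containing a duplicated row.
df = pd.DataFrame({
    "station": ["sta1", "sta1"],
    "channel": ["ELZ", "ELZ"],
    "location": ["02", "02"],
    "starttime": [0.0, 0.0],
})
assert df.duplicated().any()

# Drop exact-duplicate rows before using the index downstream.
clean = df.drop_duplicates()
assert not clean.duplicated().any()
```

This only papers over the symptom for queries like gap/availability checks; the stale rows remain in the on-disk index.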