niosh-mining / obsplus

A Pandas-Centric ObsPy Expansion Pack
GNU Lesser General Public License v3.0
38 stars 8 forks source link

WaveBank doesn't update when new files are added to the directory. #275

Open lvanderlaat opened 7 months ago

lvanderlaat commented 7 months ago

Description

After I add new mseed files to the base_path, a new run of my program won't recognize the new data.

To Reproduce

My code:

client = WaveBank(base_path).update_index()

print(client.get_availability_df())

I run it once, and it scans perfectly.

If I add data to the folder, the second time I run it, it won't show the new data.

Expected behavior

The output dataframe should show the new data.

Versions (please complete the following information):

d-chambers commented 7 months ago

Hi @lvanderlaat,

Yes there is a subtly here. For a new file to get indexed it doesn't just have to be moved into the wavebank directory, but its mtime has to be greater than the timestamp stored in the index of the last indexing time.

Currently the implementation is rather simple; we scan through all the files in the directory and just index the ones which have mtime greater than the last time the index operation was run. If we wanted to be sure to include new files in the directory, regardless of any timestamps in the file, we would have to load an array of the entire bank contents into memory then compare each file to determine if it was already in the index or not. In addition to increased complexity, I suspect doing it this way would slow down indexing considerably.

I suppose rather than mtime, which does not reset when you move a file, we could look using ctime which does. Honestly, I would need to read up on the differences and think through it more before deciding that is the right path. For example, if you moved some files around inside the bank they would then get re-added to the index, which may not be desirable.

So, if you are systematically moving files into your bank, just make sure to reset the mtime. You can use path.Path.touch for this. For example:

import shutil
from pathlib import Path

path1 = Path("...")  # path outside the bank directory
path2 = Path("...")  # path inside the bank directory

shutil.move(path1, path2)
path2.touch()

Give that a try and let me know if they files are now added to your bank.

shawnboltz commented 7 months ago

@d-chambers

I suppose rather than mtime, which does not reset when you move a file, we could look using ctime which does. Honestly, I would need to read up on the differences and think through it more before deciding that is the right path. For example, if you moved some files around inside the bank they would then get re-added to the index, which may not be desirable.

Two questions:

  1. If you move the files around within the bank, wouldn't it then be desirable to delete the old entry and add the new one to the index? Unless you "move" them to the same spot, the bank will no longer be able to find them.
  2. Would a longer-term strategy be to both check the ctime (assuming it behaves how we think) and then also check if the data are already in the index and revise it accordingly? I have actually come across the above scenario where I updated some files (I can't remember the exact context, but the files covered the same seed id(s) and time windows, but there was something different about the waveform data itself) and it did end up duplicating them in the index.
d-chambers commented 7 months ago

If you move the files around within the bank, wouldn't it then be desirable to delete the old entry and add the new one to the index? Unless you "move" them to the same spot, the bank will no longer be able to find them.

Yes, good point. If you don't update it will no longer have the correct path so the bank wont work. Maybe we do just want to use the ctime then?

Would a longer-term strategy be to both check the ctime (assuming it behaves how we think) and then also check if the data are already in the index and revise it accordingly?

By check if it is already in the bank you mean to just look at the path right? It would get much more complicated if we wanted to check the contents, or the file hash, or something like that to see if two files are the same (or nearly so). I think it would be a pain to do either way.

Maybe a pragmatic way to address the duplicate issue, if we don't do this already, would be to just drop duplicates from the contents dataframe before determining which files to load. That way each file can only get loaded once so there really wouldn't be a performance hit for having duplicates in the index. As long as the creation of duplicates are relatively rare, I don't think it would be a big problem.

shawnboltz commented 7 months ago

By check if it is already in the bank you mean to just look at the path right? It would get much more complicated if we wanted to check the contents, or the file hash, or something like that to see if two files are the same (or nearly so). I think it would be a pain to do either way.

I was just thinking to query the index for any entries that have the same contents (minus path) as the new data and then update the path for the entry if it's already in the index. I don't know how the logic for writing to the index works, though, so it could be a bigger PITA than I'm imagining.

Maybe a pragmatic way to address the duplicate issue, if we don't do this already, would be to just drop duplicates from the contents dataframe before determining which files to load. That way each file can only get loaded once so there really wouldn't be a performance hit for having duplicates in the index. As long as the creation of duplicates are relatively rare, I don't think it would be a big problem.

Maybe. In the case of a file that got moved, how would we know which of the duplicates to keep when loading the data?

d-chambers commented 7 months ago

Maybe. In the case of a file that got moved, how would we know which of the duplicates to keep when loading the data?

Wouldn't it work if we always keep the files with the latest ctimes?

d-chambers commented 7 months ago

Thinking more about it, would we want the bank to completely re-index if we move the bank folder? If we used ctime that would be a likely side effect.

d-chambers commented 7 months ago

Hi @lvanderlaat, a quick follow up: did the provided work-around work for you?

d-chambers commented 7 months ago

@shawnboltz and I had a discussion and we think changing ObsPlus to use ctimes rather than mtimes is the right path forward for less surprising behavior. We will open a PR in the (hopefully) near future.

dcroman commented 1 month ago

I am also running into this issue and either the solution above doesn't work or I am misunderstanding what should be in path1 and what should be in path2. Right now I am trying to add new files available in path1 to the existing wavebank of files in path2. Any guidance?

(I found a way around it for now, which is to delete the hidden index file in path2).

shawnboltz commented 1 month ago

Hi @dcroman,

In your scenario, is path2 the path of the bank or the path of a specific new file? If it's the path of the bank (or, more generally, a directory), then you need to touch each individual file within it that needs to be updated. Here's a simple example that I just tested:

import shutil
from pathlib import Path

# As some setup that I just did manually, I created two folders: "folder1" and "folder2" and created a handful 
#  of empty text files in "folder1"

path1 = Path("folder1")
path2 = Path("folder2")

shitul.copytree(path1. path2, dirs_exist_ok=True)

# Note that this will touch -all- files in the directory, not just the new ones. You'll want to be more specific or 
#  you'll effectively re-index everything and possibly end up with duplicate entries
for p in path2.rglob("*"): 
    p.touch()

As a worst-case scenario, it also works to delete the .index.db (EventBank) or .index.h5 (WaveBank) file and just force it to re-index everything. That's not ideal if it's a big bank or something you're going to do repeatedly, however.