radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Fix #9: running `import_files` from any subdirectory of Alpenhorn root #100

cubranic closed 5 years ago

cubranic commented 5 years ago

client.import_files has a number of built-in assumptions left over from the old world of flat acquisitions, so it would not work properly when run from any subdirectory of an arbitrarily nested structure (even subdirectories of an acquisition).

This change implements the "natural" extension of the original import_files design to the Alpenhorn v2 directory structure. The starting directory can be inside an acquisition or a parent directory of several acquisitions; either way, only files found by os.walk from that starting point are considered for import.

If the --acq option is given, the walk is further restricted to files that would belong to those acquisitions.
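
A minimal sketch of the walk described above, not the code in this PR. The function name `candidate_files`, its parameters, and the prefix-based acquisition filter are all assumptions made for illustration; acquisition names are taken to be paths relative to the node root.

```python
import os
import tempfile


def candidate_files(start_dir, node_root, acqs=None):
    """Yield paths, relative to node_root, of files found under start_dir.

    Hypothetical helper. If acqs is given, keep only files that sit inside
    one of those acquisition directories (names relative to node_root).
    """
    node_root = os.path.abspath(node_root)
    start_dir = os.path.abspath(start_dir)
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), node_root)
            if acqs is None or any(
                rel.startswith(acq.rstrip(os.sep) + os.sep) for acq in acqs
            ):
                yield rel


# Build a throwaway tree with two acquisitions and walk it two ways.
with tempfile.TemporaryDirectory() as root:
    for rel in ("acq_a/x/1.dat", "acq_a/y/2.dat", "acq_b/3.dat"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # Starting inside an acquisition only sees files below that point.
    from_subdir = sorted(candidate_files(os.path.join(root, "acq_a", "x"), root))
    # Starting at the root with an acquisition filter.
    only_b = sorted(candidate_files(root, root, acqs=["acq_b"]))
```

Keeping every path relative to the node root means the same file gets the same name whichever subdirectory the walk starts from.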

jrs65 commented 5 years ago

I must admit, I'm struggling to follow the internal logic of how this works now; it's started to get complicated. It seems like it would be easier to:

  1. Identify which node we are in.
  2. Build the list of paths to all files within the node by crawling the FS from the root (call this A).
  3. Build the list of all files from the DB (call this B).
  4. Build the list of all files on this node from the DB (call this C).
  5. Register all files in A ∩ (B \ C).

jrs65 commented 5 years ago

Okay. I just realised that you only want to import things within your subdirectory. I think the above still works; you just need to do the FS crawl from that directory.
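
The five steps, with the crawl started from the current subdirectory, can be sketched roughly as follows. This is an illustration under assumptions, not Alpenhorn code: `files_under` and `files_to_register` are hypothetical names, and plain Python sets stand in for the two database queries (B and C).

```python
import os
import tempfile


def files_under(start_dir, node_root):
    """Step 2 (set A): files found by crawling from start_dir, as paths
    relative to node_root. Hypothetical helper."""
    return {
        os.path.relpath(os.path.join(dirpath, name), node_root)
        for dirpath, _dirnames, filenames in os.walk(start_dir)
        for name in filenames
    }


def files_to_register(on_disk, known_files, known_on_node):
    """Step 5: files on disk that the DB knows about (B) but does not yet
    record as present on this node (C), i.e. A & (B - C)."""
    return on_disk & (known_files - known_on_node)


with tempfile.TemporaryDirectory() as root:
    for rel in ("acq/1.dat", "acq/2.dat", "acq/notes.txt"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    a = files_under(root, root)
    # Stand-ins for the DB queries in steps 3 and 4 (assumed data).
    b = {
        os.path.join("acq", "1.dat"),
        os.path.join("acq", "2.dat"),
        os.path.join("other", "9.dat"),
    }
    c = {os.path.join("acq", "1.dat")}

    to_register = files_to_register(a, b, c)
```

Here `acq/1.dat` is skipped because it is already recorded on the node, `acq/notes.txt` because the DB does not know it, and `other/9.dat` because it is not on disk.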

cubranic commented 5 years ago

I got back to this on my flight from Montreal today, and rewrote the import along the lines you suggested. It's also much simpler now. I'm not sure why it was so convoluted before, TBH. But it does seem to change some of the behaviour of alpenhorn1, so I'd like to walk through it with you as a check.

jrs65 commented 5 years ago

Yeah, this is great. The logic is now really easy to follow through, even more so than I thought it would be. Thanks!

cubranic commented 5 years ago

One thing that I'm not sure about is why the old import_files skipped any directory that wasn't an already-known acquisition. From alpenhorn1:

            try:
                di.parse_acq_name(acq_name)
            except di.Validation:
                not_acqs.append(acq_name)
                continue

            try:
                acq = di.ArchiveAcq.select().where(di.ArchiveAcq.name == acq_name).get()
            except pw.DoesNotExist:
                not_acqs.append(acq_name)
                continue

This latest change doesn't do that: if a directory passes the "ArchiveAcq.detect" check, it is registered in the DB and all valid files within it are imported as well. Is that OK?
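
To make the difference concrete, here is a toy sketch of the two policies. Everything in it is hypothetical: `acquisitions_to_import`, the `detect` callable, the `known_acqs` set, and the `register_new` flag are stand-ins and do not reflect the actual Alpenhorn API.

```python
def acquisitions_to_import(dirs, known_acqs, detect, register_new):
    """Split directory names into (accepted, skipped).

    register_new=False mimics the alpenhorn1 excerpt: a directory must
    both look like an acquisition and already exist in the DB.
    register_new=True mimics this change: looking like an acquisition
    is enough, and unknown ones get registered.
    """
    accepted, skipped = [], []
    for name in dirs:
        if not detect(name):
            skipped.append(name)  # does not look like an acquisition
        elif name in known_acqs or register_new:
            accepted.append(name)
        else:
            skipped.append(name)  # valid name, but not in the DB yet
    return accepted, skipped


def looks_like_acq(name):
    # Toy detection rule, assumed for the example only.
    return name.startswith("acq_")


dirs = ["acq_old", "acq_new", "scratch"]
known = {"acq_old"}

old_style = acquisitions_to_import(dirs, known, looks_like_acq, register_new=False)
new_style = acquisitions_to_import(dirs, known, looks_like_acq, register_new=True)
```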

cubranic commented 5 years ago

I'm going to open a separate PR for optionally registering new acquisitions.