irods / irods_capability_automated_ingest

Other
12 stars 15 forks source link

Restore event handler interface, fix log formatting technique #236

Closed alanking closed 3 weeks ago

alanking commented 1 month ago

Addresses #219 Addresses #222

Tests are passing. Need to test with S3 scanner as well since there are no S3 tests in the regular suite yet.

trel commented 1 month ago

if we're using meta for 'scanner'... do we want to use that name everywhere for consistency?

or do we need a better name that encapsulates its various (currently appears to be 'two' different) uses?

alanking commented 1 month ago

if we're using meta for 'scanner'... do we want to use that name everywhere for consistency?

or do we need a better name that encapsulates its various (currently appears to be 'two' different) uses?

The name of the variable does not need to match the name of the key in the meta dict here, but I suppose I agree that it would make things more consistent. I was just matching what was already there in order to minimize the diff, but can update it. What are the "two" uses?

trel commented 1 month ago

the two i see are 'scanner' and 'syncer'. is there a better 'composite' word? or is 'scanner' sufficient?

alanking commented 1 month ago

I see. The only thing I would be concerned about at this point is the key being used in the meta map because that could be accessed by a custom event handler - although that seems extremely inadvisable :) - and changing that in the future would be another change to the interface. The name of the variable is less important because it is internal to each function (or at least it is now).

The term "scanner" was chosen to match the name of the class, and the name of the class was chosen, IIRC, to reflect the fact that the class is responsible for implementing a way of walking or "scanning" different storage systems. This is in reference to the filesystem scanner use case as well, but there are of course other use cases that we advertise (namely, landing zone).

I think "syncer" was used due to the filesystem "sync" referenced throughout this codebase and in our training. We have operations that are not considered "sync"s (e.g. PUT_APPEND) so not everything being done is a sync. So, I would be more in favor of changing "syncer" to "scanner", but I'm not attached to that term, really. Open to suggestions.

trel commented 1 month ago

So, I would be more in favor of changing "syncer" to "scanner"

I think that is correct and good. Let's make that change and see how it looks/feels.

alanking commented 1 month ago

In order to avoid ambiguity (i.e. syntax errors, and... confusion) can we differentiate the scanner class and instantiations by name?

Here's what can happen if we name the instance the same thing as the class:

>>> class scanner:
...     def __init__(self, x):
...             self.x = x
...     def foo(self):
...             self.x = self.x + 1
... 
>>> scanner = scanner(1)
>>> scanner.foo()
>>> scanner.x
2
>>> scanner.foo()
>>> scanner.x
3
>>> scanner2 = scanner(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'scanner' object is not callable

Once the scanner is instantiated as scanner, you cannot instantiate any more scanners because the class scanner has been shadowed by the instance named scanner.

With that in mind, here are some suggestions...

alanking commented 1 month ago

We could also consider renaming the class, but I think we can / should do that later...

trel commented 1 month ago

storage_scanner seems better than path_scanner b/c S3 isn't 'paths' officially.

walker and crawler are nice... the scanner is the technology/approach, and the walker/crawler is the thing doing the work.

trel commented 1 month ago

although upon 30 more seconds of thought... i like scanner_instance.

that way we don't invent new language/nouns... but avoid the collision.

trel commented 1 month ago

looking real good.

korydraughn commented 1 month ago

Agreed. scanner_instance is a good choice.

alanking commented 3 weeks ago

Confirmed that a basic S3 bucket scan worked with these changes.

alanking commented 3 weeks ago

I like the current commits, actually, so just added #s. Mergin