sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

Watch ABBYY output folder to determine when OCR is complete #1191

Closed peetucket closed 3 months ago

peetucket commented 4 months ago

part of https://github.com/sul-dlss/common-accessioning/issues/1166 (ocrWF:ocr-create)

We need a process that is continually running that monitors the ABBYY folder to determine when an OCR is complete for an object so that the workflow step can be updated appropriately.

thatbudakguy commented 3 months ago

a quick survey of available tech for this was not super encouraging.

my goal was to see if there are any system-level tools for checking this sort of thing without polling, so that we wouldn't have the overhead of lots of processes continually running checking to see if something in the folder changed.

it seems that the way this is normally done on linux systems (presuming that's what common-accessioning is) is through inotify, which most of the GUI linux filemanagers (GNOME, etc.) use under the hood.

there is a lot of discussion out there about how inotify doesn't play nice with network filesystems, including samba. this transcript of a linux foundation meeting is basically lamenting how this is still the case in 2022, where OSX and windows systems can receive change notifications about network filesystems but linux is still behind. see also this thread on the docker forums about how monitoring filesystems from docker containers suffers the same problem for the same reason.

systemd provides the path unit, which you can use to set up all kinds of automatic notifications about filesystems as a background process. unfortunately, it uses inotify too:

Internally, path units use the inotify(7) API to monitor file systems. Due to that, it suffers by the same limitations as inotify, and for example cannot be used to monitor files or directories changed by other machines on remote NFS file systems.

in ruby-land, the listen gem looks like a reasonably well-supported way to do this and provides a nice API, but it too warns:

Some filesystems won't work without polling (VM/Vagrant Shared folders, NFS, Samba, sshfs, etc.).

that said, changing to use polling is just a single option in its API (force_polling: true), so at least it's not hard to turn on if we have to.

thatbudakguy commented 3 months ago

the samba docs make it sound like change notifications are possible, but I don't totally understand it:

File Alteration Monitor, or FAM. An SMB client subscribes to file change notifications on the server, which allows it to know about file modifications. Done by default by windows clients. Samba server uses inotify on linux to implement this feature, it is fully supported, controlled by kernel change notify smb.conf parameter (enabled by default).

This is where interoperability context comes into play: without extra steps, a unix process does not know about modifications done to files by other unix processes. While smbd might notify other smbds about files it modifies on behalf of its clients, this can't be said about other processes running on the same server doing their own work. For example, a unix user might change a file in their home directory, and this change should become visible to all clients accessing this file or directory at the same time. Smbd enables inotify monitoring for files/directories on behalf of client requests, and such changes becomes immediately visible to SMB clients.

...which sounds sort of like this functionality mentioned in the linux discussion:

Whatever notification watches are placed on the local file are forwarded to the remote file server, which sets up inotify and forwards events back to the local filesystem.

so, I guess inotify actually runs on the remote fileserver instead of on the machine accessing the fileshare. maybe to listen to these events, we would need to use a samba client (in ruby?)

the smbclient utility does have a notify command, but the sambal ruby wrapper for it doesn't seem to implement that.

peetucket commented 3 months ago

listen gem sounded familiar, but not clear we are using it other than in development contexts: https://github.com/search?q=org%3Asul-dlss+gem+%27listen%27&type=code

And that is relying on the underlying tech in the file system under the hood, which may or may not work well for us, right?

This may be a good topic to throw out to all the engineers for additional insight/ideas

thatbudakguy commented 3 months ago

I think it's because rails itself uses listen, so we see it in dep updates, etc.

thatbudakguy commented 3 months ago

I just found ActiveSupport::FileUpdateChecker, but nothing in the class body seems to explain how it works. The example given is watching rails's i18n files for changes:

i18n_reloader = ActiveSupport::FileUpdateChecker.new(paths) do
  I18n.reload!
end

ActiveSupport::Reloader.to_prepare do
  i18n_reloader.execute_if_updated
end

I guess maybe you would still need to call #execute_if_updated as part of a polling loop to make it work?
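To illustrate what such a polling loop might look like, here is a minimal stdlib-only sketch. It does not use ActiveSupport: `MtimeChecker` and `poll` are hypothetical names, and the sketch assumes that changes are detected by comparing file modification times, which may not match what Rails does in every detail.

```ruby
# Hypothetical stand-in for a file update checker: it reports a change when
# the newest modification time among the watched paths moves forward.
class MtimeChecker
  def initialize(paths, &block)
    @paths = paths
    @block = block
    @last_seen = newest_mtime
  end

  # Runs the block only if a watched file changed since the last check.
  # Returns true if the block ran, false otherwise.
  def execute_if_updated
    current = newest_mtime
    return false unless current > @last_seen

    @last_seen = current
    @block.call
    true
  end

  private

  # Newest mtime among the watched files that currently exist.
  def newest_mtime
    @paths.select { |p| File.exist?(p) }.map { |p| File.mtime(p) }.max || Time.at(0)
  end
end

# Nothing happens unless something calls the checker repeatedly.
# The interval (in seconds) is an arbitrary choice for this sketch.
def poll(checker, interval: 5, iterations: Float::INFINITY)
  count = 0
  while count < iterations
    checker.execute_if_updated
    sleep interval
    count += 1
  end
end
```

The point of the sketch is that a checker of this kind is passive: detection only happens when `execute_if_updated` is called, so some loop or hook has to drive it.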

It seems ActiveSupport::EventedFileUpdateChecker is actually the one Rails is using, which calls listen internally.

thatbudakguy commented 3 months ago

I did some testing using the smbclient utility on QA.

I connected to the samba mount using the credentials that are already placed on the box by puppet; only root currently has permission to read them so I had to ksu first.

$ smbclient //dpglab-ocr-a/sdr-ocr-qa --authentication-file=/etc/samba/credentials/smbcred.dpg.labsrvc
Try "help" to get a list of possible commands.
smb: \> l
  .                                   D        0  Fri Apr 19 10:02:57 2024
  ..                                  D        0  Fri Apr 19 10:02:57 2024
  EXCEPTIONS                          D        0  Fri Apr 19 10:02:20 2024
  INPUT                               D        0  Fri Apr 19 10:02:10 2024
  OUTPUT                              D        0  Fri Apr 19 10:02:15 2024
  RESULTXML                           D        0  Fri Apr 19 10:02:48 2024

        104824063 blocks of size 4096. 99003344 blocks available

I used notify to watch the INPUT directory, and then made some changes in another terminal. These lines showed up one by one as I was making changes.

smb: \> notify INPUT
0001 test.txt
0003 test.txt
0001 bb112zx3193
0001 bb112zx3193\input.txt
0003 bb112zx3193\input.txt
0003 bb112zx3193

This was the result of creating the following file structure:

.
├── EXCEPTIONS
├── INPUT
│   ├── bb112zx3193
│   │   └── input.txt
│   └── test.txt
├── OUTPUT
└── RESULTXML

5 directories, 2 files

The numbers shown in the output of notify are detailed in this archived microsoft page; it looks like creating a file results in both a "created" (0001) and "modified" (0003) event immediately. Directories get "created" just like a file, and if a file is created within a directory, that directory also gets a "modified" event.
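If another process needs to consume this output, the lines could be parsed along the following lines. This is a sketch only: the method and struct names are hypothetical, codes 0001 and 0003 are the ones observed above, and the remaining codes are assumed from the Windows file-action constants rather than seen in this test.

```ruby
# Action codes as printed by `notify`. 1 and 3 were observed above; 2, 4
# and 5 are assumed from the Windows file-action constants.
ACTIONS = {
  1 => :added,
  2 => :removed,
  3 => :modified,
  4 => :renamed_old_name,
  5 => :renamed_new_name
}.freeze

NotifyEvent = Struct.new(:action, :path)

# Parses one line such as "0001 bb112zx3193\input.txt".
# Returns nil for lines that don't look like notify output.
def parse_notify_line(line)
  match = line.chomp.match(/\A(\d{4}) (.+)\z/)
  return nil unless match

  action = ACTIONS.fetch(match[1].to_i, :unknown)
  # smbclient prints Windows-style separators; normalize to forward slashes.
  NotifyEvent.new(action, match[2].tr('\\', '/'))
end

# Yields an event for each parseable line read from any IO-like object.
def each_notify_event(io)
  return enum_for(:each_notify_event, io) unless block_given?

  io.each_line do |line|
    event = parse_notify_line(line)
    yield event if event
  end
end
```

Because `each_notify_event` accepts any IO-like object, it could read from a pipe, a logfile, or `$stdin`.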

thatbudakguy commented 3 months ago

It is also possible to do:

smbclient //dpglab-ocr-a/sdr-ocr-qa --authentication-file=/etc/samba/credentials/smbcred.dpg.labsrvc -c 'notify INPUT' > input_changes.txt

And then access that file from another process.

thatbudakguy commented 3 months ago

I wonder...if we ran some background processes that essentially spit out notify INPUT, notify OUTPUT, etc. to logfiles in /var/log or similar, that might be workable?

thatbudakguy commented 3 months ago

write a script that can be run as a service that invokes smbclient notify and writes to a file using ts
set up logrotate for new services via profile::logrotate::rules in puppet

(skipping the above approach for now in favor of something simpler)

edsu commented 3 months ago

Not to throw a damper on this good research, but if we know what file ABBYY is going to write as output we could also:

flowchart TD
  A[Look for File] --> B{Found?}
  B --> |Yes| C[Done!]
  B --> |No| D{Slept Too Long?}
  D --> |Yes| F[Raise Error]
  D --> |No| E[Sleep]
  E --> A

thatbudakguy commented 3 months ago

For sure; I assume that's exactly what using force_polling: true in the listen gem would do. My assumption is that it's less desirable because you'll have one polling process per file, which could add up to a lot of filesystem queries if a lot of things are getting OCR'ed at once.

It might turn out to be a wild goose chase, but I wanted to see if there was a solution that would result in a single process (or one per ABBYY directory) that gets notified by the OS when any files are modified, instead of having to poll at some interval.
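For comparison, the per-file polling flow from the flowchart could be sketched in Ruby roughly as follows. The method name, error class, and default timeout and interval are all hypothetical and chosen only for illustration.

```ruby
# Hypothetical error raised when the file never shows up.
class OcrTimeoutError < StandardError; end

# Polls for `path` until it exists or `timeout` seconds have elapsed.
# Returns the path when found; raises OcrTimeoutError otherwise.
def wait_for_file(path, timeout: 3600, interval: 30)
  # Monotonic clock, so system clock adjustments don't affect the deadline.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    # Found? -> Done!
    return path if File.exist?(path)

    # Slept too long? -> Raise error
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise OcrTimeoutError, "gave up waiting for #{path}"
    end

    # Otherwise sleep and look again
    sleep interval
  end
end
```

Each call blocks its caller and issues one existence check per interval, which is the per-file cost discussed above.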

peetucket commented 3 months ago

Since we can control what the output XML file is called and where it will live, we can name it by druid. I think we can then monitor that folder with a single process, and when a new output XML file appears, parse it to see what to do (OCR complete, go look for files; errors, raise; etc.)
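A single-process version of that idea might look something like the sketch below. It assumes result files are named `<druid>.xml` in one folder; the method names and the druid pattern are hypothetical, and the sketch does not check whether a file has been completely written before reporting it.

```ruby
require 'set'

# Matches a bare druid such as "bb112zx3193". The pattern is an assumption
# made for this sketch.
DRUID_PATTERN = /\A[a-z]{2}\d{3}[a-z]{2}\d{4}\z/

# Returns the druids whose result XML has appeared since the last scan.
# `seen` is mutated so that repeated calls only report new files.
def new_result_druids(dir, seen)
  Dir.glob('*.xml', base: dir).filter_map do |name|
    druid = File.basename(name, '.xml')
    next unless druid.match?(DRUID_PATTERN)
    next unless seen.add?(druid) # add? returns nil if already present

    druid
  end
end

# One loop covers every object in flight; the block decides what to do
# with each new result (parse the XML, update the workflow step, etc.).
def watch_results(dir, interval: 30)
  seen = Set.new
  loop do
    new_result_druids(dir, seen).each { |druid| yield druid }
    sleep interval
  end
end
```

With this shape there is one directory listing per interval regardless of how many objects are being OCR'ed, rather than one polling loop per file.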

thatbudakguy commented 3 months ago

I tried out a very simple script using the listen gem.

The core of it was basically:

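require 'listen'

# Assumed from the output below: the four ABBYY folders, with the share
# root passed as the first command-line argument.
WATCH_DIRS = %w[INPUT OUTPUT RESULTXML EXCEPTIONS].freeze
abbyy_root = ARGV.fetch(0)

# Build one listener per ABBYY folder
listeners = WATCH_DIRS.map do |dir|
  watch_dir = File.join(abbyy_root, dir)
  puts "Watching #{watch_dir}"

  # The block receives arrays of absolute paths for each kind of change
  Listen.to(watch_dir) do |modified, added, removed|
    puts "#{watch_dir}: modified #{modified}" unless modified.empty?
    puts "#{watch_dir}: added #{added}" unless added.empty?
    puts "#{watch_dir}: removed #{removed}" unless removed.empty?
  end
end

# Start every listener, then sleep so the process doesn't exit
listeners.each(&:start)
sleep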

And it successfully detected me modifying some files in INPUT:

$ ./bin/watch_abbyy /abbyy/
Watching /abbyy/INPUT
Watching /abbyy/OUTPUT
Watching /abbyy/RESULTXML
Watching /abbyy/EXCEPTIONS
Press Ctrl-C to stop
/abbyy/INPUT: added ["/abbyy/INPUT/myfile"]
/abbyy/INPUT: modified ["/abbyy/INPUT/myfile"]
/abbyy/INPUT: removed ["/abbyy/INPUT/myfile"]

I think now we need to check if ABBYY adding the files from its side triggers the handler too. If not, we would need to use force_polling: true.

peetucket commented 3 months ago

Great! So the listener just sits there running in a process until you stop it?

thatbudakguy commented 3 months ago

Sort of – it runs in its own thread, but you need to sleep or otherwise do something to prevent the process from exiting, otherwise it'll stop and be cleaned up by Ruby.