sbs20 / scanservjs

SANE scanner nodejs web ui
https://sbs20.github.io/scanservjs/
GNU General Public License v2.0

Add "do nothing"-logic to the end of the pipeline, mechanism. #553

Open torwag opened 1 year ago

torwag commented 1 year ago

I am working on a Python program that is called by the pipeline; it starts another process to work on the images.

Python can do what is needed the same way as ImageMagick. In addition, it can send the compiled PDF to other services (in my case I use the paperless API) or deal with the scanned images in other ways (e.g. back up the scanned files). Even more complex features are possible; for example, my program removes empty pages, which is really helpful if you scan a mix of single- and double-sided prints.

The program exits as soon as the processing has been handed off to an external process (that is, the external process keeps running as long as it takes to process the data). Thus, it unblocks scanservjs relatively quickly, allowing the next scan to start almost immediately.

The entire toolchain works so far; only the integration into scanservjs is troubling me right now. AFAIK scanservjs expects the final file as the pipeline's output so it can process that file further (e.g., add it to the document folder). Obviously, since this now runs asynchronously, that file is not ready yet at that point in time. I could fake a file for scanservjs to use, but that feels like a dirty hack that populates the document view with rubbish files. I want to interfere with the original logic as little as possible, but would it be possible to signal scanservjs that in this particular case no further processing on the scanservjs side is required?

From easy to hard:

If the pipeline finally returns e.g. "None", skip all following tasks and exit the current scan procedure without error (see the sketch after this list).

As a second idea, which might require a bit more coding: if the final output of the pipeline is "None: ...", do nothing but send a message to the scanservjs messaging service, e.g. "Post-processing for paperless started externally...".

As a final idea, if scanservjs offered a way via the API to send status messages to the web frontend, I (and others) could report the progress of the concurrently running processes to the user.
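
To make idea 1 concrete, here is a minimal sketch of what the last pipeline command could look like, assuming the sentinel existed; paperless-handoff.sh is a made-up helper that takes over the pages and detaches the real processing:

# hypothetical last pipeline command: hand the page filenames arriving on
# stdin to a made-up helper, then print the proposed sentinel instead of a
# list of result files ("None" handling does not exist in scanservjs today)
xargs -d '\n' -r /usr/local/bin/paperless-handoff.sh && echo "None"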

As for why this is needed: I run scanservjs on an RPi solely to create a document scanning station. Processing a scan of 10-20 pages takes around 2-3 minutes, and during this time scanservjs is blocked and can't be used. Users hate waiting 2-3 minutes in front of the machine to start the next scan.

Just noticed I am not alone with this challenge: #537

sbs20 commented 1 year ago

Ok. What about this

It links into #504 which will allow users to run custom actions on any given file (for things which have already been scanned). You get to do whatever you want with the file - update it, delete it, move it, call a program with it. But it can also be tied to run after a pipeline. It's not exactly what you were asking for but it's close - and has the benefit that you can run it from the UI too.

torwag commented 1 year ago

Hi, thanks for looking into it. I can see and welcome the benefits of using afterAction hooks to further manage file tasks in the UI. I might be missing the exact difference between afterActions and doing the same thing directly in the pipeline. Is it only about being accessible via the UI, or do those tasks run asynchronously as well?

What I don't understand yet: I have to deal with the raw files from the scan process, basically the ~tmp-* files in the temp folder. My program moves those into a processing folder and starts the asynchronous process that generates the final output file (roughly as sketched after the list below). As actions only take a single file as parameter, this doesn't really help me, right? If I understood correctly and actions really are executed asynchronously, that would help my program a lot, since the main complexity is running an external child process that doesn't terminate with the parent process. I could get rid of all that logic if scanservjs runs whatever is called asynchronously anyhow.

Your proposed solution gives me two challenges:

  1. I would need a way to submit more than one file to the action.
  2. I would need to create a pipeline that is reasonably fast, which makes me wonder what the output of that pipeline should be for scanservjs to consume if it isn't the final file.
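
For illustration, a minimal sketch of that hand-off helper; all names and paths here are made up. It moves the raw pages into a work folder and detaches the heavy processing with setsid/nohup so the calling pipeline (or action) returns immediately:

#!/bin/bash
# paperless-handoff.sh (made-up name): called with the scanned page files as
# arguments; returns as soon as the background job has been started
set -euo pipefail

workdir=$(mktemp -d /var/scans/job-XXXXXX)   # assumed work folder
mv -- "$@" "$workdir"/

# detach the long-running processing so it survives the parent process;
# process-and-upload.sh is an assumed script that builds the PDF and posts
# it to the paperless API
setsid nohup /usr/local/bin/process-and-upload.sh "$workdir" \
    >"$workdir/process.log" 2>&1 </dev/null &

echo "background processing started in $workdir" >&2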
sbs20 commented 1 year ago

As for asynchronous, it's not going to happen.

Scans can be performed as a single process (adf with batch), or many sequential processes in different HTTP requests (flatbed with manual). Each of these has to feed into the pipeline. With the first case (batch) there's no way to query the scanimage process to see where it's up to - or how far through it is (it stops when the feeder runs out) so that leaves polling the filesystem. Polling the filesystem is complicated by not knowing when a file has finished being written to. And all of that would have to be mediated by awaiting / joining a load of promises or running a secondary service. The approach would be different with manual collation, but reporting errors would be harder since the promises would effectively be orphaned, the HTTP request having finished. Again, there are means to work around it, but I simply don't have the appetite to do it.

I appreciate it's not ideal on low powered devices but I already struggle for time on this project and need to balance reliability, ongoing maintainability (new versions of node / Vue), ease of use, installation / package size, extensibility and do that for all users with competing and sometimes contradictory requirements.

I know this may not be an option for you - and I hesitate to mention it - but you can pick up second-hand NUCs for not much more than an RPi, and their performance is an order of magnitude better.

Let me know if you need help on the afterAction.

ukos-git commented 1 year ago

I agree that implementing async/await or threaded processes is a time-consuming task. I also experimented with file system watchers that look for generated scan files and start processing or uploading them during scans; it worked, but was still not easy to do.

Let me point out that scanimage has the ability to print each filename to stdout in a batch scan as soon as the file has been generated:

-b, --batch[=FORMAT]       working in batch mode, FORMAT is `out%d.pnm' `out%d.tif' 
(...)
    --batch-print          print image filenames to stdout

As much as I like Python programs for processing, I'd like to state that Python is not the best performer here. For most image manipulation cases there are already optimized Linux programs, which is why I mostly stick to bash for image manipulation. Let's consider a bash function that reads filenames from stdin like this:

# pipe-able function that filters empty pages out of the stream of filenames on stdin
filter_empty_pages() {
    while read -r file
    do
        # mean brightness in percent; close to 100 means the page is (almost) all white
        white=$(convert "$file" -fuzz 0% -negate -threshold 50% -negate -format "%[fx:100*mean]" info:)
        empty_page=$(echo "scale=4; ${white} > 99.5" | bc)
        if ((empty_page)); then
            if ((VERBOSE)); then
                echo "page $file is empty" >&2
            fi
            continue
        fi
        # forward non-empty pages to the next command in the pipe
        echo "$file"
    done
}

This way, we can use the time during scans for processing and manipulating the files through the pipe that is opened by scanimage and closed by the last process.

scanimage ... --batch-print ... | filter_empty_pages | rotate180 | ...

All the piping is done on a single-file processing level and would be compatible with manual interaction where changing the paper in the Flatbed source is required.

In my case, I also re-order the pages last-to-first after the pipeline has finished successfully. This is obviously something that can only be done after the scanimage command has terminated.
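
For completeness, a minimal sketch of such a post-scan re-ordering step; the scan-NNNN.pnm and reordered-NNNN.pnm names are only assumptions:

# sketch: reverse the page order once scanimage has finished
# assumes files named scan-NNNN.pnm that sort in scan order; the
# "reordered-" prefix is only illustrative and avoids clobbering originals
reverse_pages() {
    local pages=(scan-*.pnm)
    local n=${#pages[@]}
    local i
    for i in "${!pages[@]}"; do
        mv -- "${pages[$i]}" "$(printf 'reordered-%04d.pnm' $((n - i)))"
    done
}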

A last point I'd like to make is that tesseract can also read its file list from a stream, which allows OCR processing during scans and generates the resulting PDF at the end:

   scanimage ... --batch-print ... | tesseract \
        -c stream_filelist=true \
        -c min_characters_to_try=10 \
        --dpi 300 \
        - - pdf \
        > out.pdf

For multi-page scans (30+ pages) this saves a lot of time.

I'd propose a two-fold pipeline logic: a pipeable per-file stage that consumes filenames while scanimage is still running, followed by the existing whole-batch stage that runs after scanimage has terminated.

This would not affect any current logic and would extend the command class to include pipeable commands, again based on bash for compatibility with any program the user likes to include here.
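
To make the idea concrete, a rough sketch of the two stages in bash; kept-pages.txt and the exact scanimage options are placeholders, and filter_empty_pages is the function from above:

# stage 1: pipeable per-file commands, fed while scanimage is still scanning
scanimage ... --batch='scan-%04d.pnm' --batch-print \
    | filter_empty_pages \
    > kept-pages.txt

# stage 2: whole-batch commands, only possible after scanimage has terminated:
# reverse the page order and build the final searchable PDF from the list
tac kept-pages.txt > reversed-pages.txt
tesseract reversed-pages.txt out --dpi 300 pdf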