trufanov-nok / scantailor-universal

ScanTailor Universal - a fork based on Enhanced+Featured+Master versions of ST
http://scantailor.org

Content Size Normalization #59

Closed: ylluminate closed this issue 6 years ago

ylluminate commented 6 years ago

Seeing your activity here makes me think that perhaps you might have some meaningful input or thoughts on this in the main codebase: https://github.com/scantailor/scantailor/issues/298

trufanov-nok commented 6 years ago

I'm not the original author of the code, so I can't say for sure, but AFAIK font size has no impact on processing results. Maybe on dewarping, but not on the other stages.

At the end of the process you end up with textual content sizes that are mismatched.

It's not a problem. It should be like that. If you mean content zone detection, then the zone should wrap the page areas that contain some content, as closely as possible. Yes, the zones will differ due to chapter starts etc. That's fine; it's addressed at the next step. There may be a problem that the content zone at this step is wider than the real one, because it's wrapped around the real content plus some dots and spots around it. Many people want despeckling at this stage too (right now it's done only at the Output stage). This isn't implemented yet.

If you mean output page sizes, they're always equal unless you switch off "Match with other page sizes" at stage "Page alignment". All pages that have this option on are matched to [max width of all pages X max height of all pages]. If there are any gaps, the alignment option controls them. If you want the "original page size" no matter what, then the Auto-margins option is for you. If you downloaded my release, check the Settings window: the Page Layout settings group has detailed explanations of how the features work, including (I hope) clear explanations of how the final page size is calculated. In a nutshell: you add margins (possibly of different sizes) around the content box; you don't resize the content box.
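A minimal sketch of the sizing rule described above, as I read it: each page is its content box plus margins, and every page with the matching option on ends up at the maximum width and maximum height among those pages. The `Page` class, its field names, and the numbers are hypothetical illustrations, not STU's actual code.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # All values share one unit (e.g. millimeters); names are hypothetical.
    content_w: float
    content_h: float
    margin_left: float = 0.0
    margin_right: float = 0.0
    margin_top: float = 0.0
    margin_bottom: float = 0.0
    match_size: bool = True  # stands in for "Match with other page sizes"

    def padded_size(self):
        """Content box plus margins; the content box itself is never resized."""
        width = self.content_w + self.margin_left + self.margin_right
        height = self.content_h + self.margin_top + self.margin_bottom
        return width, height


def matched_page_size(pages):
    """Return (max width, max height) over the pages that take part in matching."""
    sizes = [p.padded_size() for p in pages if p.match_size]
    if not sizes:
        raise ValueError("no pages have size matching enabled")
    return max(w for w, _ in sizes), max(h for _, h in sizes)


pages = [
    Page(100, 180, 10, 10, 10, 10),        # a full text page
    Page(110, 90, 10, 10, 10, 10),         # a short, slightly wider page
    Page(500, 500, match_size=False),      # excluded from matching
]
print(matched_page_size(pages))
```

Pages smaller than the matched size would then be positioned inside it according to the alignment option, which this sketch leaves out.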

daa2018 commented 6 years ago

Do you process scans or photos?

If photos (large differences), try the Experimental fork and its "Match size by scaling" function.

If scans (small differences), try ClearScan recognition in Adobe Acrobat (vectorization of fonts with further averaging). Then you can save your PDF as 600 dpi TIFF images.

ylluminate commented 6 years ago

Wow @daa2018 - yes, they're photos, so that might be perfect(ish). Appreciate your pointing that out.

@trufanov-nok I'm going to have to check out what you're talking about a bit more closely via building it.

Are there any Homebrew or MacPorts building recipes or packages around?

trufanov-nok commented 6 years ago

As for building for Mac, well... I have no experience with Macs and no machine to do that on my own. There are instructions on how to build the original ST for Mac that were inherited by this fork: https://github.com/trufanov-nok/scantailor/blob/master/packaging/osx/readme.en.txt But they are outdated. At least Qt4 should be replaced with Qt5, as we've switched to Qt5. It could be found here. The build script probably must be adjusted too. As the process is very similar to the Windows build instructions, I assume it will be very hard to do. I may try to build a Mac version via VirtualBox, if that's possible, when I find some time.

trufanov-nok commented 6 years ago

I've managed to build it for Mac. Used v10.9. I'm not sure if it'll launch on your machine. It would be great if you could try it and report back whether it launches, plus your system version. There might also be some bugs under the hood, as no one has tested it. ScanTailor.zip

ylluminate commented 6 years ago

Thanks so much @trufanov-nok. It might be worth putting together a Homebrew (or MacPorts) formula for it for posterity, etc.

FYI, your build DOES run on macOS 10.13.4. I'm still having some issues with sizing here. I attempted to use Dewarping -> Mode:Marginal (experimental), but it still doesn't seem to stretch it out to normalize sizing based on margin. I wonder if there is a way to do this collectively so as to calculate the entire average across all pages? Here is an example of two pages (content redacted) to demonstrate size differentials: example

@ylluminarious would you be interested in perhaps looking at this to make a formula for each of these variants that could have potential utility? It could also dovetail with this: https://github.com/scantailor/scantailor/issues/273

Or the other option would be that we can look into a MacPorts version / update. Might be interesting to start contributing to that as per our discussions.

trufanov-nok commented 6 years ago

your build DOES run on macOS 10.13.4

Cool!

I attempted to use Dewarping -> Mode:Marginal (experimental), but it still doesn't seem to stretch it

Dewarping isn't about sizing. Page sizes are fully defined at stage "5. Page Layout". Dewarping tries to change the content zone geometry to make sure no lines are curved. It helps with text in thick books that is close to the book's spine and may be curved on scans. The change of geometry may affect the content zone size, but in the end it will be scaled to the page size calculated at stage "5. Page Layout". So the final page size won't change, and dewarping isn't the right tool to handle content zone size.

I'm not sure if I fully understand the problem, but I would expect some dpi-related issues with your scans. Could you send both to my email? trufanovan@gmail.com

As for the Mac build: it seems no one has tried to build it for Mac for years. I fixed some problems I faced along the way, but at least one problem I worked around rather than solved. So at least this fork of ST is not ready to be compiled for Mac, and I wouldn't expect making formulas to be easy. Right now, if one builds the codebase for Mac, fixing the paths I hardcoded, the executable will crash on app exit. At least it does on my virtual machine. I suppose it's some memory leak. On the other hand, if one builds the app on Mac with the Qt Creator IDE instead of the cmake/make toolchain, the executable has no such problem. I think it's a compiler/linker parameters problem, and I just replaced the executable in the dmg rather than solving this puzzle. So it's still not something that can be easily built for Mac.

UPD: I'll check the discussion you refer to for some helpful info. Perhaps I can patch the existing formula for Qt5-based ST forks myself.

ylluminarious commented 6 years ago

@ylluminate Sure, I can try my hand at writing a formula for this. I'd like to see what @trufanov-nok does first, and perhaps offer help if I can. I'll try out this whole thing with homebrew first and if it becomes problematic in some way, I'll take a look at making a Mac Port.

ylluminate commented 6 years ago

Thanks @ylluminarious!

Did that help @trufanov-nok?

trufanov-nok commented 6 years ago

@ylluminate you may try

brew tap trufanov-nok/scantailor-universal
brew install scantailor-universal

to build STU with the latest Qt 5.10.1, or brew install scantailor-universal-qt5 to build it with Qt 5.5 from homebrew-core. Let me know if it works. I'm still waiting for Qt to compile on my virtual machine.

ylluminate commented 6 years ago

@trufanov-nok I've sent a couple emails now, but it seems they may not be getting to you?

trufanov-nok commented 6 years ago

Ok, I've just found them in a spam folder. Now everything should be fine.

trufanov-nok commented 6 years ago

Well, I've checked the images. The files don't contain tags with vertical and horizontal DPI. If you know or can roughly measure the actual physical size of a book page or of the content on it, then we could calculate the DPI values for sure. But I've just assumed they are 300x300 dpi (because of the file sizes), which results in roughly Letter (A4) size.

How did I get that? Just import your files into a project and apply 300x300 dpi to all pages. (You need to do this manually, as ST can't find DPI info in your files.) Then, while you're on stage "1 Fix Orientation", check the bottom right corner. The status bar there shows the image size in pixels (well, at least it's there on Windows and Linux). Click on it and it will switch between millimeters/centimeters/inches/pixels. Pixels are always the same, but the others depend on the original page DPI and show the expected physical image size. For example, page 00N.jpg will give you 6.41333x10.5633 inches.

It should be said that there is no such indicator in the original ST - I made it for myself as part of STU. Among all the forks (if I recall correctly), only ST Advanced (another actively developed fork of ST) has imported this feature.

OK, now pass through stages 2 and 3 and consider stage "4. Select Content". Check the value in the status bar. Now it shows the size of the content zone on the page rather than the whole page size. (On each processing stage it displays the size of the thing you're editing.) Check its value for 00N.jpg in pixels. It's 1306x2508 px. Then go to the previous page, 00M.jpg, and check the value. It's 1535.33 x 2898 px.

Now we see that their content boxes are quite different. If you check other pages, you'll find that the content boxes of the left pages (I think they are left) are always bigger than those of the right ones.

The 00M.jpg and 00N.jpg content zones will be 5.11778 x 9.66 inches and 4.35333 x 8.36 inches, respectively.

Then go to 00M.jpg, open Tools/Fix DPI.../All pages, find 2209 x 3638 px/00M.jpg and apply 350x350 dpi to it. Then check its size indicator again. The pixel size will be the same, as it's not "physical" and not DPI dependent, but the size in inches will change to 4.38667 x 8.28 in, which is close enough to the other page (00N.jpg). By increasing the assumed DPI of the source scan, we increase the number of pixels needed per inch and thus decrease its physical size. We've scaled it down.
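The arithmetic behind the size indicator can be sketched as plain division of pixels by DPI. The function name is a hypothetical helper; the pixel sizes and DPI values are the ones quoted in this comment.

```python
def pixels_to_inches(width_px, height_px, dpi):
    """Physical size in inches for a pixel size at an assumed DPI."""
    return width_px / dpi, height_px / dpi


# Content zone sizes quoted above, with the DPI assumed for each case.
cases = [
    ("00M.jpg", 1535.33, 2898, 300),
    ("00N.jpg", 1306, 2508, 300),
    ("00M.jpg", 1535.33, 2898, 350),  # after applying 350x350 dpi
]
for name, width_px, height_px, dpi in cases:
    width_in, height_in = pixels_to_inches(width_px, height_px, dpi)
    print(f"{name} at {dpi} dpi: {width_in:.3f} x {height_in:.3f} in")
```

The pixel counts never change; only the assumed DPI does, so raising it from 300 to 350 shrinks 00M.jpg's content zone to roughly the physical size of 00N.jpg's.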

I assume that someone used two different scanners with different settings for the left and right pages, or two cameras at different distances from the page (the latter is unlikely), or the scans were postprocessed in a 3rd-party app and somehow scaled. If you increase the DPI for the pages that are wider than 2100 pixels, you'll scale all pages to match each other (more or less).
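The rule of thumb in the last sentence could be sketched like this. The 2100 px threshold and the 300/350 dpi values come from this comment; the function name is hypothetical, and the pixel size used for 00N.jpg is inferred from the 6.41333 x 10.5633 in at 300 dpi quoted earlier (about 1924 x 3169 px), not read from the file.

```python
def assign_dpi(page_sizes, width_threshold_px=2100, wide_dpi=350, default_dpi=300):
    """Map each file name to an assumed DPI based on its pixel width."""
    return {
        name: wide_dpi if width_px > width_threshold_px else default_dpi
        for name, (width_px, _height_px) in page_sizes.items()
    }


page_sizes = {
    "00M.jpg": (2209, 3638),  # size shown in the Fix DPI dialog
    "00N.jpg": (1924, 3169),  # inferred, see the note above
}
print(assign_dpi(page_sizes))
```

This only decides which DPI each page should get; applying it still has to happen in the DPI dialog or by writing tags into the files.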

The problem is that STU uses the same DPI tool as the original ST, and it isn't designed to change a lot of pages at once (only if they are grouped by the same size). So you have to do this manually. The other options would be:

a). Import the pages as 300x300 dpi. Process them as usual, but set 0 margins and switch off "Match with other pages" at stage "5. Page Layout". Then output the results. There will be just the content zones. Then you can apply some 3rd-party tool to scale all images to some boundary size. Console tools like imagemagick for Linux should allow that. After that, these images could be imported into STU again with equal DPI, but they'll already be scaled, so you just need to add margins to them and output again. Technically, this approach lets you postpone processing stage "5. Page Layout" while benefiting from "4. Select content", and apply some external scaling between them. But it looks like this approach is unacceptable in your case, because your content zones have not only different heights but also different widths. This is because of the marks next to the Greek text. Thus you have no single target size to scale all pages to. Otherwise your "too wide" pages will end up too small, or the others too large.

a1). Split the source pages into two groups and process them as separate projects. Then merge the output files, somehow resolving conflicting filenames and keeping the right order.

b). Find a tool that can write the vertical and horizontal DPI tags into the image file header. You can easily select the image files that need to be patched by sorting files by image size in a file browser. Then you can write 350x350 dpi for one group and 300x300 for the other. In this case ST will detect the DPI automatically, and you won't face the DPI tool dialog in the image import window. The problem is that I'm not sure if the JPG format supports such tags. PNG and TIFF support them for sure - I'm working with PNG sources. If it doesn't, you'll need a tool that bulk-converts JPG to PNG and writes the required tags to the PNG. Or you need one tool that converts and another that writes tags. There should be options. You'll need to google this.

c). Do this manually.

d). The original ST's DPI tool isn't designed for multi-selection of pages (only if they are grouped by the same size), so mine isn't either. But maybe Scan Tailor Advanced or Scan Tailor Experimental by Tulon have more functional tools. I heard Tulon somehow managed to hide DPI settings from the user and do page size detection automatically. But I'm not tracking either of these projects, so I can't say for sure. And I'm not sure if there are Mac versions.
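For option b), one way to write DPI tags into PNG files without extra tools is to insert a pHYs chunk, which is where the PNG format stores pixel density (in pixels per metre). This is only a sketch using the Python standard library: the function names are hypothetical, it assumes the file has no pHYs chunk yet, and whether ST then picks up the value is something to verify on a test page.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# IHDR is always the first chunk: 4-byte length + 4-byte type + 13 data bytes + 4-byte CRC.
IHDR_END = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4


def dpi_to_pixels_per_metre(dpi):
    """PNG stores density in pixels per metre, so convert from dots per inch."""
    return round(dpi / 0.0254)


def build_phys_chunk(dpi_x, dpi_y):
    """Build a complete pHYs chunk (length, type, data, CRC) for the given DPI."""
    data = struct.pack(
        ">IIB",
        dpi_to_pixels_per_metre(dpi_x),
        dpi_to_pixels_per_metre(dpi_y),
        1,  # unit specifier: 1 means the unit is the metre
    )
    crc = zlib.crc32(b"pHYs" + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + b"pHYs" + data + struct.pack(">I", crc)


def add_dpi_to_png(png_bytes, dpi_x, dpi_y):
    """Return a copy of the PNG data with a pHYs chunk inserted right after IHDR.

    Assumes the input has no pHYs chunk yet; a real tool should replace an
    existing chunk instead of adding a second one.
    """
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    return png_bytes[:IHDR_END] + build_phys_chunk(dpi_x, dpi_y) + png_bytes[IHDR_END:]
```

To use it, read a file in binary mode, pass the bytes through `add_dpi_to_png(data, 350, 350)` (or 300 for the other group), and write the result to a new file so the originals stay untouched.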

I think the DPI dialog isn't difficult to enhance for multi-selection. It's just not in my ToDo list. I can try to do this in a couple of weeks or a month. Until then I plan to do nothing. And I'm hoping to rewrite this dialog completely by the end of this year.

Note: DPI setting at stage "6. Output" isn't related to the source image dpi and defines just the size of output page.

ylluminate commented 6 years ago

Hmm, when I started this project in ST originally I did assign a 300x300 DPI to all pages... The content width and height are roughly 103 mm wide by 188 mm tall. The pages themselves are 156 mm wide by 238 mm tall.

So it appears that I really need to go through and re-set the DPI resolution for your version of the application.
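With physical measurements like these, the DPI to assign can be estimated by dividing a pixel dimension by the matching physical dimension in inches. A minimal sketch; the function name is hypothetical, and the pixel width in the example is only a placeholder for whatever STU's status bar shows for the same region that was measured.

```python
MM_PER_INCH = 25.4


def estimate_dpi(size_px, size_mm):
    """Estimate DPI from a pixel dimension and the same dimension measured in mm."""
    return size_px / (size_mm / MM_PER_INCH)


# Placeholder input: a content box 1306 px wide that measures 103 mm on paper.
print(round(estimate_dpi(1306, 103)))
```

The pixel value and the millimeter value must describe the same thing (content width with content width, page height with page height), and width and height should give similar results if the image is not distorted.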

I also had carefully checked all content selections in the previous version of ST and it appears that the content selection has been preserved in your new version.

Hmm, I'm going to have to carefully go through your notes above here. As far as I understand it, the images were captured with the same camera, but were done in two different sessions and distance could have changed a little each time, especially with such a thick book. Further, it was run through a program to eliminate the distortions as well initially before this phase (apparently a program called Scanner Pro for iOS works very well, but obviously it has done some funny things with output).

ylluminate commented 6 years ago

Please tell me if I'm sane:

So I had thought at one point I could just use ST(U) to first output all pages with 0 margin. I had tried this, but ended up with mixed issues, BUT if I did do this, I could then do as you said and extract the entire project out into groups: A) English, B) Greek, C) odd pages that don't fit into any sizing (chapters, indexes, etc.)

With this approach, I would then be able to somehow (I guess use imagemagick or some other resize tool, not sure which yet as I would want high quality resizing with minimal blurring; wonder if there are any better or higher quality macOS options - Photoshop does not seem to do this quite right(?)) resize the pages to meet their general size requirements based on NO marginal boundaries. I could do this both vertically and horizontally based on the standard size at a set resolution (300 dpi I suspect). I probably would keep them all TIFF since this is output from STU.

Then after I do this proper resizing for all pages, I then remerge them back into a singular folder and then run them through STU again to get proper marginal sizes and finished output.

Does this sound right?

trufanov-nok commented 6 years ago

Yes, but I would rather group pages by desired width. So Greek and English could go into one group if you want their widths to be the same. But a Greek page with one mark (on the left or right) is wider and should go into another group. A Greek page with 2 marks (both left and right of the main text) is even wider and should go into group 3.

Then I would scale each group's widths to the desired width, and the heights proportionally to the change I made to the width. So it's a scale to the specified size keeping the original aspect ratio, like a wallpaper on a PC that doesn't match the whole screen size. Just use a very big target height value. It looks like this is the default scaling for imagemagick.
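The proportional scaling described here amounts to one multiplication. A minimal sketch with a hypothetical helper name; the actual resizing would still be done by an image tool, this only shows the target size it should produce.

```python
def scale_to_width(width_px, height_px, target_width_px):
    """Return the new (width, height) after scaling to a target width,
    keeping the original aspect ratio."""
    factor = target_width_px / width_px
    return target_width_px, round(height_px * factor)


# Example: bring a 1535 x 2898 px content zone down to a 1306 px wide group.
print(scale_to_width(1535, 2898, 1306))
```

Because only the width is fixed, pages in the same group keep different heights, which is what the later bottom alignment and margins are meant to handle.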

And in the 2nd project I would add some big enough margins and vertically align all pages to the bottom. So all pages' content zones will start at x from the bottom. Then sort all pages by page height and find all ends of chapters etc. at the top of the thumbnail list. There is usually a small number of such pages. Then manually increase their bottom margin to lift the content etc.

Btw, I heard that the ScanKromsator app has a different approach to managing the final page size. I've never used it, but perhaps it can scale content zones. Have you seen this scan editor? But I think it's not open source and is for Windows only...

ylluminate commented 6 years ago

Got a crash while attempting to apply margin size to all pages. This was done on pages where I added DPI to them and used PNG instead of TIFF for final output to test it out.

ScanKromsator looks interesting, but being Russian + seeming to miss some overall general functionality in detecting content for already somewhat processed pages seems like a non-starter. EDIT: found some english instructions.

trufanov-nok commented 6 years ago

and used PNG instead of TIFF for final output to test it out.

ST outputs only TIFF images with different compression methods, so you couldn't. If the bug is reproducible with the set of images you sent, then you can press Ctrl+S just before the crash to save the project file and send it to me to check the crash. Maybe it's a Mac thing.

ylluminate commented 6 years ago

Meant final output from the conversion app for input into STU.

Email sent. Crashed again on margin change.

ylluminate commented 6 years ago

So I managed to pretty much get things in order as far as possible aside from the aforementioned crash.

I exported all images from ST with 0 margin as I had them in there already with well defined margins and I couldn't work with STU crashing.

I then split the pages into groups as appropriate.

Following this I resized each group and individual pages based on sizes measured in millimeters and set the DPI.

The program that I used, XnConvert on macOS, ended up screwing things up apparently. I found in Photoshop that the DPI was really strange and ST and STU were acting weird. I ended up resizing all pages with ImageMagick (for f in *.png; do convert -units PixelsPerInch -density 600 "$f" "600/$f"; done). After this I found that STU acted better, but when I went to set a small margin size around ALL pages, it just kept crashing at this point.

I believe I'm about finished here, but I have found that BOTH ST AND STU duplicate some images. They are very small images that only have a small amount of image on them; all pages with a goodly amount of content do not show up duplicated. After a careful review of the directory I found that I was correct and the image files themselves were NOT duplicated; only ST AND STU both show these duplicates only within the program. I am able to manually right click and "Remove from project...", but this is obviously some kind of bug.

I did discover that I had to, on Select Content, pick "Disable" on Content Box and apply to all pages in order to get smaller content area only pages to work properly. Essentially since ALL space is content with this no-margin version of the pages, I suppose this makes sense, but it was not at first intuitive and required a good bit of playing to just figure out that this was required.

I believe that after I get margins applied to all pages (after we get this crash resolved) and I bottom align the content so that page numbers are essentially the centerpoint, I'll be finished (since apparently I can't see Output until I have applied Page Layout properties to all pages)...

trufanov-nok commented 6 years ago

and I couldn't work with STU crashing.

Yep, my bad. I'll fix this today, but I'm not sure about regenerating the dmg. So you'd better bypass this. To bypass the issue, go to Settings and enable "Auto margins" in Page layout\Margins. With this feature enabled, STU won't crash.

trufanov-nok commented 6 years ago

Regarding duplicated pages: are you sure they didn't get split into two pages at stage "2. Page Split"? ST can automatically decide that a page should be split in halves if its width is twice as large as its height. In this case, on the following steps you'll see two identical thumbnails (until you click on them and the page thumbnails are regenerated) and an icon near the filename that shows whether this page is left or right. To make sure all your pages are single pages, you can go to stage 2 and apply the manual page split type "single pages" to all pages. If it's set manually, ST won't try to make any assumptions. I have no other ideas why this may happen. If you can make a small reproducible project with 1-2 pages and send me the project file with the images, that could help.

ylluminate commented 6 years ago

WOW, had the most crazy thing happen with Preferences - when I opened them at first things seemed fine, but the 2nd time I opened them it began expanding to the right side indefinitely. I have multiple monitors and it just kept expanding out onto the other displays until I hit the ESCAPE key to close the dialogue. I was able to do this repeatedly: example

trufanov-nok commented 6 years ago

WOW, had the most crazy thing happen with Preferences

Can't reproduce in a virtual machine. Could you check that it's not related to the fact that you have multiple monitors by powering off some of them?

ylluminate commented 6 years ago

So it seems to do it anyway. I'll have to try to test it on a 10.13 virtual machine at some point.

Yes, I saw the pages apparently were attempting to be split, but it was duplicating the content. After going to these pages, I saw that option 3 was highlighted. Upon selecting option 1, it seems to have disappeared. It was only happening on very small images.  I've created a sample and sent it to you. Oddly not all of the pages reproduced the split in the new test project, but it should give you a clear view. Not a big deal, but it's obviously strange.

Otherwise I managed to get the rest of the images out and this looks good after modifying the preferences as per above. I was able to make a change to the preferences WHILE it was sliding out too. LOL

trufanov-nok commented 6 years ago

Yes, I saw the pages apparently were attempting to be split, but it was duplicating the content. After going to these pages, I saw that option 3 was highlighted. Upon selecting option 1, it seems to have disappeared. It was only happening on very small images. I've created a sample and sent it to you. Oddly not all of the pages reproduced the split in the new test project, but it should give you a clear view. Not a big deal, but it's obviously strange.

Yep, that's what I thought. It's not a bug. ST tries to detect two-page scans by comparing their width and height. So it's not about the fact that your images are small. It's because the images contain only content, and with a small number of lines on a page, the resulting image's height will be much smaller than its width. The way we use ST now is a hack - it's not designed for work with content zones. It's designed for scans containing the whole page. So this behavior is fine.
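To illustrate why cropped pages with only a few lines trigger the split, here is a rough sketch of a width-versus-height check. It is not ST's actual detection code: the function name and the `ratio` threshold of 2.0 are assumptions based only on the "width twice the height" description earlier in this thread.

```python
def looks_like_two_page_spread(width_px, height_px, ratio=2.0):
    """Guess whether an image holds two pages, using only its proportions.

    The ratio threshold is a hypothetical value taken from the description
    in this thread, not from ST's source.
    """
    return width_px >= ratio * height_px


print(looks_like_two_page_spread(1306, 2508))  # a full-height content zone
print(looks_like_two_page_spread(1306, 400))   # a few lines at a chapter end
```

A check like this cannot tell a real two-page spread from a short, wide crop, which is why setting the split type manually at stage 2 avoids the problem.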

So we have 3 problems for now:

  1. Crash, that I know how to fix.
  2. Wrong sizing of the Settings window, which you'd better check on a single-screen machine.
  3. I'm about to resolve the homebrew stuff.

trufanov-nok commented 6 years ago

Updated Mac version: ScanTailorUniversal-0.2.3.zip