jwilk / didjvu

DjVu encoder with foreground/background separation
https://jwilk.net/software/didjvu
GNU General Public License v2.0

DjVu-binarization does not invert parts of the page that have white text on a colored background #21

Open rmast opened 3 years ago

rmast commented 3 years ago

If you print and scan this document: https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf the resulting DJBZ from didjvu with its default djvu binarizer contains inverted tiles of text. DjVuSolo 3.1 inverts those tiles, keeping only the smaller elements within a large surface as foreground and using the background color for the colored frame.

I have already read a (probably expired) patent that mentions this, so it deserves attention. didjvu uses Gamera's djvu_threshold (include/plugins/threshold.hpp) for this binarization, so this issue should probably be forwarded to Gamera.
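To make the expected behaviour concrete, here is a naive sketch of a per-tile polarity fix. It is not what DjVuSolo, Gamera, or the patent do; the function name `normalize_tile_polarity` and the "more than half the tile is foreground" rule are assumptions chosen only for illustration.

```python
def normalize_tile_polarity(mask):
    """Flip a 0/1 tile when foreground (1) covers more than half of it.

    Hypothetical heuristic: text normally covers a minority of a tile, so a
    mostly-foreground tile is assumed to be light text on a dark surface.
    """
    total = sum(len(row) for row in mask)
    ones = sum(sum(row) for row in mask)
    if total and ones * 2 > total:
        return [[1 - value for value in row] for row in mask]
    return mask


# A tile where the colored surface ended up as foreground and the glyph as holes.
inverted_tile = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
fixed = normalize_tile_polarity(inverted_tile)
print(fixed)
```

A real implementation would need to work on connected regions rather than fixed tiles, and would have to hand the flipped surface's color to the background layer.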

rmast commented 3 years ago

@jsbien Patent https://patents.google.com/patent/US6901169 seems to deal with the choice between foreground and background. I've spent some time trying to understand exactly what they do, but I still don't.

I just looked at the status of the patent: it is still active, so I guess there is no use in implementing or studying it.

jsbien commented 3 years ago

@rmast "Status Active, 2023-11-30 Adjusted expiration". Does it expire in two years or, on the contrary, is that a date on which the active status can be prolonged? On the other hand, all USA software patents are valid only in the USA, so if you want to use the software elsewhere then they do not matter. Am I correct? BTW, I found my notes, but they require some checking and editing before I make them public.

rmast commented 3 years ago

I can't imagine that American software patents wouldn't be valid in Europe or even in Korea. There are patent disputes between Apple and the Korean company Samsung, for example. Otherwise someone could just use an offshore company to get around the patent.

I can imagine those rules are subject to international trade agreements.

However, a European or US patent lasts 20 years from the 'filing' date: https://www.bardehle.com/europeansoftwarepatents/faq/how-long-does-a-software-patent-last/ https://www.stopfakes.gov/article?id=How-Long-Does-Patent-Trademark-or-Copyright-Protection-Last

So you're right, this patent will probably expire soon. However, I see the word 'filed' with the date January 24, 2002, so November 30, 2023 is more than 20 years after that apparent filing date.
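The date arithmetic can be checked with a few lines of Python. This sketch only uses the two dates quoted in this thread and a flat 20-year term; the helper name `nominal_expiry` is made up for illustration, and the code says nothing about why the listed expiration is later.

```python
from datetime import date


def nominal_expiry(filing, term_years=20):
    """Filing date plus a flat term in years (assumes filing is not Feb 29)."""
    return filing.replace(year=filing.year + term_years)


filing = date(2002, 1, 24)    # 'filed' date quoted above
listed = date(2023, 11, 30)   # 'Adjusted expiration' quoted above

nominal = nominal_expiry(filing)
extra_days = (listed - nominal).days
print(nominal.isoformat(), extra_days)
```

With these inputs the nominal 20-year date is 2022-01-24, and the listed expiration is 675 days later.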

Your documentation states this patent has an excellent performance in front/back separation.

DjVuSolo3.1's inversion is far from perfect, so I doubt DjVuSolo 3.1 already contains this patented algorithm.

So, would you suggest starting to code now, in order to have something production-ready in two years?

jsbien commented 3 years ago

In principle, patents are valid only in the countries where they were explicitly granted, but you are right that some trade agreements can affect this. I am quite sure they are not valid in countries which do not recognize software patents. A useful list is available at https://en.wikipedia.org/wiki/Software_patent. You are right that the European Union has a kind of software patent. I remembered the heated discussions but forgot that the idea was finally accepted. Now, besides the USA, the patent in question is active in Europe: https://patents.google.com/patent/EP1229495B1/en (2022-01-31 Anticipated expiration), in Canada: https://patents.google.com/patent/CA2369841C/en (2022-01-31 Anticipated expiration), and perhaps in South Korea: https://patents.google.com/patent/KR100873583B1/en (no explicit information on status or expiration date). It is not relevant elsewhere unless some trade agreement says otherwise.

This is a very good example of patent-created FUD (https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt).

To be on the safe side, you can contact the Current Assignee (AT&T Corp, AT&T Intellectual Property II LP) and/or ask the SFLC (https://en.wikipedia.org/wiki/Software_Freedom_Law_Center) or a similar organization for help.

rmast commented 3 years ago

If a European patent exists as well, you're right that the American patent probably doesn't cover Europe; otherwise they wouldn't have gone to that double effort. And as the European patent expires in two months, we won't even be able to produce a realistic infringement in our spare time before then.

rmast commented 3 years ago

The text of the European patent seems to differ from the American one, so it might clarify some things. They talk about getting things done in very few passes, and about choosing foreground/background even before choosing which parts contribute to estimating the background color. So the binarization, foreground/background estimation and color histogram determination are all done simultaneously. I guess we will have to focus on Gamera's djvu_threshold (include/plugins/threshold.hpp) to put it all in.

rmast commented 3 years ago

After reading a bit about the history of DjVuSolo 3.1, I now believe it does contain the patented algorithm. Since it behaves poorly on folded and inkjet-printed content, it probably needs another or an additional strategy to make that content readable.

[image]

Simply thresholding the inverted text at 160 gives a more readable result:

[image: Knipsel]
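A minimal sketch of that fixed threshold, assuming 8-bit grayscale values and assuming that in a white-on-color region the text pixels are the ones brighter than 160. The function name `threshold_light_text` and the sample tile are made up for illustration; this is not Gamera's djvu_threshold.

```python
def threshold_light_text(gray, threshold=160):
    """Return a 0/1 mask with 1 where a pixel is brighter than `threshold`.

    Assumption: `gray` is a list of rows of 0..255 values taken from a region
    that holds light text on a darker, colored background.
    """
    return [[1 if value > threshold else 0 for value in row] for row in gray]


# Tiny synthetic tile: bright glyph pixels on a mid-gray background.
tile = [
    [90, 92, 235, 91],
    [88, 240, 228, 90],
    [91, 93, 161, 160],
]
mask = threshold_light_text(tile)
for row in mask:
    print(row)
```

A global threshold like this only makes sense inside a region already known to hold inverted text; applied to the whole page it would mark the white paper as foreground.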