monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
https://www.tmonnier.com/docExtractor
MIT License

Demo website down #18

Closed · sjscotti closed this 2 years ago

sjscotti commented 2 years ago

Hi! I read your paper and watched your video with interest, and I would like to explore using your code for my application: getting layout segmentation from ~100-year-old newspapers. I downloaded the repo, but in trying to set up the Anaconda environment I discovered that a number of the dependencies are Linux-specific and not available for Windows. If there are no Windows versions available, I can set up Windows Subsystem for Linux (WSL) and use it that way. But I would really like to see how your code handles some example images of newspaper pages before I go to the trouble of setting up WSL. So I went to your demo website, https://enherit.paris.inria.fr/, to see if I could use it for this evaluation, but it is down. Could you please set up a new demo website so I can evaluate your repo? Thanks!

monniert commented 2 years ago

Hi @sjscotti, thanks for raising the issue!

Yes, our demo website is down and we don't have the resources to bring it back up yet... Nonetheless, I can try to run a couple of extractions for you when I have time; you can send me 10 images in JPG or PNG format by email at tom.monnier@enpc.fr and I will forward you the raw results.

Thanks, Tom

sjscotti commented 2 years ago

Thanks Tom! Could you work with .jp2 files (JPEG 2000)? My images are in this format. If not, I'll find a way to convert them to JPEG or PNG. Regards -Steve

monniert commented 2 years ago

Yes I think it can work, otherwise I will convert them

Tom

sjscotti commented 2 years ago

Thanks Tom, I just emailed the files to the address you provided. Your offer to run this test is much appreciated! Regards -Steve

sjscotti commented 2 years ago

Hi Tom, to get around the issue of setting up a Linux environment on my Windows machine, I got the idea of running your demo.ipynb notebook on Google Colab. I got it to run and tried out your code on one of the images I emailed you. Besides the images being in .jp2 format, I found that I needed to convert them from grayscale to RGB for them to run correctly, so I converted them in GIMP and exported them to .png format for a test. I did get results, but they were not very good because my images have a long dimension of 6720 pixels, which the code scales down to 1280 (5.25x smaller!). When I cut out a section of the image that was about 1280x1280 and ran it through the code, it detected lines of text nicely, which was encouraging. Is there a way your code can easily be modified to handle much larger images without downscaling them?
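
For reference, a minimal sketch of that conversion step in Python, assuming Pillow built with OpenJPEG support for JPEG 2000; the folder names are placeholders, not part of docExtractor:

```python
from pathlib import Path
from PIL import Image

src_dir = Path("newspaper_jp2")   # hypothetical input folder of .jp2 scans
dst_dir = Path("newspaper_png")
dst_dir.mkdir(exist_ok=True)

for jp2_path in src_dir.glob("*.jp2"):
    # open the JPEG 2000 scan and force a 3-channel RGB image,
    # since the grayscale originals did not run correctly as-is
    img = Image.open(jp2_path).convert("RGB")
    img.save(dst_dir / (jp2_path.stem + ".png"))
```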

monniert commented 2 years ago

Hi Steve, nice workaround running it through Colab! This is indeed an issue: the neural network has been trained on images of size roughly 1280, so it cannot handle very different image sizes. Depending on your application, you either want to downscale the image globally (this is what the current pipeline does and it works in most cases), or apply the extraction to crops of your original images (typically the right choice when your HD image contains a lot of small, compact content). Doing so is quite easy: preprocess your data into overlapping crops gathered in a folder, apply the extraction to the resulting crops, and merge the results. The last step would require a little work, but I think it can be done easily.
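
A minimal sketch of the cropping step described above, assuming Pillow; the tile size, overlap value, and file naming are illustrative and not part of docExtractor itself:

```python
from pathlib import Path
from PIL import Image

CROP = 1280     # roughly the size the network was trained on
OVERLAP = 256   # overlap so elements cut by one tile appear whole in a neighbour

def make_crops(image_path, out_dir):
    """Tile a large page into overlapping CROP x CROP crops saved in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    step = CROP - OVERLAP
    for top in range(0, max(h - OVERLAP, 1), step):
        for left in range(0, max(w - OVERLAP, 1), step):
            box = (left, top, min(left + CROP, w), min(top + CROP, h))
            # keep the offsets in the file name so predictions can be mapped
            # back to full-page coordinates when merging the results
            img.crop(box).save(out_dir / f"{Path(image_path).stem}_x{left}_y{top}.png")

make_crops("page.png", "crops/")  # then run the extraction on the crops/ folder
```

Merging would then consist of shifting each crop's detections by its (x, y) offset and combining the overlapping detections on the full page.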

Since you succeeded in running the extraction, I suppose that running it on my side would not give you additional insight. Let me know if you need more help!

monniert commented 2 years ago

@sjscotti Since you seem to have figured out a solution, I am closing the issue for now; let me know if you need more help.