Open myrmoteras opened 4 years ago
Stuck at step 1. Cannot open a downloaded copy of b172.p7-16.pdf.
Details:
java -jar GgImagineStarter.jar
When I click on OK, I am returned to the main screen with 0 pages loaded.
Here's the error log:
Font 'FreeSerif' loaded successfully.
Font 'FreeSerifItalic' loaded successfully.
Font 'FreeSerifBold' loaded successfully.
Font 'FreeSerifBoldItalic' loaded successfully.
Font 'FreeSans' loaded successfully.
Font 'FreeSansOblique' loaded successfully.
Font 'FreeSansBold' loaded successfully.
Font 'FreeSansBoldOblique' loaded successfully.
Font 'FreeMono' loaded successfully.
Font 'FreeMonoOblique' loaded successfully.
Font 'FreeMonoBold' loaded successfully.
Font 'FreeMonoBoldOblique' loaded successfully.
Exception in thread "LoaderThread" java.lang.NoClassDefFoundError: com/sun/image/codec/jpeg/JPEGImageDecoder
    at org.icepdf.core.util.Parser.getObject(Parser.java:315)
    at org.icepdf.core.pobjects.Document.loadDocumentViaXRefs(Document.java:498)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:421)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:288)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8590)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8383)
    at de.uka.ipd.idaho.im.imagine.application.GoldenGateImagineUI$11.run(GoldenGateImagineUI.java:1190)
Caused by: java.lang.ClassNotFoundException: com.sun.image.codec.jpeg.JPEGImageDecoder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 7 more
An explanation of what I am trying to do here.
I am attempting to build a biodiversity inventory for Guam starting with the insects. I have already got label data from the University of Guam Insect Collection online and ported to GBIF thanks to SCAN. I am now turning my attention to legacy literature starting with Insects of Guam I and II. I was able to provide some funding to help the Bishop Museum put these bulletins online at http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html. My goal is to free up the biodiversity in Insects of Guam by extracting data as Darwin core archives to be published on GBIF. GGI appears to be the best way to do this and I am very thankful that this tool has been made available.
Not sure about the suggestion of putting together a template for Insects of Guam because formatting seems to vary a lot among chapters.
Now that @myrmoteras has decoded the underlying PDF into an IMF, you should not need to decode any PDFs any longer ... downloading the IMF and using "File > Open Document" with "Image Markup Files" should do (IMF strictly uses PNG for bitmap images, which is not the most efficient, but both open and widely supported).
Apart from this, which version of Java are you running? Simply type java -version in your console to find out. We've been running PDF decoding on Linux systems for a while, but we never got the error you report, so this would be very helpful to know in order to improve Linux support.
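Since the JVM vendor may matter here as much as the version number, the same information can also be read from inside a running JVM. This is only a minimal sketch: the class name JvmInfo is made up for illustration and is not part of GGI; the system property names are the standard ones.

```java
public class JvmInfo {

    /** Collects the standard system properties that identify the running JVM. */
    public static String describe() {
        return "java.version = " + System.getProperty("java.version") + "\n"
             + "java.vendor  = " + System.getProperty("java.vendor") + "\n"
             + "java.vm.name = " + System.getProperty("java.vm.name") + "\n"
             + "java.home    = " + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        // Prints one "key = value" line per property.
        System.out.println(describe());
    }
}
```

Running it with the same java command that starts GgImagineStarter.jar shows which installation is actually picked up when several are present.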
Hi Guido,
Here's my java version:
aubrey@aubrey-Latitude-7280:~$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
@gsautter there are >100 more PDFs to decode (see http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html). It would thus be good to find a solution, also with regard to the issues above if possible.
@aubreymoore if you need some sort of introduction to using GG, it might be wise to spend a moment on Skype to teach you the first steps?
Hi Donat,
I would like to take you up on your offer to coach me via Skype. But I think I need to get GG working on my machine before this would be of benefit.
I am thinking running GG in a Docker container might be a potential solution. Or, alternatively, using an online version if you have one.
Thanks to you and Guido for the help.
All the Best,
Hi @aubreymoore, we are setting up a repo for fellows who want to learn GG. In fact, we are now very close to offering training based on screencasts that follow a checklist of the steps needed to get a document processed. @mguidoti will help you set up your account at https://github.com/plazi/learning
Thanks for the stack trace and the Java version info ... looks like the third-party PDF library I'm using for basic data structures and stream decoders (IcePDF, in its royalty free version) has some dependency on an Oracle JDK specific class ... very bad style, especially the absence of any kind of fallback for other JDKs (GGI has something similar for the initialization of its HTTPS certificate store, but it does have a fallback, which obviously kicked in and did the trick on your machine).
Now there are two ways of dealing with this situation:
Unless you have policy issues, I'd recommend the latter, even though I'm not fond at all of the new license Oracle has been imposing since spring 2019 ... you can use an earlier JVM, of course, as GGI does not depend on any recent Java features.
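To illustrate the kind of fallback that is missing in the library, here is a minimal sketch, not the actual IcePDF or GGI code. It probes for the class named in the stack trace and decodes JPEG data through the standard javax.imageio API either way. The class name JpegDecoderFallback and its method names are assumptions made up for this example.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

public class JpegDecoderFallback {

    // The class reported as missing in the stack trace.
    static final String LEGACY_DECODER = "com.sun.image.codec.jpeg.JPEGImageDecoder";

    /** Probes for a class by name without initializing it. */
    public static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, JpegDecoderFallback.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Decodes JPEG bytes with the standard Java SE javax.imageio API. */
    public static BufferedImage decodeJpeg(byte[] jpegBytes) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(jpegBytes));
            if (image == null) {
                throw new IOException("no ImageIO reader accepted the data");
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory JPEG so the example needs no input file. */
    public static byte[] sampleJpeg(int width, int height) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BufferedImage blank = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ImageIO.write(blank, "jpg", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Report whether the vendor-specific class exists on this JVM.
        System.out.println(LEGACY_DECODER + " available: " + isClassAvailable(LEGACY_DECODER));
        // Decode through the portable API, which does not depend on that class.
        BufferedImage image = decodeJpeg(sampleJpeg(4, 3));
        System.out.println("decoded via javax.imageio: " + image.getWidth() + "x" + image.getHeight());
    }
}
```

The probe alone is also a quick way to tell whether a given JVM would hit the NoClassDefFoundError before trying to load a scanned PDF.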
Regarding annotating and parsing taxon names originally missed by FAT: turns out the parser works quite well once you get OCR fixed, as long as you fix the italics property as well (Ctrl+I in the OCR line editing widget) ... the parser still insists on both the bold and the italics property matching between genus and species. We might loosen this restriction, but that might also end up incurring errors.
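The matching restriction can be pictured with a small sketch. This is not the GGI parser; OcrWord is a hypothetical, stripped-down stand-in for an OCR word with its font properties, and "Aus bus" is a placeholder name.

```java
public class FontMatchCheck {

    /** Hypothetical stand-in for an OCR word and its detected font properties. */
    public static final class OcrWord {
        final String text;
        final boolean bold;
        final boolean italics;

        public OcrWord(String text, boolean bold, boolean italics) {
            this.text = text;
            this.bold = bold;
            this.italics = italics;
        }
    }

    /** True only if genus and species agree on both bold and italics. */
    public static boolean fontPropertiesMatch(OcrWord genus, OcrWord species) {
        return genus.bold == species.bold && genus.italics == species.italics;
    }

    public static void main(String[] args) {
        OcrWord genus = new OcrWord("Aus", false, true);
        OcrWord speciesMissed = new OcrWord("bus", false, false); // italics not detected
        OcrWord speciesFixed = new OcrWord("bus", false, true);   // italics set by hand
        System.out.println("before fix: " + fontPropertiesMatch(genus, speciesMissed));
        System.out.println("after fix:  " + fontPropertiesMatch(genus, speciesFixed));
    }
}
```

Under this strict check, a single missed italics flag on either word is enough to block the name, which is why fixing the property in the OCR editor helps.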
Regarding the in-line bibliographic references: this is a challenge we have yet to start tackling. This is also important for FAT because taxon name authorities more often than not overshoot to include parts of the reference journal name if there is no separating punctuation mark other than a comma.
Unfortunately, I hardly see a generic take on this so far, apart from a (never complete) lexicon of journal names. If there is more than a comma to separate the taxon authors from the journal name, we should also be able to exploit that and use patterns, with a good chance of populating or continuously amending a lexicon, but if a comma is all we have, said lexicon is the only viable approach I see at this point.
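A minimal sketch of the lexicon approach, assuming the overshooting authority is available as a plain string. The class AuthorityTrimmer, its method, and the sample strings are made up for illustration and are not part of GGI or FAT.

```java
import java.util.Arrays;
import java.util.List;

public class AuthorityTrimmer {

    /**
     * Cuts the authority candidate at the earliest ", <journal name>" found in
     * the lexicon; returns the input unchanged if no lexicon entry occurs.
     */
    public static String trimAuthority(String authorityCandidate, List<String> journalLexicon) {
        int cut = -1;
        for (String journal : journalLexicon) {
            int index = authorityCandidate.indexOf(", " + journal);
            if (index >= 0 && (cut < 0 || index < cut)) {
                cut = index;
            }
        }
        return cut < 0 ? authorityCandidate : authorityCandidate.substring(0, cut);
    }

    public static void main(String[] args) {
        // Hypothetical lexicon and input, for illustration only.
        List<String> lexicon = Arrays.asList("Journal of Examples", "Example Bulletin");
        String overshoot = "Smith and Jones, Journal of Examples 5: 12";
        System.out.println(trimAuthority(overshoot, lexicon));
    }
}
```

As noted, this only works for journal names already in the lexicon; an unknown journal leaves the overshoot untouched.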
Regarding the taxon name status labels that failed to annotate, I tend to think this is a downstream effect of the names proper failing to annotate, as the labels annotate only in immediate succession to names.
About making a template for this journal: reading through http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html , this journal looks quite broadly scoped to me, so I don't expect all too many articles to actually contain treatments. I'm not sure whether this justifies the effort of making a template.
Apart from that, the layout of the treatments is bothering me a little bit, specifically the indentation change between the reference group and the rest of the treatment (materials citations, discussion, etc.). This is not a problem in original descriptions, obviously, but it is in the other treatments. In addition, the inherently inaccurate font sizes of OCR output might also prove problematic for heading detection, even though the numbered treatment headings would likely prove quite advantageous.
All in all, it might be possible to make a template for this journal, but whether or not it's worthwhile in terms of (a) the treatments we get from it and (b) the effort we save in comparison to getting those same treatments without a template, that's a completely different cup of tea.
Regarding OCR, inaccurate word boundaries pose a recurring problem. In particular, if word boundaries are off, there is no way of accurately getting the image cut-out of the originally printed word, which severely impacts detection of bold face and italics. Errors in the latter department then get in the way of heading detection, taxon name markup, etc.
That said, we need to do something about inaccurate word boundaries if we seriously want to use templates for OCR output, as only that ensures we can rely on those very handy font properties in downstream processing.
@aubreymoore OK, let me know when you are ready. I will be traveling next week and am not sure whether and how well I can connect, so it may be better to try before the end of the week?
I still cannot get GGI to open scanned PDFs. I removed OpenJDK and installed Oracle Java. Not easy at all. The only version I could get working was Java 13:
aubrey@aubrey-Latitude-7280:~$ java -version
java version "13.0.2" 2020-01-14
Java(TM) SE Runtime Environment (build 13.0.2+8)
Java HotSpot(TM) 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
When I open a local copy of b172.p7-16.pdf, I get what appear to be the same errors as before when I was running OpenJDK:
Font 'FreeSerif' loaded successfully.
Font 'FreeSerifItalic' loaded successfully.
Font 'FreeSerifBold' loaded successfully.
Font 'FreeSerifBoldItalic' loaded successfully.
Font 'FreeSans' loaded successfully.
Font 'FreeSansOblique' loaded successfully.
Font 'FreeSansBold' loaded successfully.
Font 'FreeSansBoldOblique' loaded successfully.
Font 'FreeMono' loaded successfully.
Font 'FreeMonoOblique' loaded successfully.
Font 'FreeMonoBold' loaded successfully.
Font 'FreeMonoBoldOblique' loaded successfully.
Exception in thread "LoaderThread" java.lang.NoClassDefFoundError: com/sun/image/codec/jpeg/JPEGImageDecoder
    at org.icepdf.core.util.Parser.getObject(Parser.java:315)
    at org.icepdf.core.pobjects.Document.loadDocumentViaXRefs(Document.java:498)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:421)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:288)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8596)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8389)
    at de.uka.ipd.idaho.im.imagine.application.GoldenGateImagineUI$11.run(GoldenGateImagineUI.java:1190)
Caused by: java.lang.ClassNotFoundException: com.sun.image.codec.jpeg.JPEGImageDecoder
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 7 more
Any ideas how to proceed?
I'm in the process of switching to a more recent version of IcePDF that does not have this pathological dependency on JPEGImageDecoder, in the hope that this will finally resolve the issue. Still have a few tests to run before I can put this into a build, though ... will let you know when it becomes available.
Thanks for the update. I like your term "pathological dependency".
FFDF7A4CFFF3FFC777195003DD64051A
this is the processed document b172p7-16.pdf
@gsautter you might want to run this from scratch on your machine, since there were some issues I had to correct manually before I uploaded it to TB
For Aubrey: right now this is processable, though it needs some expertise. In return, it provides new species that are in neither Catalogue of Life nor GBIF. See GBIF: https://www.gbif.org/dataset/9c8d5683-76c1-4938-aede-b7ad5391b6b2 and Zenodo: https://zenodo.org/record/3634035#.Xjc3iWhKjAQ and TreatmentBank: http://treatment.plazi.org/GgServer/summary/FFDF7A4CFFF3FFC777195003DD64051A