gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

Insects of Guam 10.5281/zenodo.3634035 Thysanoptera #886

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

FFDF7A4CFFF3FFC777195003DD64051A

this is the processed document b172p7-16.pdf

@gsautter you might want to run this from scratch on your machine, since there were some issues I had to correct manually before I uploaded it to TB

  1. I opened it as scanned PDF
  2. There are some OCR issues. Checking just the taxonomic names in the nomenclature and refgroup sections of the treatments is enough to see them
  3. The bibRefs are in the style used in botany, so this might be one to keep an eye on as a zoological example
  4. After running the taxonomic name tagger, check which names are missing and which are not. Try to tag the missing ones, and then use the taxonomic name parser. This does not work
  5. TaxonomicNameLabel is not discovered in most cases
  6. How much of this article could be fixed, and do you think we could write a template for this journal? http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html

For Aubrey: right now this is processable, though it needs some expertise, and in return it provides new species that are in neither Catalogue of Life nor GBIF: https://www.gbif.org/dataset/9c8d5683-76c1-4938-aede-b7ad5391b6b2 and Zenodo: https://zenodo.org/record/3634035#.Xjc3iWhKjAQ and TreatmentBank: http://treatment.plazi.org/GgServer/summary/FFDF7A4CFFF3FFC777195003DD64051A

aubreymoore commented 4 years ago

Stuck at step 1. Cannot open a downloaded copy of b172.p7-16.pdf.

Details:

  1. Opened ggi with java -jar GgImagineStarter.jar.
  2. Allowed web access.
  3. Selected Local Master config.
  4. Selected File | Open document, selected PDF Documents (scanned), and navigated to b172.p7-16.pdf.

When I click on Ok, I am returned to the main screen with 0 pages loaded.

Here's the error log:

Font 'FreeSerif' loaded successfully.
Font 'FreeSerifItalic' loaded successfully.
Font 'FreeSerifBold' loaded successfully.
Font 'FreeSerifBoldItalic' loaded successfully.
Font 'FreeSans' loaded successfully.
Font 'FreeSansOblique' loaded successfully.
Font 'FreeSansBold' loaded successfully.
Font 'FreeSansBoldOblique' loaded successfully.
Font 'FreeMono' loaded successfully.
Font 'FreeMonoOblique' loaded successfully.
Font 'FreeMonoBold' loaded successfully.
Font 'FreeMonoBoldOblique' loaded successfully.
Exception in thread "LoaderThread" java.lang.NoClassDefFoundError: com/sun/image/codec/jpeg/JPEGImageDecoder
    at org.icepdf.core.util.Parser.getObject(Parser.java:315)
    at org.icepdf.core.pobjects.Document.loadDocumentViaXRefs(Document.java:498)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:421)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:288)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8590)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8383)
    at de.uka.ipd.idaho.im.imagine.application.GoldenGateImagineUI$11.run(GoldenGateImagineUI.java:1190)
Caused by: java.lang.ClassNotFoundException: com.sun.image.codec.jpeg.JPEGImageDecoder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    ... 7 more

aubreymoore commented 4 years ago

An explanation of what I am trying to do here.

I am attempting to build a biodiversity inventory for Guam starting with the insects. I have already got label data from the University of Guam Insect Collection online and ported to GBIF thanks to SCAN. I am now turning my attention to legacy literature starting with Insects of Guam I and II. I was able to provide some funding to help the Bishop Museum put these bulletins online at http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html. My goal is to free up the biodiversity in Insects of Guam by extracting data as Darwin core archives to be published on GBIF. GGI appears to be the best way to do this and I am very thankful that this tool has been made available.

Not sure about the suggestion of putting together a template for Insects of Guam because formatting seems to vary a lot among chapters.

gsautter commented 4 years ago

Now that @myrmoteras has decoded the underlying PDF into an IMF, you should not need to decode any PDFs any longer ... downloading the IMF and using "File > Open Document" with "Image Markup Files" should do (IMF strictly uses PNG for bitmap images, which is not the most efficient, but both open and widely supported).

Apart from this, which version of Java are you running? Simply type java -version in your console to find out. We've been running PDF decoding on Linux systems for a while, but we never got the error you report, so this would be very helpful to know in order to improve Linux support.
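
For reference, the same information can also be read from inside a running JVM. The following is a minimal sketch and not part of GGI: the class name JavaRuntimeReport is made up for illustration, while the property keys are standard Java system properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reports the properties that tell JVM builds apart. */
final class JavaRuntimeReport {

    /** Collects standard system properties describing the running JVM. */
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<>();
        String[] keys = {"java.version", "java.vendor", "java.vm.name", "os.name"};
        for (String key : keys) {
            // Fall back to a placeholder if a property is not set.
            info.put(key, System.getProperty(key, "(unknown)"));
        }
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The java.vendor and java.vm.name values are the ones that distinguish an OpenJDK build from an Oracle one.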

aubreymoore commented 4 years ago

Hi Guido,

Here's my java version:

aubrey@aubrey-Latitude-7280:~$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)


myrmoteras commented 4 years ago

@gsautter there are >100 more PDFs to decode; see http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html. It would thus be good to find a solution, also with regard to the issues above if possible

myrmoteras commented 4 years ago

@aubreymoore if you need some sort of introduction to using GG, it might be wise to spend a moment on Skype to teach you the first steps?!

aubreymoore commented 4 years ago

Hi Donat,

I would like to take you up on your offer to coach me via Skype. But I think I need to get GG working on my machine before this would be of benefit.

I am thinking running GG in a Docker container might be a potential solution. Or, alternatively, using an online version if you have one.

Thanks to you and Guido for the help.

All the Best,


myrmoteras commented 4 years ago

Hi @aubreymoore, we are setting up a repo for fellows who want to learn GG. In fact, we are now very close to offering training using screencasts that follow a checklist of the steps needed to get a document processed. @mguidoti will help you set up your account at https://github.com/plazi/learning

gsautter commented 4 years ago

Thanks for the stack trace and the Java version info ... looks like the third-party PDF library I'm using for basic data structures and stream decoders (IcePDF, in its royalty free version) has some dependency on an Oracle JDK specific class ... very bad style, especially the absence of any kind of fallback for other JDKs (GGI has something similar for the initialization of its HTTPS certificate store, but it does have a fallback, which obviously kicked in and did the trick on your machine).
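
To illustrate the kind of fallback meant here, the sketch below probes for the vendor-specific class before relying on it, and decodes through the standard javax.imageio API otherwise. It is an illustration under assumptions, not IcePDF or GGI code: the class name JpegDecoderProbe and its methods are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Hypothetical sketch: check for a vendor class instead of assuming it exists. */
final class JpegDecoderProbe {

    /** The class named in the stack trace above. */
    static final String VENDOR_DECODER = "com.sun.image.codec.jpeg.JPEGImageDecoder";

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, JpegDecoderProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /**
     * Portable decoding path via the standard ImageIO API.
     * ImageIO.read returns null if no registered reader can decode the data.
     */
    static BufferedImage decodeJpeg(byte[] jpegBytes) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(jpegBytes));
    }

    public static void main(String[] args) {
        boolean available = isClassAvailable(VENDOR_DECODER);
        System.out.println(VENDOR_DECODER + " available: " + available);
        if (!available) {
            System.out.println("Would decode via javax.imageio instead.");
        }
    }
}
```

Probing with Class.forName turns a NoClassDefFoundError in the middle of loading into an ordinary boolean that the caller can branch on.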

gsautter commented 4 years ago

Now there are two ways of dealing with this situation:

Unless you have policy issues, I'd recommend the latter, even though I'm not fond at all of the new license Oracle has been imposing since spring 2019 ... you can use an earlier JVM, of course, as GGI does not depend on any recent Java features.

gsautter commented 4 years ago

Regarding annotating and parsing taxon names originally missed by FAT: turns out the parser works quite well once you get OCR fixed, as long as you fix the italics property as well (Ctrl+I in the OCR line editing widget) ... the parser still insists on both the bold and the italics property matching between genus and species. We might loosen this restriction, but that might also end up incurring errors.
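
The matching restriction can be pictured as follows. This is only a sketch of the rule as described above, not the actual parser code: StyledWord, BinomialStyleCheck, and stylesMatch are hypothetical names.

```java
/** Hypothetical sketch of the font-property rule between genus and species. */
final class BinomialStyleCheck {

    /** A word as OCR delivers it: its text plus the two font properties. */
    static final class StyledWord {
        final String text;
        final boolean bold;
        final boolean italics;

        StyledWord(String text, boolean bold, boolean italics) {
            this.text = text;
            this.bold = bold;
            this.italics = italics;
        }
    }

    /** Both the bold and the italics property must agree between the two words. */
    static boolean stylesMatch(StyledWord genus, StyledWord species) {
        return genus.bold == species.bold && genus.italics == species.italics;
    }

    public static void main(String[] args) {
        StyledWord genus = new StyledWord("Genus", false, true);
        StyledWord speciesOk = new StyledWord("species", false, true);
        // Simulates OCR missing the italics on the second word.
        StyledWord speciesMissedItalics = new StyledWord("species", false, false);
        System.out.println(stylesMatch(genus, speciesOk));
        System.out.println(stylesMatch(genus, speciesMissedItalics));
    }
}
```

Under such a rule, a single missed italics flag on either word is enough to reject the pair, which is why fixing the property in the OCR editor matters.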

gsautter commented 4 years ago

Regarding the in-line bibliographic references: this is a challenge we have yet to start tackling. This is also important for FAT because taxon name authorities more often than not overshoot to include parts of the reference journal name if there is no separating punctuation mark other than a comma.

Unfortunately, I hardly see a generic take on this so far, apart from a (never complete) lexicon of journal names. If there is more than a comma to separate the taxon authors from the journal name, we should also be able to exploit that and use patterns, with a good chance of populating or continuously amending a lexicon, but if a comma is all we have, said lexicon is the only viable approach I see at this point.
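
A lexicon lookup of the kind described could look roughly like this. It is a minimal sketch under stated assumptions, not FAT code: the class and method names are hypothetical, the lexicon is a plain set of journal names, and the input is the text that follows a taxon name, with a comma as the only separator.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch: cut a taxon authority off where a known journal name starts. */
final class AuthorityLexiconSketch {

    /**
     * Returns the part of afterName that precedes the earliest ", <journal>" match.
     * If no lexicon entry matches, the whole string is returned, so the overshoot remains.
     */
    static String authorityBefore(String afterName, Set<String> journalLexicon) {
        int cut = afterName.length();
        for (String journal : journalLexicon) {
            int at = afterName.indexOf(", " + journal);
            if (at >= 0 && at < cut) {
                cut = at;
            }
        }
        return afterName.substring(0, cut).trim();
    }

    public static void main(String[] args) {
        // Placeholder strings, not taken from the document under discussion.
        Set<String> lexicon = Collections.singleton("Example Journal");
        System.out.println(authorityBefore("Author, Example Journal 1: 2", lexicon));
    }
}
```

The no-match branch shows the limitation named above: a journal missing from the lexicon leaves the authority overshooting exactly as before.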

gsautter commented 4 years ago

Regarding the taxon name status labels that failed to annotate, I tend to think this is a downstream effect of the names proper failing to annotate, as the labels annotate only in immediate succession to names.
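
The downstream effect can be sketched as an adjacency check. This is an assumption-based illustration of "immediate succession", not the actual tagger logic: spans are token index ranges, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a label is only accepted directly after an annotated name. */
final class LabelAdjacencySketch {

    /** Token span [start, end) of an annotated taxon name. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** True only if some name annotation ends exactly where the label candidate starts. */
    static boolean labelCanAnnotate(int labelStart, List<Span> nameSpans) {
        for (Span name : nameSpans) {
            if (name.end == labelStart) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Span> names = Arrays.asList(new Span(0, 2));
        System.out.println(labelCanAnnotate(2, names));
        // With the name annotation missing, the same label candidate is rejected.
        System.out.println(labelCanAnnotate(2, Collections.<Span>emptyList()));
    }
}
```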

gsautter commented 4 years ago

About making a template for this journal: reading through http://hbs.bishopmuseum.org/pubs-online/bpbm-bulletins.html , this journal looks quite broadly scoped to me, so I don't expect all too many articles to actually contain treatments. I am not sure whether this justifies the effort of making a template.

Apart from that, the layout of the treatments is bothering me a little bit, specifically the indentation change between reference group and the rest of the treatment (materials citations, discussion, etc.). This is not a problem in original descriptions, obviously, but in the other treatments. Apart from this, the inherently inaccurate font sizes of OCR output might also be problematic for heading detection, even though the numbered treatment headings would likely prove quite advantageous.

All in all, it might be possible to make a template for this journal, but whether or not it's worthwhile in terms of (a) the treatments we get from it and (b) the effort we save in comparison to getting those same treatments without a template, that's a completely different cup of tea.

gsautter commented 4 years ago

Regarding OCR, inaccurate word boundaries pose a recurring problem. In particular, if word boundaries are off, there is no way of accurately getting the image cut-out of the originally printed word, which severely impacts detection of bold face and italics. Errors in the latter department then get in the way of heading detection, taxon name markup, etc.

That said, we need to do something about inaccurate word boundaries if we seriously want to use templates for OCR output, as only that ensures we can rely on those very handy font properties in downstream processing.

myrmoteras commented 4 years ago

@aubreymoore OK, let me know when you are ready. I will be traveling next week and am not sure whether and how well I can connect, so it may be better to try before the end of the week?

aubreymoore commented 4 years ago

I still cannot get GGI to open scanned PDFs. I removed OpenJDK and installed Oracle Java, which was not easy at all. The only version I could get working was Java 13:

aubrey@aubrey-Latitude-7280:~$ java -version
java version "13.0.2" 2020-01-14
Java(TM) SE Runtime Environment (build 13.0.2+8)
Java HotSpot(TM) 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)

When I open a local copy of b172.p7-16.pdf, I get what appear to be the same errors as before when I was running OpenJDK:

Font 'FreeSerif' loaded successfully.
Font 'FreeSerifItalic' loaded successfully.
Font 'FreeSerifBold' loaded successfully.
Font 'FreeSerifBoldItalic' loaded successfully.
Font 'FreeSans' loaded successfully.
Font 'FreeSansOblique' loaded successfully.
Font 'FreeSansBold' loaded successfully.
Font 'FreeSansBoldOblique' loaded successfully.
Font 'FreeMono' loaded successfully.
Font 'FreeMonoOblique' loaded successfully.
Font 'FreeMonoBold' loaded successfully.
Font 'FreeMonoBoldOblique' loaded successfully.
Exception in thread "LoaderThread" java.lang.NoClassDefFoundError: com/sun/image/codec/jpeg/JPEGImageDecoder
    at org.icepdf.core.util.Parser.getObject(Parser.java:315)
    at org.icepdf.core.pobjects.Document.loadDocumentViaXRefs(Document.java:498)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:421)
    at org.icepdf.core.pobjects.Document.setInputStream(Document.java:288)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8596)
    at de.uka.ipd.idaho.im.pdf.PdfExtractor.loadImagePdf(PdfExtractor.java:8389)
    at de.uka.ipd.idaho.im.imagine.application.GoldenGateImagineUI$11.run(GoldenGateImagineUI.java:1190)
Caused by: java.lang.ClassNotFoundException: com.sun.image.codec.jpeg.JPEGImageDecoder
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 7 more

Any ideas how to proceed?

gsautter commented 4 years ago

I'm in the process of switching to a more recent version of IcePDF that does not have this pathological dependency on JPEGImageDecoder, in the hope that this will finally resolve the issue. Still have a few tests to run before I can put this into a build, though ... will let you know when it becomes available.

aubreymoore commented 4 years ago

Thanks for the update. I like your term "pathological dependency".