IMPLEMENTATION_ERROR when processing some Unicode characters

curtkohler commented 5 years ago

We've been running some text blocks through Quantities, and a number of them generate "IMPLEMENTATION ERROR: tokens ... got dissynchronized with tokenizations ..." errors. We were able to replicate this by feeding the text in question into the Quantities server at: http://cloud.science-miner.com/quantity/

Digging into this a bit, the text in question seems to contain a number of Unicode characters in the U+20xx range that cause problems.

Two sample text fragments that will generate the errors when run through the Quantities server:

π–π
O14―H14A

In the case of π–π, the error message is:

Requesting server... Error encountered while requesting the server. IMPLEMENTATION ERROR: tokens (at pos: 1) got dissynchronized with tokenizations (at pos: 1 ) labelsAndTokens +-: 'π' --> '-' 'π' tokenizations +-: 'π' --> '–' 'π'

Looking closely, you will notice the two lists appear to have different tokens in them (the '–' in the tokenizations list is longer than the '-' in labelsAndTokens).

Percent-escaping the error message confirms this: labelsAndTokens contains a plain '-', while tokenizations contains U+2013 (en dash):

labelsAndTokens%20+-%3A%20%27%u03C0%27%20--%3E%09%27-%27%20%27%u03C0%27%20tokenizations%20+-%3A%20%27%u03C0%27%20--%3E%09%27%u2013%27%20%27%u03C0%27
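A quick way to check this kind of mismatch on other fragments is to print the code points of each string. The following is only a minimal sketch; the class name `CodePointDump` and the method `describe` are hypothetical helpers, not part of Grobid or Quantities.

```java
import java.util.stream.Collectors;

// Hypothetical helper (not part of Grobid): lists the code points of a string
// so that look-alike characters such as '-' and an en dash can be told apart.
final class CodePointDump {

    // Returns the code points of s formatted as "U+XXXX", separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The fragment as it appears in labelsAndTokens (plain hyphen-minus).
        System.out.println(describe("\u03C0-\u03C0"));      // U+03C0 U+002D U+03C0
        // The fragment as it appears in tokenizations (en dash).
        System.out.println(describe("\u03C0\u2013\u03C0")); // U+03C0 U+2013 U+03C0
    }
}
```

The two outputs differ only in the middle code point, which matches the position reported in the error message.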

Perhaps two different code paths are generating slightly different lists in some cases and this is the cause of the two lists getting out of sync?

lfoppiano commented 5 years ago

Dear @curtkohler, sorry for the late answer.

Could you please provide more information about the data you were trying to process? Was it a PDF (if so, could you share it?) or plain text?

Did you run it only on the science-miner server or did you test it locally as well? AFAIK the version on master is a more recent version than the science-miner one.

curtkohler commented 5 years ago

No problem on the delayed reply.

One of the people in my group noticed some issues while running a number of sentences extracted from scientific texts through a local installation (no PDFs in play). The text should all have been UTF-8 encoded, I believe, as it was extracted from internal XML markup. He contacted me (as I’ve done some investigation into Grobid) after a run of 34K sentences in which about 1,800 came back with this type of error. Looking at the machine where the local Grobid/Quantities server is running, Grobid is at version 0.5.3 and the Quantities.zip was downloaded on Jan 3rd 2019. I don’t see an obvious version on the Quantities install, but I would assume it was the latest version available on Jan 3rd. The science-miner service exhibited the same errors he was seeing on the local install, and they were easy to reproduce there, which is why I referenced it in the issue. The short snippets in the issue are the targeted spans of the sentences where the errors are generated. Here are the actual corresponding sentences that were being processed:

(c) The alternative neighbor chains connected by H-bonding and π–π interaction in the crystal lattice of complex 1 viewed from b axis.

The blue lines stand for H-bonding O11―H11A…O5 and O9―H9A…O6, which form a 1D right-handed helical chain along b axis; the yellow lines stand for H-bonding O8―H80…O14, O14―H14B…O4 and O9―H9B…O4, which link the neighbor 1D helical chains in right-handed to form the 2D sheets; the pink lines stand for H-bonding O14―H14A…O15 and O14―H14B…O4 and π–π interaction, which link the 2D sheets to form 3D supramolecular framework.

And some additional ones if you need them (that also have quantities in them). Note, most of the problems I’ve seen seem to be around the processing of “dash” type entities.

The molecules are linked by a strong H-bonding, O11―H11A…O5 (1.845 Å, 2.657 Å, 166.64°) and O9―H9A…O6 (1.945 Å, 2.755 Å, 168.27°) ((a)) to form a 1D right-handed helical chain along b axis.

AFM topography images of (c) AALD ZnO deposited in 80 cycles and (d) the ITO substrate, and (e) J–V curve of a device with 125 nm thick AALD ZnO grown in 4.5 min and tested after one week of storage in air in the dark.

The 3DAP observation conditions were as follows: laser energy of 0.5–0.6 nJ, a laser-pulse repetition rate of 250 kHz, a DC bias voltage of 4–7 kV, and a specimen temperature of 55 K.

Curt

lfoppiano commented 5 years ago

Dear @curtkohler, yes indeed, the problem is caused by the dashes, which are somehow replaced with standard ones, so the tokens no longer match and the flow gets desynchronised.

I will try to fix this; for the time being, I recommend replacing the dashes with standard hyphens in the input.
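As a rough sketch of that workaround on the client side, the input can be pre-processed before it is sent to the server. The class and method names below are hypothetical, and the sketch assumes that the problematic characters all belong to the Unicode dash punctuation category (Pd), which covers the en dash (U+2013) and the horizontal bar (U+2015) but not, for example, the minus sign (U+2212).

```java
// Hypothetical client-side workaround (not part of Grobid or Quantities):
// map every character in the Unicode dash punctuation category to '-'.
final class DashWorkaround {

    // Replaces all dash punctuation (category Pd) with a plain hyphen-minus.
    static String normalizeDashes(String text) {
        return text.replaceAll("\\p{Pd}", "-");
    }

    public static void main(String[] args) {
        // Fragments modelled on the samples reported in this issue.
        System.out.println(normalizeDashes("\u03C0\u2013\u03C0")); // en dash
        System.out.println(normalizeDashes("O14\u2015H14A"));      // horizontal bar
    }
}
```

Since the replacement is one character for one character, offsets reported for the normalized text should still line up with the original input.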

kermitt2 commented 5 years ago

Hello!

The input goes through a Unicode normalisation which maps classes of characters to a canonical representation, so all the dashes are mapped to the simple one, "-". This is done by the GROBID class org.grobid.core.utilities.UnicodeUtil.java via the Java dash punctuation property:

 text = text.replaceAll("\\p{Pd}", "-");

The Unicode normalization is applied in the wrong place in this repo: it should be applied directly to the input string, before generating the features. Here we have, on one hand, the non-normalized tokens (the list of LayoutToken) and, on the other hand, the tokens normalized at feature level (so those coming back in the CRF results). We need to apply the Unicode normalization to both! (I actually made this error.)

For PDF input, the normalisation takes place in grobid core just after parsing the PDF so you won't have this problem.
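To illustrate the desynchronisation, here is a simplified sketch using plain string lists in place of LayoutToken objects and CRF output. The class `TokenSyncSketch` and its methods are hypothetical and only mimic the comparison; they are not the actual Grobid code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration (hypothetical, not the actual Grobid code) of why
// normalizing only one of the two token lists makes them diverge.
final class TokenSyncSketch {

    // Same dash normalization as shown above: any dash punctuation becomes '-'.
    static String normalize(String text) {
        return text.replaceAll("\\p{Pd}", "-");
    }

    static List<String> normalizeAll(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(normalize(token));
        }
        return out;
    }

    // Returns the index of the first position where the lists differ, or -1.
    static int firstMismatch(List<String> a, List<String> b) {
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Raw tokens, standing in for the list of LayoutToken.
        List<String> tokenizations = Arrays.asList("\u03C0", "\u2013", "\u03C0");
        // Tokens as seen at feature level, already normalized.
        List<String> labelsAndTokens = normalizeAll(tokenizations);

        // Only one side normalized: mismatch at position 1, as in the report.
        System.out.println(firstMismatch(tokenizations, labelsAndTokens));
        // Both sides normalized: the lists stay in sync (-1).
        System.out.println(firstMismatch(normalizeAll(tokenizations), labelsAndTokens));
    }
}
```

Normalizing the input string once, before tokenization and feature generation, removes the need to keep two separate normalization steps consistent.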

kermitt2 commented 5 years ago

For clarity:

Unicode normalization has to be applied everywhere (it's important for PDF input, and as the models are trained with normalized strings, string input also has to be Unicode-normalised).
@lfoppiano just to be sure, could you double-check that the training input for all grobid-quantities models also goes through this Unicode normalization?

lfoppiano commented 5 years ago

@curtkohler the last commit should fix the problem. The normalisation is now done before extracting the features.

I have also added it in the Trainers; I will retrain and update this issue when the new model is pushed.

lfoppiano commented 5 years ago

I've replaced normaliseTextAndRemoveSpace with normaliseText because the former destroys the layoutToken spaces...

lfoppiano commented 5 years ago

@curtkohler could you help me with some tests to see whether the text can be successfully parsed?

curtkohler commented 5 years ago

Do you just need example problem data for test cases, or do you need me to run some tests? If the latter, I can probably get to it later in the week…

lfoppiano commented 5 years ago

@curtkohler no worries, it's not urgent, just a heads-up to eventually get some feedback.

I've already done some tests with your examples and they worked. 👍

lfoppiano commented 5 years ago

I've re-trained the three models and pushed them. You can get them by pulling and then running ./gradlew copyModels

curtkohler commented 5 years ago

I ran about 25-30 of the problem paragraphs from the set we had identified through the new code and everything appears to be working as it should. Thanks for fixing the issue.

Curt
