reynoldsnlp / flair

fork from the FLAIR project at Tuebingen University

cg-conv hangs on certain inputs #49

Closed rkechols closed 4 years ago

rkechols commented 4 years ago

When the cg-conv utility is run from src/main/java/com/flair/server/utilities/CgConv.java, certain inputs cause it to hang, then time out as programmed.

One such input for Russian is the content of this site, which can be found by searching for говорить in Russian with curated domains; it is the 3rd result (as of June 18, 2020).
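For reference, here is a minimal sketch of the "run with a timeout" pattern described above. It is not the code in CgConv.java; the class name `TimedCommand` and the method `run` are hypothetical, and the command is whatever the caller passes in. It reads the child's stdout on a separate thread so the child cannot block on a full pipe, and it kills the child if it runs past the deadline. Whether a full pipe has anything to do with the hang reported here is not known.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not the project's CgConv.java. */
final class TimedCommand {

    /**
     * Runs the command, writes input to its stdin, and returns its stdout.
     * Returns Optional.empty() if the process does not exit within the timeout.
     */
    static Optional<String> run(List<String> command, String input, long timeoutMillis) {
        final Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore stderr
                    .start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Drain stdout on another thread so the child never waits on a full pipe.
        CompletableFuture<byte[]> stdout = CompletableFuture.supplyAsync(() -> {
            try (InputStream in = process.getInputStream()) {
                return in.readAllBytes();
            } catch (IOException e) {
                return new byte[0];
            }
        });

        // Write stdin on its own thread; closing the stream signals end of input.
        CompletableFuture.runAsync(() -> {
            try (OutputStream out = process.getOutputStream()) {
                out.write(input.getBytes(StandardCharsets.UTF_8));
            } catch (IOException ignored) {
                // The child exited or was killed before reading everything.
            }
        });

        try {
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // timed out: give up on this input
                return Optional.empty();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            process.destroyForcibly();
            return Optional.empty();
        }
        return Optional.of(new String(stdout.join(), StandardCharsets.UTF_8));
    }
}
```

A caller would pass the cg-conv command line it already uses as `command` and treat an empty result as "this sentence timed out", which matches the behavior described above where the rest of the document is still analyzed.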

reynoldsnlp commented 4 years ago

When I use my command line utilities to analyze the text from that page, everything completes fine. Do you have a good way to determine what part of the text is causing the hangup?

reynoldsnlp commented 4 years ago

Does the timeout make the whole search fail, or do we just ignore that page in the results?

rkechols commented 4 years ago

It happens repeatedly across almost the whole page. If the search is run with breakpoints in CgConv.java, we can see which string value is causing the problem. The document is still analyzed, but the sentences that cause cg-conv to hang do not receive the full analysis.

reynoldsnlp commented 4 years ago

Can you send me a file containing exactly the text that is passed in to the analyzer (i.e., exactly what is extracted from the HTML)? I want to see whether there are funny characters or something like that.
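As a rough way to check for funny characters, something like the sketch below could be run on each string just before it is handed to cg-conv. The class name `SuspiciousChars` and the choice of which Unicode categories count as suspicious are assumptions made for illustration; nothing here is taken from the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the project. */
final class SuspiciousChars {

    /**
     * Lists code points that are control, format (e.g. zero-width space, soft hyphen, BOM),
     * private-use, surrogate, or unassigned. Newline, carriage return, and tab are skipped.
     */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            boolean suspicious = type == Character.CONTROL
                    || type == Character.FORMAT
                    || type == Character.PRIVATE_USE
                    || type == Character.SURROGATE
                    || type == Character.UNASSIGNED;
            if (suspicious && !ordinaryWhitespace) {
                // Offsets are in UTF-16 code units, matching String indexing in Java.
                hits.add(String.format(Locale.ROOT, "offset %d: U+%04X (%s)",
                        i, cp, Character.getName(cp)));
            }
            i += Character.charCount(cp); // step over supplementary characters correctly
        }
        return hits;
    }
}
```

Logging the result of `SuspiciousChars.find(sentence)` for the sentences that time out would show whether they share any invisible characters that the sentences that succeed do not.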

rkechols commented 4 years ago

We've discovered that on the same problematic input, cg-conv from VISL CG-3 Disambiguator version 1.3.1.13891 hangs on Windows but does not hang on Linux. Making a note in the primary README.