rsennrich / ParZu

The Zurich Dependency Parser for German
https://pub.cl.uzh.ch/demo/parzu/
GNU General Public License v2.0

Some texts take very long to parse #29

Closed johann-petrak closed 2 years ago

johann-petrak commented 5 years ago

I am using ParZu to get features for documents that I first process with spaCy. For this reason, I am sending pre-tokenized input to ParZu. I noticed that some documents take a very long time to parse, even when neither the document nor its longest sentence is very long.

Here is an example text, already preprocessed/tokenized, url-encoded and added as a parameter to the complete URL to process it:

http://localhost:5003/parse/?inputformat=tokenized&text=F%C3%BCr%0AEin-%0Aund%0AUmsteiger%0A%3C%0Atable%0Awidth%3D%2275%0A%25%0A%22%0Aborder%3D%220%0A%22%0A%3E%0A%3C%0Atr%0Abgcolor%3D%22%23000000%0A%22%0A%3E%0A%3C%0Atd%3E%3Cfont%0Acolor%3D%22%23FFFFFF%0A%22%0Asize%3D%222%0A%22%0Aface%3D%22Arial%0A%2C%0AHelvetica%0A%2C%0Asans-serif%22%3EEin%0ALeitfaden%0Adurch%0Aden%0ALinux-Dschungel%0A%3C%0A/font%3E%3C/td%0A%3E%0A%3C%0A/tr%0A%3E%0A%3C%0A/table%0A%3E%0ATOP-SITES%0Af%0Af%0Ahttp%3A//www.cs.Helsinki.FI/~torvalds/%0AOffizielle%0AHomepage%0Avon%0ALinus%0ATorvalds%0Ahttp%3A//www.zdnet.de/linux/%0AZDNet-Special%0Ahttp%3A//www.linux.com/%0ALinx.com%0A-%0ADie%0ALinux-Site%0Aim%0ANetz%0Ahttp%3A//www.heise.de/ix/linux/%0ADie%0ALinux-Seiten%0Ader%0AZeitschrift-IX%0Ahttp%3A//www.linux.de/%0ALinks%0A%2C%0ANews%0A%2C%0A....%0Ahttp%3A//www.it-nachrichten.de/news/98742119/html/t/links/index.html%0AUmfangreiche%0ALinkseite%0Ahttp%3A//li.luga.or.at/%0ALinux-International%0Ahttp%3A//www.linux.org/%0ALinux-User-Site%0Ahttp%3A//www.linux.at/%0ALinux%0Ain%0A%C3%96sterreich%0AKURSE%0Af%0Af%0Ahttp%3A//tux.isCool.net%0ALinux%0AAnf%C3%A4nger%0AForum%0Ahttp%3A//www.pro-linux.de/%0ANews%0Aund%0AWorksshops%0Ahttp%3A//linux-forum.notrix.de%0ALinux%0AForum%0Ahttp%3A//public.surfree.com/rimez/master.html%0AThe%0ALinux%0ANewbie%0Ahttp%3A//www.rennkuckuck.de/linux/%0ALinux-Einf%C3%BChrung%0A-%0AUmfangreiche%0AEinf%C3%BChrung%0Aund%0ANachschlagewerk%0Ahttp%3A//user.cs.tu-berlin.de/~milenium/index.html%0AN%C3%BCtzliches%0Af%C3%BCr%0AEinsteiger%0Ahttp%3A//www.guug.de/~winni/linux/%0ACrash-Kurs%0Ahttp%3A//www.zfescht.ch/pages/info/linux/linux.htm%0AWichtige%0ALinux-Befehle%0ASOFTWARE%0Af%0Af%0Ahttp%3A//univie.linuxberg.com/%0ALinuxberg%0A-%0ABei%0AWindows-UserInnen%0Aals%0ATUCOWS%0Abekannt%0Aftp.tuwien.ac.at/linux/%0AFTP-Server%0ATU-Wien%0ABENUTZEROBERFL%C3%84CHEN%0Af%0Af%0Ahttp%3A//www.kde.org%0AK%0ADesktop%0AEnvironment%0Ahttp%3A//www.gnome.org%0AGNOME%0AProject

At the moment, the ParZu demo server is not usable (I am getting a 504 gateway time-out), so I could not test this data there, but once it works again it should be testable using:

https://pub.cl.uzh.ch/demo/parzu/parse/?inputformat=tokenized&text=F%C3%BCr%0AEin-%0Aund%0AUmsteiger%0A%3C%0Atable%0Awidth%3D%2275%0A%25%0A%22%0Aborder%3D%220%0A%22%0A%3E%0A%3C%0Atr%0Abgcolor%3D%22%23000000%0A%22%0A%3E%0A%3C%0Atd%3E%3Cfont%0Acolor%3D%22%23FFFFFF%0A%22%0Asize%3D%222%0A%22%0Aface%3D%22Arial%0A%2C%0AHelvetica%0A%2C%0Asans-serif%22%3EEin%0ALeitfaden%0Adurch%0Aden%0ALinux-Dschungel%0A%3C%0A/font%3E%3C/td%0A%3E%0A%3C%0A/tr%0A%3E%0A%3C%0A/table%0A%3E%0ATOP-SITES%0Af%0Af%0Ahttp%3A//www.cs.Helsinki.FI/~torvalds/%0AOffizielle%0AHomepage%0Avon%0ALinus%0ATorvalds%0Ahttp%3A//www.zdnet.de/linux/%0AZDNet-Special%0Ahttp%3A//www.linux.com/%0ALinx.com%0A-%0ADie%0ALinux-Site%0Aim%0ANetz%0Ahttp%3A//www.heise.de/ix/linux/%0ADie%0ALinux-Seiten%0Ader%0AZeitschrift-IX%0Ahttp%3A//www.linux.de/%0ALinks%0A%2C%0ANews%0A%2C%0A....%0Ahttp%3A//www.it-nachrichten.de/news/98742119/html/t/links/index.html%0AUmfangreiche%0ALinkseite%0Ahttp%3A//li.luga.or.at/%0ALinux-International%0Ahttp%3A//www.linux.org/%0ALinux-User-Site%0Ahttp%3A//www.linux.at/%0ALinux%0Ain%0A%C3%96sterreich%0AKURSE%0Af%0Af%0Ahttp%3A//tux.isCool.net%0ALinux%0AAnf%C3%A4nger%0AForum%0Ahttp%3A//www.pro-linux.de/%0ANews%0Aund%0AWorksshops%0Ahttp%3A//linux-forum.notrix.de%0ALinux%0AForum%0Ahttp%3A//public.surfree.com/rimez/master.html%0AThe%0ALinux%0ANewbie%0Ahttp%3A//www.rennkuckuck.de/linux/%0ALinux-Einf%C3%BChrung%0A-%0AUmfangreiche%0AEinf%C3%BChrung%0Aund%0ANachschlagewerk%0Ahttp%3A//user.cs.tu-berlin.de/~milenium/index.html%0AN%C3%BCtzliches%0Af%C3%BCr%0AEinsteiger%0Ahttp%3A//www.guug.de/~winni/linux/%0ACrash-Kurs%0Ahttp%3A//www.zfescht.ch/pages/info/linux/linux.htm%0AWichtige%0ALinux-Befehle%0ASOFTWARE%0Af%0Af%0Ahttp%3A//univie.linuxberg.com/%0ALinuxberg%0A-%0ABei%0AWindows-UserInnen%0Aals%0ATUCOWS%0Abekannt%0Aftp.tuwien.ac.at/linux/%0AFTP-Server%0ATU-Wien%0ABENUTZEROBERFL%C3%84CHEN%0Af%0Af%0Ahttp%3A//www.kde.org%0AK%0ADesktop%0AEnvironment%0Ahttp%3A//www.gnome.org%0AGNOME%0AProject
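For anyone who wants to reproduce requests like the two above, here is a minimal sketch of how such a URL can be assembled from a token list. It assumes only what is visible in the URLs: one token per line, percent-encoded, passed as the `text` parameter with `inputformat=tokenized`. The function name `build_parse_url` is made up for illustration and is not part of ParZu.

```python
from urllib.parse import quote


def build_parse_url(base_url, tokens):
    """Build a request URL for pre-tokenized input (hypothetical helper).

    Tokens are joined with newlines (one token per line) and
    percent-encoded. quote() leaves '/' unescaped by default, which
    matches the encoding seen in the example URLs.
    """
    text = "\n".join(tokens)
    return base_url + "?inputformat=tokenized&text=" + quote(text)


if __name__ == "__main__":
    url = build_parse_url(
        "http://localhost:5003/parse/",
        ["Für", "Ein-", "und", "Umsteiger"],
    )
    print(url)
```

The resulting URL can then be fetched with any HTTP client, ideally with a client-side timeout, since a single slow document can otherwise block the caller.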

rsennrich commented 5 years ago

The short answer is that ParZu uses CYK parsing internally, and the main reason it (mostly) remains fast for long sentences is aggressive pruning of hypotheses.

It seems that in this case, the list of links is interpreted as a huge noun phrase (lines 70-140), which explodes the search space.
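As a rough intuition for why one long, flat sequence is so costly: the number of distinct binary bracketings of n tokens is the Catalan number C(n-1), which grows exponentially. The sketch below only counts unlabelled bracketings. It is not ParZu's grammar or chart, and a chart parser shares substructure instead of enumerating trees, so take the numbers purely as an illustration of how many competing analyses the pruning has to cut through. The function name is made up for this example.

```python
from math import comb


def count_binary_bracketings(n_tokens):
    """Number of binary bracketings over n_tokens leaves: Catalan(n_tokens - 1)."""
    if n_tokens < 1:
        raise ValueError("need at least one token")
    n = n_tokens - 1
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    # Compare a short phrase with a span of roughly the size mentioned above.
    for length in (4, 10, 20, 70):
        print(length, count_binary_bracketings(length))
```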

I updated parzu_server.py to catch pexpect timeouts, and there's now also an option to increase the timeout. If you want, you can also fiddle with the parameters 'levels', 'aggressive_start' and 'alter' in core/ParZu_parameters.py to do more aggressive pruning.

johann-petrak commented 5 years ago

Thanks. The text that gets passed on to ParZu is pretty rubbish; in other cases, there are artefacts of bad conversion from HTML or similar. From my point of view, I could just try to recognise some of the most frequent oddities (URLs, HTML entities, etc.) and replace them with some token that ParZu will ignore. What would be a good token for this?

rsennrich commented 5 years ago

It looks like the HTML wasn't the main problem; rather, it was the long sequence of names and similar items in the second half. My recommendation: if you extract something that was originally a list, treat each item in the list as a separate sentence. I'm not sure whether you're able to change your HTML conversion this way.
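A minimal sketch of that recommendation, under two assumptions that are not confirmed in this thread: that a blank line marks a sentence boundary in ParZu's tokenized input (worth checking against the README), and that in link lists like the one above each item starts with a URL token. Both helper names and the URL heuristic are made up for illustration.

```python
def split_list_items(tokens, starts_item=None):
    """Group tokens so that each list item becomes its own 'sentence'.

    starts_item is a hypothetical heuristic: by default a new item
    starts at every token that looks like an http(s) URL.
    """
    if starts_item is None:
        starts_item = lambda tok: tok.startswith(("http://", "https://"))
    groups = []
    for tok in tokens:
        # Open a new group at each item start, and for the very first token.
        if starts_item(tok) or not groups:
            groups.append([])
        groups[-1].append(tok)
    return groups


def to_tokenized_input(groups):
    """One token per line; a blank line between groups (assumed sentence boundary)."""
    return "\n\n".join("\n".join(group) for group in groups)


if __name__ == "__main__":
    tokens = [
        "TOP-SITES",
        "http://www.zdnet.de/linux/", "ZDNet-Special",
        "http://www.linux.com/", "Linx.com",
    ]
    print(to_tokenized_input(split_list_items(tokens)))
```

If the HTML conversion can preserve list structure directly, using the original item boundaries would be more reliable than a URL heuristic like this one.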