markuskiller / textblob-de

German language support for TextBlob.
https://textblob-de.readthedocs.org
MIT License

PatternParser problem #17

Open mobileunit opened 6 years ago

mobileunit commented 6 years ago

Hi there,

thanks a lot for textblob-de.
I found an issue when trying to get only the chunk information. My goal is to count the number of VP, NP, PP, ... chunks, so I am trying to extract only the chunks. I'm using the following code:

from textblob_de import TextBlobDE as TextBlob
from textblob_de import PatternParser
blob = TextBlob("Das ist ein schönes Auto, das du dir da gekauft hast. Das finde ich richtig klasse!", parser=PatternParser(pprint=True, chunks=True, tags=False, relations=True, lemmata=False, tokenize=False, tagset="UNIVERSAL"))
blob.parse()

But then I get the POS tags in place of the chunk tags when using the pprint option. I could not find a way to get the chunks by type in order to count them. Is there a trick to do so?

WORD      TAG  CHUNK     ROLE  ID  PNP  LEMMA
Das       -    PDS       -     -   -    -
ist       -    VVFIN     -     -   -    -
ein       -    ARTIND    -     -   -    -
schönes   -    NN        -     -   -    -
Auto,     -    NN ^      -     -   -    -
das       -    ARTDEF    -     -   -    -
du        -    PPOSAT    -     -   -    -
dir       -    PPER      -     -   -    -
da        -    KOUS      -     -   -    -
gekauft   -    VVFIN     PNP   -   -    -
hast.     -    VVFIN ^   PNP   -   -    -
Das       -    ARTDEF    -     -   -    -
finde     -    NN        -     -   -    -
ich       -    PPER      -     -   -    -
richtig   -    ADJA      -     -   -    -
klasse!   -    NN        -     -   -    -

Obviously, simply counting the chunk tags gives wrong results, because every token in a chunk carries the same chunk tag. How can I get the chunk boundaries so that I can count properly? Any suggestions?

Many thanks and best regards, Andy

markuskiller commented 6 years ago

Hi Andy,

My apologies for the late reply. It seems to work if you use the default options that are passed on to the pattern parser (for a list of the default values, see http://textblob-de.readthedocs.io/en/stable/api_reference.html#module-textblob_de.parsers). The main problem in your example is that the text is not tokenised properly (punctuation sticks to the previous token), which leads to a number of additional mistakes in the tagging process. In addition, the chunks are not computed properly if you use the tags=False option. If I try this:

from textblob_de import TextBlobDE as TextBlob
from textblob_de import PatternParser
blob = TextBlob("Das ist ein schönes Auto, das du dir da gekauft hast. Das finde ich richtig klasse!", parser=PatternParser(pprint=True))
blob.parse()

I get:

          WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA

           Das   DT     -        -      -      -      -
           ist   VB     VP       -      -      -      -
           ein   DT     NP       -      -      -      -
       schönes   NN     NP ^     -      -      -      -
          Auto   NN     NP ^     -      -      -      -
             ,   ,      -        -      -      -      -
           das   WDT    NP       -      -      -      -
            du   PRP    NP ^     -      -      -      -
           dir   PRP    NP ^     -      -      -      -
            da   IN     PP       -      -      -      -
       gekauft   VB     VP       -      -      -      -
          hast   NN     NP       -      -      -      -
             .   .      -        -      -      -      -
           Das   DT     NP       -      -      -      -
         finde   NN     NP ^     -      -      -      -
           ich   PRP    NP ^     -      -      -      -
       richtig   JJ     ADJP     -      -      -      -
        klasse   JJ     ADJP ^   -      -      -      -
             !   .      -        -      -      -      -

For counting purposes, you need to skip the rows whose chunk tag is followed by a ^ sign in the pretty-print layout, since these rows continue a chunk that started on an earlier row. However, it might be easier to use the standard layout for counting (pprint=False):

'Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O Auto/NN/I-NP/O ,/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O da/IN/B-PP/O gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O Das/DT/B-NP/O finde/NN/I-NP/O ich/PRP/I-NP/O richtig/JJ/B-ADJP/O klasse/JJ/I-ADJP/O !/./O/O'

This lets you count only the chunk tags that start with B-, each of which marks the beginning of a chunk. Unfortunately, there are still quite a few tagging and chunking mistakes in this output, but this is about as accurate as you can get using the pattern library.
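A minimal sketch of that counting approach, working directly on the standard-layout string shown above rather than calling the parser. The function name count_chunks is made up for illustration, and the code assumes each token has exactly four slash-separated fields with the chunk tag in third position, as in this output:

```python
from collections import Counter

# Standard-layout output (pprint=False), copied from the example above.
parsed = (
    "Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O Auto/NN/I-NP/O "
    ",/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O da/IN/B-PP/O "
    "gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O Das/DT/B-NP/O finde/NN/I-NP/O "
    "ich/PRP/I-NP/O richtig/JJ/B-ADJP/O klasse/JJ/I-ADJP/O !/./O/O"
)


def count_chunks(tagged):
    """Count chunks by type, using only the B- (chunk-initial) tags."""
    counts = Counter()
    for token in tagged.split():
        # Assumed field order: word, POS tag, chunk tag, one further field.
        # Splitting from the right keeps a word that contains "/" intact.
        word, pos, chunk, _ = token.rsplit("/", 3)
        if chunk.startswith("B-"):
            counts[chunk[2:]] += 1  # strip the "B-" prefix, keep the chunk type
    return counts


print(count_chunks(parsed))
```

For the example sentence this counts 4 NP, 2 VP, 1 PP and 1 ADJP. If the parser is called with options that add further fields (for example relations or lemmata), the unpacking would need to be adjusted accordingly.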

Hope this helps.

Best wishes, Markus

mobileunit commented 6 years ago

This helps a lot! Thank you :)