Open mobileunit opened 6 years ago
Hi Andy,
My apologies for the late reply. It seems to work if you use the standard options that are passed on to the pattern parser (for a list of the default values, see http://textblob-de.readthedocs.io/en/stable/api_reference.html#module-textblob_de.parsers). The main problem in your example is that the text is not tokenised properly (punctuation sticks to the previous token), which leads to a number of additional mistakes in the tagging process. In addition, the chunks are not computed properly if you use the tags=False option. If I try this:
from textblob_de import TextBlobDE as TextBlob
from textblob_de import PatternParser
blob = TextBlob("Das ist ein schönes Auto, das du dir da gekauft hast. Das finde ich richtig klasse!", parser=PatternParser(pprint=True))
blob.parse()
I get:
WORD TAG CHUNK ROLE ID PNP LEMMA
Das DT - - - - -
ist VB VP - - - -
ein DT NP - - - -
schönes NN NP ^ - - - -
Auto NN NP ^ - - - -
, , - - - - -
das WDT NP - - - -
du PRP NP ^ - - - -
dir PRP NP ^ - - - -
da IN PP - - - -
gekauft VB VP - - - -
hast NN NP - - - -
. . - - - - -
Das DT NP - - - -
finde NN NP ^ - - - -
ich PRP NP ^ - - - -
richtig JJ ADJP - - - -
klasse JJ ADJP ^ - - - -
! . - - - - -
For counting purposes, you need to exclude the rows whose chunk tag is followed by a ^ sign in the pretty_print layout, since ^ marks a token that continues the chunk started above it. However, it might be easier to use the standard layout for counting (pprint=False):
'Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O Auto/NN/I-NP/O ,/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O da/IN/B-PP/O gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O Das/DT/B-NP/O finde/NN/I-NP/O ich/PRP/I-NP/O richtig/JJ/B-ADJP/O klasse/JJ/I-ADJP/O !/./O/O'
This gives you the option of counting only the chunk tags that start with B-, which mark the first token of each chunk. Unfortunately, there are still quite a few tagging and chunking mistakes in this output, but this is about as accurate as you can get using the pattern library.
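As a rough sketch of the B- counting idea, the snippet below works directly on the tagged string quoted above, so it does not need textblob-de to run. The helper name count_chunks is made up for illustration, and it assumes every token has exactly the four slash-separated fields shown in that output (word, POS tag, chunk tag, PNP tag).

```python
from collections import Counter

# Tagged string copied from the pprint=False output quoted above.
parsed = (
    "Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O "
    "Auto/NN/I-NP/O ,/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O "
    "da/IN/B-PP/O gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O "
    "Das/DT/B-NP/O finde/NN/I-NP/O ich/PRP/I-NP/O richtig/JJ/B-ADJP/O "
    "klasse/JJ/I-ADJP/O !/./O/O"
)


def count_chunks(tagged):
    """Count chunks by type (hypothetical helper, not part of textblob-de).

    Assumes each whitespace-separated token looks like WORD/TAG/CHUNK/PNP.
    """
    counts = Counter()
    for token in tagged.split():
        # Split from the right so the last three fields are always
        # the POS tag, the chunk tag and the PNP tag.
        word, pos, chunk, pnp = token.rsplit("/", 3)
        # Only a B- tag opens a new chunk; I- tags continue one, O is outside.
        if chunk.startswith("B-"):
            counts[chunk[2:]] += 1
    return counts


counts = count_chunks(parsed)
for chunk_type, n in counts.most_common():
    print(chunk_type, n)
# For the string above this gives NP 4, VP 2, PP 1, ADJP 1.
```

With a live blob, the same function could presumably be fed the string returned by blob.parse() with pprint=False, though the counts will still reflect the tagging and chunking mistakes mentioned above.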
Hope this helps.
Best wishes, Markus
This helps a lot! Thank you :)
Hi there,
thanks a lot for textblob-de.
I found an issue when trying to get just the chunk information. My goal is to count the number of VP, NP, PP, ... chunks. For that, I am trying to extract only the chunks. I'm trying the following code
But then I get the POS tags in place of the chunk tags when using the pprint option. I could not find a way to get the chunks by type in order to count them. Is there a trick to do so?
WORD TAG CHUNK ROLE ID PNP LEMMA
Das - PDS - - - -
ist - VVFIN - - - -
ein - ARTIND - - - -
schönes - NN - - - -
Auto, - NN ^ - - - -
das - ARTDEF - - - -
du - PPOSAT - - - -
dir - PPER - - - -
da - KOUS - - - -
gekauft - VVFIN PNP - - -
hast. - VVFIN ^ PNP - - -
Das - ARTDEF - - - -
finde - NN - - - -
ich - PPER - - - -
richtig - ADJA - - - -
klasse! - NN - - - -
Obviously, counting the chunk tags directly gives wrong results, as each token of a chunk carries the same chunk tag. How could I get the chunk boundaries so that I can count properly? Any suggestions?
Many thanks and best regards, Andy