mideind / GreynirEngine

A fast, efficient natural language processing engine for Icelandic.
https://greynir.is

Document memory usage and provide a parsing parameter to limit max tokens #21

Closed jokull closed 4 years ago

jokull commented 4 years ago

Parsing leads to extreme memory requirements when it encounters long sentences. Perhaps it should be documented that 8 GB is the minimum memory and that 100 tokens is the maximum sentence fragment length?

vthorsteinsson commented 4 years ago

Yes, agreed. It seems that 90 tokens is a reasonable maximum sentence length; anything above that starts to get really heavy, and >100 is in practice unparsable. We will add a maximum length as an optional keyword parameter to the parsing functions.
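Until such a parameter is available, a caller could apply a similar cutoff before handing text to the parser. The following is only a minimal sketch under stated assumptions: `rough_token_count`, `parse_if_short` and `MAX_SENT_TOKENS` are hypothetical names that are not part of GreynirEngine, `parse_fn` stands in for whatever parsing function is actually used, and the whitespace split only approximates the token count a real tokenizer would report.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Cutoff taken from the discussion above (about 90 tokens);
# an assumption for illustration, not an official constant of the library.
MAX_SENT_TOKENS = 90


def rough_token_count(sentence: str) -> int:
    """Approximate the token count by splitting on whitespace.

    A real tokenizer treats punctuation, dates and abbreviations
    differently, so this is only a rough estimate.
    """
    return len(sentence.split())


def parse_if_short(
    sentence: str,
    parse_fn: Callable[[str], T],
    max_tokens: int = MAX_SENT_TOKENS,
) -> Optional[T]:
    """Call parse_fn(sentence) only if the sentence is short enough.

    Returns None (without calling parse_fn) when the rough token
    count exceeds max_tokens.
    """
    if rough_token_count(sentence) > max_tokens:
        return None
    return parse_fn(sentence)
```

A caller would pass the real parsing function as `parse_fn` and treat a `None` result as "skipped, too long".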

jokull commented 4 years ago

Here is an example of an extremely difficult sentence with 63 tokens. My 16 GB MacBook Pro cannot handle it; the process was maxing out memory when I killed it with kill -9:

Að lokinni grenndarkynningu er lagt fram að nýju erindi frá afgreiðslufundi byggingarfulltrúa frá 28. apríl 2020 þar sem sótt er um leyfi til að byggja bílskúr á vesturhlið húss úr forsteyptum einingum, saga niður úr glugga á vesturhlið íbúðar 0101, koma fyrir hurð og stálbrú út á bílskúr og tröppur frá bílskúrþaki niður á lóð fjölbýlishúss nr. 38 við Drápuhlíð.
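One way to avoid killing a runaway parse by hand is to run it in a separate interpreter with a time limit. This is a generic sketch using only the Python standard library, not something GreynirEngine provides: `run_isolated` is a hypothetical helper, and the `code` string passed to it would be a small script (an assumption here) that reads the sentence from standard input, parses it and prints the result.

```python
import os
import subprocess
import sys
from typing import Optional


def run_isolated(
    code: str, stdin_text: str = "", timeout_s: float = 60.0
) -> Optional[str]:
    """Run `code` in a fresh Python interpreter with a time limit.

    Returns the child's standard output, or None if the child timed
    out or exited with a non-zero status.
    """
    # Force UTF-8 in the child so non-ASCII (e.g. Icelandic) text survives.
    env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            encoding="utf-8",
            env=env,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it on timeout.
        return None
    if done.returncode != 0:
        return None
    return done.stdout
```

A timeout bounds how long the child can keep allocating memory, but it does not cap the memory itself; an operating-system limit would be needed for that.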

vthorsteinsson commented 4 years ago

Yes, for some reason this seems to be an extraordinarily difficult sentence. It also maxed out one of our servers, which has 32 GB RAM. It will be interesting to try it on our neural-network-based parser; I will ask @haukurb to look into it.

haukurb commented 4 years ago

Some wrong attachments, but it seems to get much of the structure right.

+-S
  +-IP
    +-PP
      +-P
        +-fs_þgf: 'Að'
      +-NP
        +-so_0_et_þgf_kvk_lhþt_sb: 'lokinni'
        +-no_et_þgf_kvk: 'grenndarkynningu'
    +-VP
      +-VP-AUX
        +-so_0_p3_et_fh_nt_gm: 'er'
      +-NP-PRD
        +-NP-PRD
          +-so_0_et_hk_lhþt: 'lagt'
        +-ADVP
          +-ao: 'fram'
      +-PP
        +-P
          +-fs_þgf: 'að'
        +-NP
          +-NP
            +-lo_et_þgf_hk_sb: 'nýju'
            +-no_et_þgf_hk: 'erindi'
            +-PP
              +-P
                +-fs_þgf: 'frá'
              +-NP
                +-NP
                  +-no_et_þgf_kk: 'afgreiðslufundi'
                  +-NP-POSS
                    +-no_et_ef_kk: 'byggingarfulltrúa'
                +-ADVP
                  +-P
                    +-fs_þgf: 'frá'
                  +-NP
                    +-dagsföst: '28. apríl 2020'
                    +-P
                      +-ao: 'þar'
                      +-CP-ADV
                        +-C
                          +-C
                            +-stt: 'sem'
                          +-VP
                            +-so_0_et_nf_hk_lhþt_sb: 'sótt'
                        +-VP-AUX
                          +-so_0_p3_et_fh_nt_gm: 'er'
                      +-P
                        +-fs_þf: 'um'
                    +-no_et_þf_hk: 'leyfi'
                    +-PP
                      +-P
                        +-fs_nh: 'til'
                      +-IP
                        +-TO
                          +-nhm: 'að'
                        +-VP
                          +-VP
                            +-so_1_þf_nh_gm: 'byggja'
                          +-NP-OBJ
                            +-no_et_þf_kk: 'bílskúr'
                          +-PP
                            +-P
                              +-fs_þgf: 'á'
                            +-NP
                              +-NP
                                +-no_et_þf_kvk: 'vesturhlið'
                                +-NP-POSS
                                  +-no_et_ef_hk: 'húss'
                              +-PP
                                +-P
                                  +-fs_þgf: 'úr'
                                +-NP
                                  +-lo_ft_þgf_kvk_sb: 'forsteyptum'
                                  +-no_ft_þgf_kvk: 'einingum'
          +-grm: ','
          +-VP
            +-VP
              +-so_0_nh_gm: 'saga'
            +-PP
              +-PP
                +-ADVP
                  +-ao: 'niður'
                +-P
                  +-fs_þgf: 'úr'
                +-NP
                  +-no_et_þgf_kk: 'glugga'
              +-PP
                +-P
                  +-fs_þgf: 'á'
                +-NP
                  +-no_et_þgf_kvk: 'vesturhlið'
                  +-NP-POSS
                    +-no_et_ef_kvk: 'íbúðar'
                    +-tala_et_nf_hk: '0101'
          +-grm: ','
          +-VP
            +-VP
              +-S
                +-C
                  +-VP
                    +-VP
                      +-so_0_ft_nh_gm: 'koma'
                    +-PP
                      +-P
                        +-fs_þf: 'fyrir'
                      +-NP
                        +-NP
                          +-no_et_þf_kvk: 'hurð'
                          +-C
                            +-st: 'og'
                          +-no_et_þf_kvk: 'stálbrú'
                        +-PP
                          +-ADVP
                            +-ao: 'út'
                          +-P
                            +-fs_þf: 'á'
                          +-NP
                            +-no_et_þf_kk: 'bílskúr'
                  +-C
                    +-st: 'og'
              +-no_ft_þf_kvk: 'tröppur'
            +-PP
              +-P
                +-fs_þgf: 'frá'
              +-NP
                +-no_et_þgf_hk: 'bílskúrþaki'
          +-PP
            +-ADVP
              +-ao: 'niður'
            +-P
              +-fs_þf: 'á'
            +-NP
              +-no_et_þf_kvk: 'lóð'
              +-NP-POSS
                +-NP-POSS
                  +-no_et_ef_hk: 'fjölbýlishúss'
                  +-NP
                    +-no_hk: 'nr.'
                    +-tala_nf_hk: '38'
                +-PP
                  +-P
                    +-fs_þf: 'við'
                  +-NP
                    +-sérnafn_et_þf_kvk: 'Drápuhlíð'
+-grm: '.'

vthorsteinsson commented 4 years ago

This has been mostly solved in version 2.7.0 which is MUCH more efficient with memory on long and complex sentences. The sentence parses on our development Linux machine in 11 seconds with insignificant memory use.