kevinlu1248 / pyate

PYthon Automated Term Extraction
https://kevinlu1248.github.io/pyate/
MIT License

ATE not behaving as in documentation #49

Open TakamotoAI opened 2 years ago

TakamotoAI commented 2 years ago

After installing pyATE as described in the documentation, running the documentation's examples gives me different, and worse, results.

Running:

from pyate import combo_basic

# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the
connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))

I obtain:

aspects of cancer                               3.348612
many aspects of cancer                          2.336294
aspects of cancer development                   2.336294
development of cancer                           2.197225
cancer development                              2.193147
many aspects                                    2.193147
cells of the tumor                              2.136294
many aspects of cancer development              2.109438
maintenance of tissue                           1.848612
cells of the tumor environment                  1.809438
connection between the inflammatory response    1.709438
maintenance of tissue homeostasis               1.586294
inflammation with particular focus              1.486294
tissue homeostasis                              1.443147
tumor environment                               1.443147
particular focus                                1.443147
dysfunctional tumor                             1.443147
tumor suppressors                               1.443147
inflammatory response                           1.443147
cancer cells                                    1.386294
genetic changes                                 1.386294
dysfunctional tumor suppressors                 1.298612
role of inflammation                            1.098612
relationship between cancer                     1.098612
hallmarks of cancer                             1.098612
death signals                                   0.693147
sufficient growth                               0.693147
tumor promotion                                 0.693147
ancillary processes                             0.693147
logical framework                               0.693147
dtype: float64

Does the package need a special config to be set?

kevinlu1248 commented 2 years ago

I think this could be because spaCy's processing is not fully deterministic, or because I generated the documented output with an earlier version of spaCy, although I'm not too sure.
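
One way to test the version hypothesis is to record which versions are installed in the environment that produced the output above, so they can be compared with the environment used for the documentation. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names (in particular `en-core-web-sm` as the spaCy model) are assumptions that may need adjusting for your setup.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pyate", "spacy", "en-core-web-sm")):
    """Return a dict mapping each distribution name to its installed
    version string, or None if the distribution is not installed.

    The default package names are assumptions, not taken from the pyate docs.
    """
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Not installed (or installed under a different distribution name).
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Posting this output alongside the results makes it easier to tell whether a spaCy or model version difference could explain the discrepancy.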