kevinlu1248 / pyate

PYthon Automated Term Extraction
https://kevinlu1248.github.io/pyate/
MIT License

Crash on short input; len(technical_counts) == 0 after filtering #1

Closed. james-daily closed this issue 4 years ago.

james-daily commented 4 years ago

In combo_basic.py:

    if len(technical_counts) == 0:
        return pd.Series()

    order = sorted(
        list(technical_counts.keys()), key=TermExtraction.word_length, reverse=True
    )

    if not have_single_word:
        order = list(filter(lambda s: TermExtraction.word_length(s) > 1, order))

    technical_counts = technical_counts[order]

    df = pd.DataFrame(
        {
            "xlogx_score": technical_counts.reset_index()
            .apply(
                lambda s: math.log(TermExtraction.word_length(s["index"])) * s[0],
                axis=1,
            )
            .values,
            "times_subset": 0,
            "times_superset": 0,
        },
        index=technical_counts.index,
    )

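The call to pd.DataFrame() can fail if technical_counts is empty after the reindexing step technical_counts = technical_counts[order]. This happens when have_single_word is False and every candidate term is a single word: the first check passes, but the filter leaves order empty. A second check for an empty Series, placed right after the reindexing, avoids this, e.g.: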

    technical_counts = technical_counts[order]

    if len(technical_counts) == 0:
        return pd.Series()
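
For illustration, here is a self-contained sketch of the same two-guard pattern outside pyate. The names word_length and score_terms are hypothetical stand-ins, not pyate's API, and the scoring is reduced to the log(word count) * frequency term from the snippet above. It assumes the counts arrive as a pandas Series indexed by term.

```python
import math

import pandas as pd


def word_length(term: str) -> int:
    """Count whitespace-separated words (hypothetical stand-in helper)."""
    return len(term.split())


def score_terms(counts: pd.Series, have_single_word: bool = False) -> pd.Series:
    """Toy scorer: log(word count) * frequency, guarded against empty input."""
    # First guard: nothing was extracted at all.
    if len(counts) == 0:
        return pd.Series(dtype="float64")

    # Longest terms first, as in the original snippet.
    order = sorted(counts.index, key=word_length, reverse=True)
    if not have_single_word:
        order = [t for t in order if word_length(t) > 1]

    counts = counts.loc[order]

    # Second guard: the filter above may have removed every candidate.
    if len(counts) == 0:
        return pd.Series(dtype="float64")

    scores = [math.log(word_length(t)) * n for t, n in counts.items()]
    return pd.Series(scores, index=counts.index)


# Only single-word candidates: the second guard returns an empty Series.
only_single = score_terms(pd.Series({"sentence": 1, "short": 1}))

# A multi-word candidate survives the filter and gets a score.
mixed = score_terms(pd.Series({"term extraction": 2, "short": 1}))
```

Passing an explicit dtype to the empty pd.Series is a choice made for this sketch, so that the return type does not depend on the pandas version's default for empty Series.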

Minimal working example:

    import spacy
    from pyate.term_extraction_pipeline import TermExtractionPipeline
    nlp = spacy.load("en")
    nlp.add_pipe(TermExtractionPipeline())
    text = "This sentence is short."
    nlp(text)

kevinlu1248 commented 4 years ago

Fixed in https://github.com/kevinlu1248/pyate/commit/614a4ec97a291a2c19955cacb6fb11acbadbf644. Thanks for pointing it out, @james-daily. Please update to pyate==0.3.2 for the fix.