jboynyc / textnets

Text analysis with networks.
https://textnets.readthedocs.io/
GNU General Public License v3.0

Deal with empty documents #13

Closed by jboynyc 4 years ago

jboynyc commented 4 years ago
>>> import textnets as tn
>>> import pandas as pd
>>> 
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s)
Corpus(4 docs: A, B, C…)
>>> tn.Corpus(s).tokenized()
# raises an error because document B is empty (None)

Possible fixes: silently discard empty documents, discard them and emit a warning, or provide an option in the Corpus init method that controls the behavior.
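
For illustration, the three options could be sketched with plain pandas roughly as follows. This is only a sketch under assumptions, not textnets code: the helper name `drop_empty_documents` and its `on_empty` parameter are hypothetical, and treating whitespace-only strings as empty is an assumption as well.

```python
import warnings

import pandas as pd


def drop_empty_documents(docs: pd.Series, on_empty: str = "warn") -> pd.Series:
    """Hypothetical helper (not part of textnets) for handling empty documents.

    on_empty="ignore" drops empties silently, "warn" drops them and warns,
    "raise" refuses to continue if any document is empty.
    """
    if on_empty not in {"ignore", "warn", "raise"}:
        raise ValueError(f"Unknown on_empty value: {on_empty!r}")

    # Assumption: None/NaN and whitespace-only strings both count as empty.
    is_empty = docs.fillna("").astype(str).str.strip() == ""
    n_empty = int(is_empty.sum())

    if n_empty and on_empty == "raise":
        raise ValueError(f"Found {n_empty} empty document(s).")
    if n_empty and on_empty == "warn":
        warnings.warn(f"Dropping {n_empty} empty document(s).")

    # Keep only the non-empty documents; labels of the rest are preserved.
    return docs[~is_empty]


s = pd.Series(["text 1", None, "text 3", "text 4"], index=list("ABCD"))
kept = drop_empty_documents(s, on_empty="ignore")
print(list(kept.index))  # ['A', 'C', 'D']
```

With an option like this, "discard and warn" would probably be the safest default, since the user learns that the corpus shrank without the pipeline failing.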

jboynyc commented 4 years ago

New behavior:

>>> import textnets as tn
>>> import pandas as pd
>>> s = pd.Series(['text 1', None, 'text 3', 'text 4'], index=list('ABCD'))
>>> tn.Corpus(s).tokenized()
.../textnets/textnets/corpus.py:64: UserWarning: Dropping 1 empty document(s).
  warnings.warn(f"Dropping {missings} empty document(s).")
       term  n
label         
A      text  1
C      text  1
D      text  1