monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.
BSD 3-Clause "New" or "Revised" License

Error with running word embedding on pubmed #237

Closed vidarmehr closed 3 years ago

vidarmehr commented 3 years ago

Hi Luca, here is the error message when running word embedding on PubMed:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ravanv/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 146, in tokenize_lines
    for line in lines
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 146, in <listcomp>
    for line in lines
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 126, in tokenize_line
    for word in word_tokenize(line.lower() if self._to_lower_case else line)
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1272, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1326, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1326, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1316, in span_tokenize
    for sl in slices:
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1357, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 314, in _pair_iter
    prev = next(it)
  File "/home/ravanv/.local/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1330, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/projects/robinson-lab/vidar/wordembedding_Pubmed_skipgram/embiggen/Word_embedding_Skipgram.py", line 32, in <module>
    transformer.fit(pubmed)
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 231, in fit
    tokens_list, counts = self.tokenize(texts, True)
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 177, in tokenize
    disable=not self._verbose
  File "/home/ravanv/anaconda3/lib/python3.7/site-packages/embiggen/transformers/corpus_transformer.py", line 168, in <listcomp>
    line
  File "/home/ravanv/.local/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/home/ravanv/anaconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
TypeError: expected string or bytes-like object
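The RemoteTraceback wrapper means the real failure happened inside a multiprocessing worker: NLTK's Punkt tokenizer ultimately runs a compiled regex over each input line, and a regex raises exactly this TypeError when handed a non-string value (for example, the float NaN that pandas uses for empty cells). A minimal sketch of the failure mode, independent of embiggen:

```python
import re

# Handing a compiled regex a non-string value -- such as the float NaN
# that pandas substitutes for empty TSV cells -- raises the same
# TypeError seen in the traceback above.
try:
    re.compile(r"\S+").finditer(float("nan"))
    message = None
except TypeError as exc:
    message = str(exc)

print(message)  # mentions "expected string or bytes-like object"
```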
vidarmehr commented 3 years ago

I will share a PubMed file with 100,000 lines with you on Google Drive.

LucaCappelletti94 commented 3 years ago

Could you also add the minimal code needed to reproduce this?

vidarmehr commented 3 years ago
import pandas as pd
from embiggen import CorpusTransformer

min_counts = 5
window_size = 5

year = 2021
pubmed_data = pd.read_csv(
    "/projects/robinson-lab/marea/data/pubmed_cr/new/pubmed_cr.tsv",
    header=None,
    sep='\t',
    nrows=100000
)
# Keep only the articles published before `year` and extract the text column.
pubmed_year = pubmed_data.loc[pubmed_data.loc[:, 1] < year]
pubmed = pubmed_year.loc[:, 2].tolist()

transformer = CorpusTransformer(
    apply_stemming=False,
    verbose=False,
    remove_stop_words=False,
    remove_punctuation=False,
    min_word_length=0,
    to_lower_case=False,
    min_count=min_counts,
    min_sequence_length=window_size * 2 + 1
)

transformer.fit(pubmed)
encoded_pubmed = transformer.transform(pubmed)
vidarmehr commented 3 years ago

The path passed to pd.read_csv should point to the file that I shared with you.

LucaCappelletti94 commented 3 years ago

There are NaN values in the PubMed file; that is what is causing the issue. I will add an exception that reports this immediately, so the error will be easier to understand.

(Screenshot from 2020-12-31 showing the NaN values in the file.)
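Until the updated exception is available, one workaround is to drop the NaN rows before fitting. A sketch using a hypothetical miniature of the same three-column TSV layout (id, year, abstract text):

```python
import pandas as pd

# Hypothetical miniature of the PubMed table: id, year, abstract text.
pubmed_data = pd.DataFrame({
    0: ["pmid1", "pmid2", "pmid3"],
    1: [2019, 2020, 2019],
    2: ["first abstract", None, "third abstract"],  # None becomes NaN
})

# Dropping rows whose text column is NaN leaves only real strings,
# which is what the tokenizer inside CorpusTransformer.fit needs.
pubmed_data = pubmed_data.dropna(subset=[2])
pubmed = pubmed_data.loc[:, 2].tolist()
print(pubmed)  # only the two non-NaN abstracts remain
```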
vidarmehr commented 3 years ago

Oh, Ok. Thanks, Luca. I will also check with Hannah because this is a new file that was generated. I don't think we had NaN values in the second column before.

LucaCappelletti94 commented 3 years ago

I have updated the exception so that it is now more explicit for these cases. The updated version is already published, so you can simply run `pip install embiggen -U` to get the latest.