I will share a PubMed file with 100,000 lines with you on Google Drive.
Could you also add the minimal code needed to reproduce this?
```python
import pandas as pd
from embiggen import CorpusTransformer

min_counts = 5
window_size = 5
year = 2021

pubmed_data = pd.read_csv(
    "/projects/robinson-lab/marea/data/pubmed_cr/new/pubmed_cr.tsv",
    header=None,
    sep='\t',
    nrows=100000
)
pubmed_year = pubmed_data.loc[pubmed_data.loc[:, 1] < year]
pubmed = pubmed_year.loc[:, 2].tolist()

transformer = CorpusTransformer(
    apply_stemming=False,
    verbose=False,
    remove_stop_words=False,
    remove_punctuation=False,
    min_word_length=0,
    to_lower_case=False,
    min_count=min_counts,
    min_sequence_length=window_size * 2 + 1
)
transformer.fit(pubmed)
encoded_pubmed = transformer.transform(pubmed)
```
The path passed to `pd.read_csv` should point to the file that I shared with you.
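As a quick sanity check (not part of the original report; the column roles below are assumed), the loaded frame can be inspected for missing values before filtering:

```python
# Hypothetical diagnostic: assumed layout is column 0 = PubMed ID,
# column 1 = publication year, column 2 = abstract text.
print(pubmed_data.shape)
print(pubmed_data.dtypes)
print(pubmed_data.isna().sum())  # per-column count of missing (NaN) values
```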
There are NaN values in the PubMed file; that is what is causing the issue. I will add an exception that reports this immediately, so the error will be easier to understand.
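In the meantime, a minimal caller-side workaround, assuming the missing values sit in column 1 (year) or column 2 (text) of the file, is to drop the incomplete rows before fitting:

```python
# Sketch of a workaround: drop rows with missing year or text before fitting.
pubmed_clean = pubmed_data.dropna(subset=[1, 2])
pubmed_year = pubmed_clean.loc[pubmed_clean.loc[:, 1] < year]
pubmed = pubmed_year.loc[:, 2].astype(str).tolist()

transformer.fit(pubmed)
encoded_pubmed = transformer.transform(pubmed)
```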
Oh, OK. Thanks, Luca. I will also check with Hannah, because this is a newly generated file; I don't think we had NaN values in the second column before.
I have updated the exception so that it is more explicit for these cases. I have already published the updated version, so you can simply run `pip install embiggen -U` to get the latest.
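The actual check added to embiggen is not shown in this thread; the snippet below is only a hypothetical sketch of the kind of fail-fast validation such an exception could perform on the input documents:

```python
# Hypothetical sketch only, not the actual embiggen implementation.
# Idea: fail fast with a clear message when a document is missing or not a string.
def validate_corpus(sequences):
    for i, doc in enumerate(sequences):
        if not isinstance(doc, str):
            raise ValueError(
                f"Document at position {i} is not a string "
                f"(got {type(doc).__name__}); check the input file for NaN rows."
            )

validate_corpus(pubmed)  # would raise immediately on the NaN rows described above
```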
Hi Luca, here is the error message I get when running word embedding on the PubMed data: `multiprocessing.pool.RemoteTraceback:`