piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

TypeError: a bytes-like object is required, not 'str' #698

Closed Tesfamariam closed 8 years ago

Tesfamariam commented 8 years ago

I am trying to implement dynamic topic modeling with the Python Anaconda 3.4 distribution on Linux. However, I am getting the following error: TypeError: a bytes-like object is required, not 'str'. Any idea how I could solve this problem?

gojomo commented 8 years ago

The issue tracker is for bugs/feature-requests, not support questions – those are better handled at the project discussion list: https://groups.google.com/forum/#!forum/gensim

And, you'd have to provide a lot more context/code/logging-info for us to have any idea what line of your code is triggering that error. So if you ask on the list, please better describe what you're trying to accomplish, and how.

piskvorky commented 8 years ago

Sounds like a bug report for the DTM wrapper in gensim... but a very incomplete one.

@Tesfamariam, please review the contributing guide. Add relevant information so we know what you're talking about.

Tesfamariam commented 8 years ago

Sorry for the incomplete information! Sample dataset:

    ['lecture', 'notes', 'edited', 'goos', 'hartmanis', 'van', 'leeuwen', 'berlin', 'heidelberg', 'york', 'barcelona', 'hong', 'kong', 'london', 'milan', 'paris', 'singapore', 'tokyo', 'vassil', 'alexandrov', 'jack', 'dongarra', 'benjoe', 'juliano', 'renner', 'kenneth', 'tan', 'eds', 'san', 'francisco', 'usa', 'proceedings', 'volume', 'editors', 'vassil', 'alexandrov', 'university', 'reading', 'school', 'cybernetics', 'electronic', 'engineering', 'whiteknights', 'box', 'reading', 'mail', 'alexandrov', 'rdg', 'jack', 'dongarra']

Then I feed the whole dataset to:

    class DTMcorpus(corpora.textcorpus.TextCorpus):

        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DTMcorpus(texts)

Then determined the time slices:

    my_timeslices = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    model = gensim.models.wrappers.DtmModel('/media/tesfish/data/Topic Modeling/dtm-master/bin/dtm-linux64', corpus, my_timeslices, num_topics=15, id2word=dictionary_text, initialize_lda=True)

Finally, I got the following error:

    TypeError                                 Traceback (most recent call last)
    in ()
          1 my_timeslices = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    ----> 2 model = gensim.models.wrappers.DtmModel('/media/tesfish/data/Topic Modeling/dtm-master/bin/dtm-linux64', corpus, my_timeslices, num_topics=15, id2word=dictionary_text, initialize_lda=True)

    /home/tesfish/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/dtmmodel.py in __init__(self, dtm_path, corpus, time_slices, mode, model, num_topics, id2word, prefix, lda_sequence_min_iter, lda_sequence_max_iter, lda_max_em_iter, alpha, top_chain_var, rng_seed, initialize_lda)
        123
        124         if corpus is not None:
    --> 125             self.train(corpus, time_slices, mode, model)
        126
        127     def fout_liklihoods(self):

    /home/tesfish/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/dtmmodel.py in train(self, corpus, time_slices, mode, model)
        183
        184         """
    --> 185         self.convert_input(corpus, time_slices)
        186
        187         arguments = "--ntopics={p0} --model={mofrl} --mode={p1} --initialize_lda={p2} --corpus_prefix={p3} --outname={p4} --alpha={p5}".format(

    /home/tesfish/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/dtmmodel.py in convert_input(self, corpus, time_slices)
        174
        175         with utils.smart_open(self.ftimeslices(), 'wb') as fout:
    --> 176             fout.write(six.u(str(len(self.time_slices)) + "\n"))
        177             for sl in time_slices:
        178                 fout.write(six.u(str(sl) + "\n"))

    TypeError: a bytes-like object is required, not 'str'

tmylk commented 8 years ago

Ping @bhargavvader

tmylk commented 8 years ago

@bhargavvader Do you have any thoughts on this?

bhargavvader commented 8 years ago

@tmylk will have a look.

jonathanicholas commented 8 years ago

Just a +1 -- also having this error.

jonathanicholas commented 8 years ago

Change

    with utils.smart_open(self.ftimeslices(), 'wb') as fout:

to

    with utils.smart_open(self.ftimeslices(), 'w') as fout:

as in: http://stackoverflow.com/questions/34283178/typeerror-a-bytes-like-object-is-required-not-str-in-python-and-csv

piskvorky commented 8 years ago

@boomsbloom that is not a good idea as w mode behaves differently on Windows.

The proper solution is to open the file in binary mode and store binary strings.
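To illustrate the concern about text mode, here is a minimal sketch of how a text-mode write can change the bytes that end up on disk. It uses Python's built-in `open` as a stand-in for `utils.smart_open`, and passes `newline="\r\n"` to imitate the default newline translation on Windows; the helper names `write_timeslices_text` and `read_raw` and the file names are hypothetical and not part of gensim.

```python
import os
import tempfile


def write_timeslices_text(path, time_slices, newline):
    # Text mode: every "\n" written is translated to `newline`.
    with open(path, "w", newline=newline) as fout:
        fout.write(str(len(time_slices)) + "\n")
        for sl in time_slices:
            fout.write(str(sl) + "\n")


def read_raw(path):
    # Read the file back as raw bytes to see what was actually stored.
    with open(path, "rb") as fin:
        return fin.read()


tmp_dir = tempfile.mkdtemp()
unix_path = os.path.join(tmp_dir, "unix-seq.dat")
windows_path = os.path.join(tmp_dir, "windows-seq.dat")

# newline="\n" mimics Linux/macOS; newline="\r\n" mimics the Windows default.
write_timeslices_text(unix_path, [1, 1, 1], newline="\n")
write_timeslices_text(windows_path, [1, 1, 1], newline="\r\n")

unix_bytes = read_raw(unix_path)
windows_bytes = read_raw(windows_path)

print(unix_bytes)     # b'3\n1\n1\n1\n'
print(windows_bytes)  # b'3\r\n1\r\n1\r\n1\r\n'
```

The same Python code produces different file contents depending on the newline convention, which is why a file meant for an external binary is safer to write in binary mode.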

bhargavvader commented 8 years ago

@piskvorky , could you elaborate a bit on your proposed solution? I tried poking around but am not too sure how to fix this.

piskvorky commented 8 years ago

I meant simply opening files in binary mode (rb or wb) and then storing binary strings into them. So, if the input is unicode, convert to e.g. utf8 (see gensim.utils.to_utf8()).

I am not familiar with this particular issue though, maybe it's something different. What is the actual problem, why are we storing unicode strings into binary files in this wrapper?
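As a rough sketch of the approach described above, the following reproduces the error on Python 3 and then writes the same data as UTF-8 bytes to a file opened in binary mode. It is not the actual patch from the linked PR: it uses the built-in `open` and `str.encode` in place of `utils.smart_open` and `gensim.utils.to_utf8()`, and the helper name `write_timeslices` and the file name are hypothetical.

```python
import os
import tempfile


def write_timeslices(path, time_slices):
    # Binary mode: encode each text line to UTF-8 bytes before writing.
    with open(path, "wb") as fout:
        fout.write((str(len(time_slices)) + "\n").encode("utf8"))
        for sl in time_slices:
            fout.write((str(sl) + "\n").encode("utf8"))


path = os.path.join(tempfile.mkdtemp(), "train-seq.dat")

# Reproduce the reported failure: a str written to a binary-mode file.
try:
    with open(path, "wb") as fout:
        fout.write(str(3) + "\n")
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)

# The binary-safe version succeeds and stores exactly these bytes.
write_timeslices(path, [1, 1, 1])
with open(path, "rb") as fin:
    written = fin.read()

print(written)  # b'3\n1\n1\n1\n'
```

Because the bytes are produced explicitly, the file contents do not depend on the platform's newline handling.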

tmylk commented 8 years ago

Ping @bhargavvader

bhargavvader commented 8 years ago

@Tesfamariam , do have a look at the PR, it will fix the problem. I think this issue can be closed now.

tmylk commented 8 years ago

Fixed in #768

gopi3e commented 4 years ago

Nice blog post that addresses the issue: https://webkul.com/blog/string-and-bytes-conversion-in-python3-x/