stanfordnlp / treelstm

Tree-structured Long Short-Term Memory networks (http://arxiv.org/abs/1503.00075)
GNU General Public License v2.0

Unable to convert data/glove/glove.840B.300d.txt to Torch serialized format #13

Open sravyapolisetty opened 7 years ago

sravyapolisetty commented 7 years ago

The error occurs in convert-wordvecs.lua at this line:

vecs[{i, j}] = tonumber(tokens[j + 1])


The conversion fails because the data/glove/glove.840B.300d.txt file contains non-ASCII characters and bytes that are not valid UTF-8. Has anyone else run into this issue with data/glove/glove.840B.300d.txt?
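One way to see which lines trip the converter is to check each line's shape before converting it. The sketch below is plain Lua with no Torch dependency; `check_line` is a hypothetical helper, not part of convert-wordvecs.lua, and it assumes a well-formed line has exactly one word followed by `dim` numeric fields.

```lua
-- Hypothetical helper (not part of convert-wordvecs.lua): split a line on
-- whitespace and report why it would fail the conversion, if it would.
local function check_line(line, dim)
  local tokens = {}
  for tok in line:gmatch('%S+') do
    tokens[#tokens + 1] = tok
  end
  -- A well-formed line is assumed to be: word, then dim numbers.
  if #tokens ~= dim + 1 then
    return false, 'expected ' .. (dim + 1) .. ' fields, got ' .. #tokens
  end
  for j = 2, #tokens do
    if tonumber(tokens[j]) == nil then
      return false, 'field ' .. j .. ' is not a number: ' .. tokens[j]
    end
  end
  return true
end
```

Calling it on every line read from the file and printing the line number whenever the first return value is false would show whether the failures come from extra fields (for example, a word that itself contains whitespace) or from genuinely non-numeric values.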

I changed the for loop so that a line is skipped (no conversion to number and nothing written to vocab) when its second token is not a number. The failure happens because tonumber returns nil when the token is not a number.

for i = 1, count do
 -- repeat ... until true runs once; break inside it acts like "continue"
 repeat
  xlua.progress(i, count)
  local tokens = stringx.split(file:read())
  -- skip the line if the second token is not a number
  if tonumber(tokens[2]) == nil then break end
  local word = tokens[1]
  vocab:write(word .. '\n')
  for j = 1, dim do
   vecs[{i, j}] = tonumber(tokens[j + 1])
  end
 until true
end

This fix avoids the error, but I would like to know whether it is the correct solution to the problem.
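One thing to watch with skipping lines: when a line is skipped, row i of vecs is left unfilled while vocab gets no entry, so every later word lands on a vocab line number that no longer matches its row in vecs. A separate counter for written rows would keep the two aligned. An alternative that avoids skipping is sketched below in plain Lua. It assumes the failing lines are ones where the word itself contains whitespace, so that the last `dim` fields are still the vector; that assumption has not been checked against the file here. `parse_line` is a hypothetical helper, not part of convert-wordvecs.lua.

```lua
-- Hypothetical helper: treat the last `dim` whitespace-separated fields as the
-- vector and everything before them as the word. Returns nil for lines that
-- still cannot be parsed, so the caller can decide how to handle them.
local function parse_line(line, dim)
  local tokens = {}
  for tok in line:gmatch('%S+') do
    tokens[#tokens + 1] = tok
  end
  if #tokens < dim + 1 then return nil end
  local first = #tokens - dim + 1   -- index of the first vector component
  local vec = {}
  for j = first, #tokens do
    local x = tonumber(tokens[j])
    if x == nil then return nil end
    vec[#vec + 1] = x
  end
  -- Rejoin the leading fields with single spaces to recover the word.
  local word = table.concat(tokens, ' ', 1, first - 1)
  return word, vec
end
```

In the conversion loop, the returned word would be written to vocab and the returned values copied into the next free row of vecs, with the row counter advanced only when `parse_line` returns a word.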