neoVincent / Document_Similarity

Large-scale document similarity analysis on Spark

Convert document vector directly from Spark DataFrame #1

Open · neoVincent opened 4 years ago

neoVincent commented 4 years ago

In the part that computes the similarity across all documents:

    # For each document ID d[0], getVec() fetches its vector from the database
    cosines = docdf.rdd.map(lambda d: (core.cosine(vec.tolist(), getVec(d[0]).tolist()), d[0]))
    # Sort the (similarity, doc_id) pairs descending into one partition and collect on the driver
    topK = cosines.sortByKey(ascending=False, numPartitions=1).collect()
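
As an aside, sorting the whole RDD into a single partition just to collect every pair is expensive when only the best k matches matter. A sketch of a cheaper alternative using RDD.top(), where k = 10 is an assumed value not taken from this repo:

    k = 10  # assumed; set to the number of neighbors actually needed
    # top() compares the (cosine, doc_id) tuples element-wise, so the
    # pairs with the largest cosine scores come back first
    topK = cosines.top(k)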

The bigger cost, though, is that we have to call getVec() for every document to retrieve its vector from the database, which adds a lot of execution time.

  • Using getVec() is a workaround for an error when parsing blob data into an ndarray directly from the Spark DataFrame.
  • When the vector is converted directly from the Spark DataFrame, it contains NaN values.
  • The database and Spark may use different encodings; see the decoding sketch below.
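
If the NaN values do come from an encoding mismatch, decoding the blob with an explicit dtype and byte order might remove the need for the database round trip. A minimal sketch, assuming the row exposes the raw blob as d[1] and that the writer stored contiguous little-endian float64 values (both are assumptions, not confirmed here):

    import numpy as np

    def decode_vec(blob):
        # Assumed layout: little-endian float64; a wrong dtype or byte
        # order here is one way garbage/NaN values can appear
        return np.frombuffer(blob, dtype='<f8')

    cosines = docdf.rdd.map(
        lambda d: (core.cosine(vec.tolist(), decode_vec(d[1]).tolist()), d[0]))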
neoVincent commented 4 years ago

Another workaround: store the vector as a string, as sketched below.
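
A minimal sketch of that idea, assuming the vector round-trips through a comma-separated string and is read back as d[1] (the column position is an assumption):

    import numpy as np

    def vec_to_str(v):
        # Serialize the ndarray with repr() so no float precision is lost
        return ','.join(map(repr, v.tolist()))

    def str_to_vec(s):
        # Parse the comma-separated string back into a float64 ndarray
        return np.array([float(x) for x in s.split(',')])

    cosines = docdf.rdd.map(
        lambda d: (core.cosine(vec.tolist(), str_to_vec(d[1]).tolist()), d[0]))

Strings avoid any binary-encoding differences between the database driver and Spark, at the cost of larger storage and per-row parsing.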