Open · neoVincent opened this issue 4 years ago
In the part that computes the similarity of all documents:
```python
# Compute cosine similarity between the query vector and every document's
# vector, then collect all documents sorted by similarity (descending).
cosines = docdf.rdd.map(lambda d: (core.cosine(vec.tolist(), getVec(d[0]).tolist()), d[0]))
topK = cosines.sortByKey(ascending=False, numPartitions=1).collect()
```
every time we have to call `getVec()` to retrieve the vector of a document from the database, which adds a lot of time to the execution.
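One way to avoid the per-document database round trip might be to fetch all vectors once and broadcast them to the executors. This is a minimal sketch, assuming the full vector map fits in memory; `fetch_all_vecs()` is a hypothetical helper returning `(doc_id, vector)` pairs, while `spark`, `docdf`, `vec`, and `core.cosine` are taken from the snippet above:

```python
# Hypothetical: fetch every (doc_id, vector) pair from the database once on
# the driver, then broadcast the mapping so executors read from memory
# instead of querying the database for each document.
vec_map = {doc_id: v for doc_id, v in fetch_all_vecs()}  # assumed helper
bc_vecs = spark.sparkContext.broadcast(vec_map)

cosines = docdf.rdd.map(
    lambda d: (core.cosine(vec.tolist(), bc_vecs.value[d[0]].tolist()), d[0])
)
topK = cosines.sortByKey(ascending=False, numPartitions=1).collect()
```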
Using `getVec()` is a workaround for an error when parsing the blob data into an ndarray directly from the Spark DataFrame: when converted directly in Spark, the data contains NaN values in the vector. Maybe the database and Spark use different encodings.
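If the NaNs come from a dtype mismatch between how the vectors were written and how they are read back (a guess, not confirmed in this thread), decoding the blob with an explicit dtype might avoid the workaround entirely; `np.float32` here is an assumption about the writer's dtype:

```python
import numpy as np

def decode_vec(blob: bytes) -> np.ndarray:
    # Reinterpret the raw bytes with the dtype the vectors were written with.
    # Reading float32 bytes as float64 (or vice versa) produces garbage
    # values, which can show up as NaN.
    return np.frombuffer(blob, dtype=np.float32)  # dtype is an assumption
```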
Another workaround: store the vector as a string.
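A minimal sketch of that approach; the helper names and the comma delimiter are my own, not from the repo:

```python
import numpy as np

def vec_to_str(v: np.ndarray) -> str:
    # Serialize the vector as comma-separated text before writing to the DB.
    return ",".join(str(x) for x in v)

def str_to_vec(s: str) -> np.ndarray:
    # Parse the text back into an ndarray; a text round trip sidesteps the
    # blob-encoding mismatch at the cost of storage size and parse time.
    return np.array(s.split(","), dtype=np.float64)
```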