tsdataclinic / smooshr

Tool to consolidate entries and columns from multiple datasets
https://tsdataclinic.github.io/smooshr/
Apache License 2.0
14 stars 4 forks

Batch-request embeddings from the server for performance improvement #75

Open stuartlynn opened 4 years ago

stuartlynn commented 4 years ago

Currently we send one request per unique word to the embedding server to get that word's embedding vector.

The server supports sending multiple words in a single request and returns all of their results together. We should chunk the requests so that we make fewer API calls, which should make fetching the embeddings quicker.

https://github.com/tsdataclinic/smooshr/blob/8b11ccba820434de75a62da5e00e0e336ef3414e/src/utils/calc_embedings.js#L1-L20

This is the function that needs to change: it should run the queries in batches and then assign each returned vector to the correct word once a batch has completed.
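A minimal sketch of what the batched version could look like. Everything here is an assumption for illustration, not the current code: `chunk`, `fetchEmbeddingsInBatches`, the default batch size of 100, and in particular `fetchBatch`, a hypothetical function that takes an array of words and resolves to an array of vectors in the same order. The real request and response shape would need to match whatever the server endpoint actually accepts.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Fetch embeddings for the unique words, one request per batch.
// `fetchBatch(words)` is a hypothetical helper: it is assumed to resolve
// to an array of vectors in the same order as `words`.
async function fetchEmbeddingsInBatches(words, fetchBatch, batchSize = 100) {
  const unique = [...new Set(words)];
  const embeddings = {};
  for (const batch of chunk(unique, batchSize)) {
    const vectors = await fetchBatch(batch);
    // Assign each vector back to its word by position in the batch.
    batch.forEach((word, i) => {
      embeddings[word] = vectors[i];
    });
  }
  return embeddings;
}
```

Passing `fetchBatch` in as an argument keeps the batching logic separate from the HTTP call, so it can be exercised with a fake in place of the server. The batches are sent one after another here; sending them in parallel is possible but would put more load on the server.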

Things to consider:

1) The server might fail if one or more of the words in a batch has no representation in the corpus. We would need to fix that here: https://github.com/tsdataclinic/smooshr/blob/8b11ccba820434de75a62da5e00e0e336ef3414e/server/server.py#L66-L80

2) It would also be good to give some feedback on this process in the classification interface, so that the user knows how much of the embedding has been loaded.
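On the client side, both considerations could be handled roughly as sketched below. This assumes, hypothetically, that after the server fix a word with no representation comes back as `null` in its slot instead of failing the whole request. The names `fetchEmbeddingsWithProgress`, `fetchBatch`, and `onProgress` are made up for illustration.

```javascript
// Fetch embeddings in batches, reporting progress and collecting words
// the server could not embed.
// Assumptions (not confirmed by the current server code):
//   - `fetchBatch(words)` resolves to an array aligned with `words`
//   - a word missing from the corpus is returned as null
async function fetchEmbeddingsWithProgress(words, fetchBatch, options = {}) {
  const { batchSize = 100, onProgress = () => {} } = options;
  const unique = [...new Set(words)];
  const embeddings = {};
  const missing = [];

  for (let start = 0; start < unique.length; start += batchSize) {
    const batch = unique.slice(start, start + batchSize);
    const vectors = await fetchBatch(batch);
    batch.forEach((word, i) => {
      if (vectors[i] == null) {
        missing.push(word); // no representation in the corpus
      } else {
        embeddings[word] = vectors[i];
      }
    });
    // Report how many unique words have been processed so far.
    onProgress(Math.min(start + batchSize, unique.length), unique.length);
  }

  return { embeddings, missing };
}
```

The classification interface could pass an `onProgress` callback that updates a progress bar with `loaded / total`, and use `missing` to tell the user which entries have no embedding.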