Open jlewi opened 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/feature | 0.69 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Right now the embedding code is using BeautifulSoup to fetch and extract title and body from a GitHub issue. https://github.com/kubeflow/code-intelligence/blob/9bbdce34fc0d81bfb9a63493941763771d2a0746/py/code_intelligence/embeddings.py#L36
I'm noticing that this leads to slight discrepancies in how whitespace is encoded in the resulting body compared to the data we get via the GraphQL API and/or BigQuery.
As an example, consider the issue: https://github.com/kubeflow/katib/issues/1062
Here's the body returned using GraphQL
Here's the value returned by get_issue_text
So the whitespace is encoded slightly differently.
Ideally this shouldn't matter: even if the embeddings differ because the whitespace differs, the network should arguably still learn to be invariant to these kinds of perturbations.
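One way to check that whitespace is the only difference between the two sources is to normalize both bodies before comparing them. This is a rough sketch, not what `embeddings.py` currently does: `normalize_whitespace` and `same_modulo_whitespace` are hypothetical helpers, and the two sample strings are made up for illustration rather than taken from the katib issue.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    # \s matches spaces, tabs, \r and \n, so "\r\n\r\n" and "\n\n" both become " ".
    return re.sub(r"\s+", " ", text).strip()


def same_modulo_whitespace(a: str, b: str) -> bool:
    """Return True if the two strings differ only in their whitespace."""
    return normalize_whitespace(a) == normalize_whitespace(b)


# Made-up sample bodies (assumption): one with CRLF line endings, as an API
# might return it, and one with LF line endings plus a trailing newline, as a
# scraper might produce.
graphql_body = "/kind feature\r\n\r\nDescribe the solution you'd like"
scraped_body = "/kind feature\n\nDescribe the solution you'd like\n"

print(graphql_body == scraped_body)                        # raw strings differ
print(same_modulo_whitespace(graphql_body, scraped_body))  # equal after normalizing
```

Applying the same normalization to both the training data and the text passed in at inference time would remove this source of variation, at the cost of also flattening whitespace that carries meaning, such as indentation inside code blocks.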