[label bot] Embeddings Service should use GraphQL API to fetch issue data

Right now the embedding code uses BeautifulSoup to fetch and extract the title and body of a GitHub issue: https://github.com/kubeflow/code-intelligence/blob/9bbdce34fc0d81bfb9a63493941763771d2a0746/py/code_intelligence/embeddings.py#L36

I'm noticing that this leads to slight discrepancies in how whitespace is encoded in the resulting body compared to the data we get via the GraphQL API and/or BigQuery.

As an example, consider the issue: https://github.com/kubeflow/katib/issues/1062

Here's the body returned using GraphQL:

kind feature\r\n\r\nKatib should have functionality to save Suggestion state somewhere besides Suggestion pod. \r\nSome users would like to resume Experiments, but they don't want to have always running Suggestion deployment. For example we can use PV.\r\n\r\nWe can use `ResumeExperiment` flag from here: https://github.com/kubeflow/katib/issues/1061 to specify resuming experiment mechanism.\r\n\r\n/cc @johnugeorge @gaocegege @hougangliu @richardsliu \r\n

Here's the value returned by get_issue_text:

"/kind feature\nKatib should have functionality to save Suggestion state somewhere besides Suggestion pod.\nSome users would like to resume Experiments, but they don't want to have always running Suggestion deployment. For example we can use PV.\nWe can use ResumeExperiment flag from here: #1061 to specify resuming experiment mechanism.\n/cc @johnugeorge @gaocegege @hougangliu @richardsliu

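So the whitespace is encoded slightly differently: GraphQL returns `\r\n\r\n` between paragraphs, while get_issue_text returns a single `\n`. The scraped text also drops the backticks around ResumeExperiment and shortens the issue URL to #1061.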
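
To make the difference concrete, here is a minimal sketch that compares a short excerpt of the two bodies before and after collapsing whitespace. `normalize_whitespace` is a hypothetical helper for illustration, not something that exists in the repo, and the excerpts are shortened from the strings above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, \\r, \\n) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Excerpt of the body as returned by GraphQL (CRLF line endings, trailing space).
graphql_excerpt = (
    "besides Suggestion pod. \r\n"
    "Some users would like to resume Experiments"
)

# The same excerpt as returned by get_issue_text (single LF).
scraped_excerpt = (
    "besides Suggestion pod.\n"
    "Some users would like to resume Experiments"
)

# The raw strings differ only in whitespace...
raw_equal = graphql_excerpt == scraped_excerpt

# ...so they compare equal once whitespace is collapsed.
normalized_equal = (
    normalize_whitespace(graphql_excerpt) == normalize_whitespace(scraped_excerpt)
)
```

Note that this only papers over the whitespace difference. It would not undo the markdown rendering differences (the missing backticks and the shortened issue link), which is another reason to read the raw body from the API instead.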

Ideally this shouldn't matter: even if the whitespace differences change the embeddings, the network should arguably still learn to be invariant to these kinds of perturbations.
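
For reference, here is a rough sketch of what fetching the title and body through the GitHub GraphQL API could look like. The function names (`build_issue_payload`, `fetch_issue_text`) are made up for this example and are not the repo's existing helpers, and it assumes a personal access token is passed in by the caller. It uses only the standard library to stay self-contained.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Ask only for the raw title and markdown body of a single issue.
ISSUE_QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    issue(number: $number) {
      title
      body
    }
  }
}
"""


def build_issue_payload(owner: str, repo: str, number: int) -> dict:
    """Build the JSON payload for the issue query (hypothetical helper)."""
    return {
        "query": ISSUE_QUERY,
        "variables": {"owner": owner, "name": repo, "number": number},
    }


def fetch_issue_text(owner: str, repo: str, number: int, token: str) -> dict:
    """Return {'title': ..., 'body': ...} for one issue (hypothetical helper)."""
    payload = json.dumps(build_issue_payload(owner, repo, number)).encode("utf-8")
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)

    # GraphQL reports query failures in an "errors" field of the response body.
    if "errors" in result:
        raise RuntimeError(f"GraphQL query failed: {result['errors']}")

    issue = result["data"]["repository"]["issue"]
    return {"title": issue["title"], "body": issue["body"]}
```

Calling `fetch_issue_text("kubeflow", "katib", 1062, token)` should then return the body with the same `\r\n` encoding shown above, so the embeddings service would see the same text as the data coming from GraphQL and/or BigQuery.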

[label bot] Embeddings Service should use GraphQL API to fetch issue data #126