openai / openai-go

The official Go library for the OpenAI API
Apache License 2.0

Control characters cause the embeddings endpoint to return a "400 Bad Request" response due to an invalid JSON body #113

Open Heremeus opened 3 weeks ago

Heremeus commented 3 weeks ago

Calling Embeddings.New with input text that contains control characters (U+0000 - U+001F and U+007F - U+009F) results in the following error:

POST "https://api.openai.com/v1/embeddings": 400 Bad Request
{
  "error": {
    "message": "We could not parse the JSON body of your request. (HINT: This likely means you aren't using your HTTP library correctly. The OpenAI API expects a JSON payload, but what was sent was not valid JSON. If you have trouble figuring out how to fix this, please contact us through our help center at help.openai.com.)",
    "type": "invalid_request_error",
    "param": null,
    "code": null 
  } 
}

Here are two input texts that trigger the error. The first contains an EOT (U+0004) control character; the second contains DLE (U+0010) and DC1 (U+0011).

"This is the best tl;dr I could make, [original](https://www.princeton.edu/rpds/events_archive/repository/Naidu040313/Flood_HN_Jan2013.pdf) reduced by 99%. (I'm a bot) ***** > Outcome Y in county c and year t is regressed on the fraction of county land flooded in 1927, state-by-year fixed effects, and county fixed effects: Yct = βt F ractionF loodc + αst + αc + ct Note that  is allowed to vary by year, so each estimated  is interpreted as the average difference between flooded counties and non-flooded counties in that year relative to the omitted base year of 1925 or 1920.  > 53 In a modified version of equation, the fraction of county flooded is interacted with a dummy variable for whether the county is a "plantation county" and a dummy variable for whether the county is a "nonplantation county.  > Column reports the within-state difference for each county characteristic by the fraction of the county flooded in 1927: the coefficients are estimated by regressing the indicated county characteristic on the fraction of the county flooded in 1927 and a state fixed effect, weighting by county size.   ***** [**Extended Summary**](http://np.reddit.com/r/autotldr/
"This is the best tl;dr I could make, [original](https://www.richmondfed.org/-/media/richmondfedorg/publications/research/working_papers/2017/pdf/wp17-12.pdf) reduced by 98%. (I'm a bot) ***** > 7 3 Local Dynamics The local dynamics of the simple search and matching model have been studied by Krause and Lubik.  > In the previous literature, for example Mendes and Mendes and Bhattacharya and Bunzel, the backward dynamics are defined via the map g by rearranging to isolate θt :  1/ξ  θt = aθt+1  cθt+1 + d = g. 11 Under risk aversion, the dynamics depend on the time path of output yt.  > 4.2 Stability Properties We now study the dynamics of the backward map zt = f. We first establish the properties of the function f. We then study the stability properties of the steady state, where we distinguish between two broad areas of dynamics in the backward map, namely stable and unstable.   ***** [**Extended Summary**](http://np.reddit.com/r/autotldr/comments/788alk/fed_global_dynamics_in_a_search_and_matching/) | [FAQ](http://np.reddit.com/r/autotldr/comments/31b9fm/faq_autotldr_bot/ ""Version 1.65, ~233564 tl;drs so far."") | [Feedback](http://np.reddit.com/message/compose?to=%23auto
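For comparison, JSON (RFC 8259) requires that U+0000 through U+001F be escaped inside string values. The sketch below uses only Go's standard encoding/json package to show what an escaped EOT looks like on the wire. The helper name encodeJSONString is hypothetical, and this says nothing about how openai-go serializes requests internally; it only illustrates what a valid body would contain.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeJSONString is a hypothetical helper for illustration only.
// It returns the JSON string literal that encoding/json produces for s.
func encodeJSONString(s string) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// "\x04" is the EOT control character mentioned above.
	out, err := encodeJSONString("a\x04b")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out) // prints "a\u0004b" with the control character escaped
}
```

A request body that carries the raw byte 0x04 inside a string, rather than the six-character escape \u0004, is not valid JSON, which would be consistent with the parse error shown above.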

This is the code snippet showing the usage:

input := openai.EmbeddingNewParamsInputArrayOfStrings([]string{text})
response, err := c.client.Embeddings.New(
    c.ctx,
    openai.EmbeddingNewParams{
        Input: openai.F[openai.EmbeddingNewParamsInputUnion](input),
        Model: openai.String("text-embedding-3-small"),
        EncodingFormat: openai.F(openai.EmbeddingNewParamsEncodingFormatFloat),
    },
)

API module version: v0.1.0-alpha.33

For now I strip the control characters before sending the text to the API, but I was wondering whether the API library should be able to handle texts containing control characters.
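A minimal sketch of that workaround, assuming it is acceptable to drop every control character, including tabs and newlines. The helper name stripControlChars is hypothetical and not part of openai-go; it relies on unicode.IsControl, which matches exactly the ranges listed above (U+0000 - U+001F and U+007F - U+009F).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripControlChars is a hypothetical helper (not part of openai-go).
// It removes every rune in the Unicode Cc category, which also
// includes tab, newline, and carriage return.
func stripControlChars(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative rune tells strings.Map to drop it
		}
		return r
	}, s)
}

func main() {
	text := "flooded\x04 counties \u0010and\u0011 more"
	fmt.Printf("%q\n", stripControlChars(text)) // prints "flooded counties and more"
}
```

The cleaned string can then be passed to EmbeddingNewParamsInputArrayOfStrings as in the snippet above. If line breaks matter for the embedding, the callback could return a space for '\n' and '\t' instead of dropping them.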

jacobzim-stl commented 3 weeks ago

Thanks for raising this and for the detailed report. We'll start investigating.