gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Identify code snippet in "input" fields #57

Open torshie opened 1 year ago

torshie commented 1 year ago

I want to translate the training data into another language with Google Translate. Code snippets should not be translated, so I have to replace them with placeholders before translating.

All code snippets in "output" fields are quoted with triple backticks, so they're quite easy to identify. But code snippets in "input" fields aren't quoted.

Any suggestions on identifying those in "input" fields?
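For the "output" fields, where code is already wrapped in triple backticks, the placeholder step could look roughly like the sketch below. The function names (`mask_code`, `unmask_code`) and the `__CODE_n__` placeholder format are made up for illustration, and whether a translation service leaves such placeholders untouched is an assumption that would need checking.

```python
import re

# Matches a fenced block: three backticks, any text (non-greedy), three backticks.
FENCE_RE = re.compile(r"```.*?```", re.DOTALL)


def mask_code(text):
    """Replace each fenced code block with a numbered placeholder.

    Returns the masked text and the list of original snippets.
    The placeholder format is hypothetical, not part of the dataset.
    """
    snippets = []

    def _stash(match):
        snippets.append(match.group(0))
        return f"__CODE_{len(snippets) - 1}__"

    return FENCE_RE.sub(_stash, text), snippets


def unmask_code(text, snippets):
    """Put the original snippets back in place of their placeholders."""
    for i, snippet in enumerate(snippets):
        text = text.replace(f"__CODE_{i}__", snippet)
    return text
```

The masked text would be sent for translation, and `unmask_code` would be applied to the translated result with the same `snippets` list.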

Wheeledparasite commented 1 year ago

I don't think there is an easy way to identify the code in the 'input' field. The input field represents the user's input, and users are unlikely to wrap code in triple backticks.

Perhaps you could use some type of regular expression to detect code?
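A rough sketch of what such a regular-expression heuristic might look like is below. The patterns, the `looks_like_code` name, and the threshold of two matching patterns are all assumptions chosen for illustration. They are not tuned against this dataset and will produce both false positives and false negatives.

```python
import re

# Hypothetical hints that a string contains code. Each pattern is a weak
# signal on its own, so several must match before the text is flagged.
CODE_HINTS = [
    # Common keywords from a few popular languages
    re.compile(r"\b(def|class|return|import|function|var|let|const)\b"),
    # A line ending in a brace or semicolon
    re.compile(r"[{};]\s*$", re.MULTILINE),
    # Something shaped like a call: name(args)
    re.compile(r"\w+\([^)]*\)"),
    # Comparison, compound-assignment, and arrow operators
    re.compile(r"(==|!=|<=|>=|\+=|-=|=>|->)"),
    # HTML/XML-style tags
    re.compile(r"</?[a-zA-Z][^>]*>"),
]


def looks_like_code(text, threshold=2):
    """Return True if at least `threshold` distinct hint patterns match."""
    hits = sum(1 for pattern in CODE_HINTS if pattern.search(text))
    return hits >= threshold
```

Requiring more than one matching pattern is meant to keep ordinary sentences that happen to contain a word like "return" from being flagged. Entries that are flagged would still need a manual check before being replaced with placeholders.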