`gen` creating unnecessary html tags that prevent extraction

fabriceyhc commented 4 months ago

The bug

For several models (e.g. Llama-3 8B/70B and Mixtral), using gen causes every output to be wrapped in HTML markup tokens:

 '###Feedback: \n'
 '```json\n'
 '{\n'
 '    "doc_id": 243,\n'
 '    "sentence_id": 36,\n'
 '    "medical_note": "Patient reports diagnosis of bipolar disorder and '
 'polysubstance abuse in the past.",\n'
 '    "Heroin Use Explanation":<||_html:<span style=\'background-color: '
 'rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;\' title=\'1.0\'>_||> "",\n'
 "<||_html:</span>_||><||_html:<span style='background-color: rgba(0.0, 165.0, "
 "0, 0.15); border-radius: 3px;' title='1.0'>_||>   "
 "<||_html:</span>_||><||_html:<span style='background-color: rgba(0.0, 165.0, "
 "0, 0.15); border-radius: 3px;' title='1.0'>_||> "
 '"<||_html:</span>_||><||_html:<span style=\'background-color: rgba(0.0, '
 "165.0, 0, 0.15); border-radius: 3px;' "
 "title='1.0'>_||>C<||_html:</span>_||><||_html:<span style='background-color: "
 "rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 "title='1.0'>_||>oc<||_html:</span>_||><||_html:<span "
 "style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 "title='1.0'>_||>aine<||_html:</span>_||><||_html:<span "
 "style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 "title='1.0'>_||> Use<||_html:</span>_||><||_html:<span "
 "style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 "title='1.0'>_||> Explanation<||_html:</span>_||><||_html:<span "
 "style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 'title=\'1.0\'>_||>":<||_html:</span>_||><||_html:<span '
 "style='background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;' "
 'title=\'1.0\'>_||> "",\n'

The text I want is in there, but can't be extracted efficiently or at all.

To Reproduce

import time
import guidance
from guidance import models, gen, select

model_id = "meta-llama/Meta-Llama-3-8B-Instruct" # "TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ" # "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

llm = models.Transformers(
    model_id, 
    echo=True,
    cache_dir="/data2/.shared_models/", 
    device_map='auto'
)

@guidance
def annotate_drugabuse_w_explanations(lm, doc_id, sentence_id, medical_note):
    lm += f"""\
    ###Task Description: 
    Please carefully review the following medical note for any mentions of drug use:

    ###The medical note to evaluate:
    {medical_note}

    ###Feedback: 
    ```json
    {{
        "doc_id": {doc_id},
        "sentence_id": {sentence_id},
        "medical_note": "{medical_note}",
        "Heroin Use Explanation": "{gen('Heroin Use Explanation', stop='"')}",
        "Heroin Use": {select(options=['True', 'False'], name='Heroin')},
        "Cocaine Use Explanation": "{gen('Cocaine Use Explanation', stop='"')}",
        "Cocaine Use": {select(options=['True', 'False'], name='Cocaine')},
    }}```"""
    return lm

test = {
    "doc_id": 243,
    "sentence_id": 36, 
    "medical_note": "Patient reports diagnosis of bipolar disorder and polysubstance abuse in the past."
}

a = time.time()
output = llm + annotate_drugabuse_w_explanations(**test)
time_taken = time.time() - a
print(output)
print(f"time_taken: {time_taken}")

System info (please complete the following information):

OS: Ubuntu
Guidance Version (guidance.__version__): 0.1.15

maxencealluin commented 4 months ago

I believe that is IPython polluting the output. Change echo=True to echo=False in your model initialisation and that should fix it.

fabriceyhc commented 4 months ago

To be clear, I'm not running it inside a notebook, and even when I turn echo off I just get an empty string, even though I can see it's generating some tokens that I should be getting when echo=True.

It does the same thing for the select but it looks like it's still able to parse out the boolean I need.

Harsha-Nori commented 4 months ago

Hi Fabrice,

Just typing out a quick reply on my phone here, but you can access a clean, stored state with the getitem method on the lm objects, using the name of the generation. In your example, you assigned these names in the gen and select calls:

Cocaine

Cocaine Use Explanation

Heroin

Heroin Use Explanation

You can extract the generated values by indexing the returned object (output in your example):

output["Cocaine"]

output["Cocaine Use Explanation"]

etc.

If you want to name and capture a long segment of text (eg multiple gen calls + plaintext), you can wrap everything inside the capture function (from guidance import capture), and give everything inside the capture a name.
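
For text that has already been serialized with the markers in it (as in the output pasted at the top of the thread), one fallback is to strip them with a regular expression. This is only a minimal sketch: it assumes every marker has the `<||_html:..._||>` form seen in that output, and `strip_guidance_markup` is a hypothetical helper written for illustration, not part of guidance. Reading the named captures as described above is still the cleaner route.

```python
import re

# Assumption: display markers always look like <||_html: ... _||>,
# as in the output pasted earlier in this thread.
_MARKER = re.compile(r"<\|\|_html:.*?_\|\|>", re.DOTALL)


def strip_guidance_markup(text: str) -> str:
    """Hypothetical helper: drop inline display markers, keep the text between them."""
    return _MARKER.sub("", text)


# A fragment shaped like the reported output, where each token is wrapped separately.
sample = (
    "<||_html:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); "
    "border-radius: 3px;' title='1.0'>_||>C<||_html:</span>_||>"
    "<||_html:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); "
    "border-radius: 3px;' title='1.0'>_||>oc<||_html:</span>_||>"
    "<||_html:<span style='background-color: rgba(0.0, 165.0, 0, 0.15); "
    "border-radius: 3px;' title='1.0'>_||>aine<||_html:</span>_||>"
)

print(strip_guidance_markup(sample))  # Cocaine
```

The non-greedy `.*?` keeps each match confined to a single marker, so the generated text between an opening and a closing marker is preserved.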

Hope this helps, and sorry for any poor formatting/typos while I’m on mobile!

fabriceyhc commented 4 months ago

Got it! Strange that it was giving me empty strings when printing output, but the key extraction is working fine! Thank you so much for an amazing tool.

guidance-ai / guidance

`gen` creating unnecessary html tags that prevent extraction #861