
vis-nlp / ChartGemma

GNU General Public License v3.0


PoT Evaluation code #1

Open Coobiw opened 1 month ago

Coobiw commented 1 month ago

Hi, thanks for your great work! Will you release your PoT evaluation code, or share some details about how you ran it on the ChartQA test split? I want to reproduce this result. Thanks for your advice and reply!

Coobiw commented 1 month ago

With my own implementation below, ChartGemma reaches 74.64 on ChartQA (64.0 human + 85.28 aug). Is there any problem with my implementation? Thanks for your advice and reply!

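        import io
        import sys

        def execute_python_code(code):
            """Run generated code and capture whatever it prints to stdout."""
            old_stdout = sys.stdout
            new_stdout = io.StringIO()
            sys.stdout = new_stdout  # redirect print() output into a buffer

            status = True
            try:
                exec(code)
            except Exception:
                status = False  # any error in the generated program counts as a failure
            finally:
                sys.stdout = old_stdout  # always restore the real stdout

            # Only return the captured text if the program ran without errors.
            output = new_stdout.getvalue() if status else None
            return output, status

        # `response` holds the program text generated by the model.
        response, status = execute_python_code(response)
        if status:
            answer = response
            print(answer)
        else:
            answer = ""
            print("error running...")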

AhmedMasryKU commented 1 month ago

Hi @Coobiw

I am cleaning up the remaining codebase and will try to release it when I get some time. In the meantime, here are some ideas that we used to optimize performance on the validation set before running the model on the test set:

Use the following prompt: "program of thought:" + question.
After executing the code and getting the output text, change "True" and "False" to "Yes" and "No".
Clean the output string from unnecessary characters (\n, ')
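
The three tips above could be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the helper names `build_pot_prompt` and `normalize_output` are hypothetical, and removing quotes only from the ends of the string is an assumption, since the thread names the characters to remove but not where they are removed.

```python
def build_pot_prompt(question: str) -> str:
    # Prompt format quoted in the thread: "program of thought:" + question.
    # Whether a space follows the colon is not stated; adjust if needed.
    return "program of thought:" + question


def normalize_output(stdout_text: str) -> str:
    # Map Python booleans printed by the program to Yes/No answers.
    text = stdout_text.replace("True", "Yes").replace("False", "No")
    # Drop surrounding whitespace (including "\n"), then surrounding single quotes.
    # Assumption: only leading/trailing characters are removed.
    return text.strip().strip("'").strip()


print(build_pot_prompt("What is the highest value?"))
print(normalize_output("True\n"))
print(normalize_output("'Canada'\n"))
```

Stripping only at the ends keeps apostrophes inside an answer intact; removing them everywhere would be an equally plausible reading of the tip.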

Also, we used the following implementation of the evaluation metric: https://github.com/vis-nlp/UniChart/blob/bd6004bc8fe9ef8ce9a6cdfd88712f845d78b918/model/chartqa_model.py#L36
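
For readers who do not want to follow the link: ChartQA is commonly scored with a "relaxed accuracy", where a numeric prediction counts as correct if it is within 5% of the target and any other prediction must match the target exactly (ignoring case). The sketch below follows that common definition; the function names are hypothetical, and whether it matches the linked implementation line for line has not been verified here.

```python
from typing import Optional, Sequence


def _to_float(text: str) -> Optional[float]:
    # Return the numeric value of `text`, or None if it is not a number.
    try:
        return float(text)
    except ValueError:
        return None


def relaxed_correct(target: str, prediction: str, tolerance: float = 0.05) -> bool:
    target_value = _to_float(target)
    prediction_value = _to_float(prediction)
    # Numeric case: allow a relative error of up to `tolerance`.
    # A target of 0 falls through to the string comparison to avoid dividing by zero.
    if prediction_value is not None and target_value:
        return abs(prediction_value - target_value) / abs(target_value) <= tolerance
    # Non-numeric case: case-insensitive exact match.
    return prediction.lower() == target.lower()


def relaxed_accuracy(targets: Sequence[str], predictions: Sequence[str]) -> float:
    # Percentage of predictions judged correct under the relaxed rule.
    hits = sum(relaxed_correct(t, p) for t, p in zip(targets, predictions))
    return 100.0 * hits / len(targets)


print(relaxed_accuracy(["100", "Yes", "Canada"], ["104", "yes", "Mexico"]))
```

Because the non-numeric branch is an exact match, output clean-up such as the True/False mapping directly affects the score.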

Coobiw commented 1 month ago

Using the following code, it reaches 76.56 (67.84 human + 85.28 aug):

        answer = answer.replace("True","Yes").replace("False","No")
        answer = answer.strip()

"Clean the output string from unnecessary characters (\n, ')": is strip() enough to do this?
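
As a point of reference for this question, Python's `str.strip()` with no argument removes only leading and trailing whitespace (which includes `\n`), so quote characters survive unless they are passed explicitly. A small illustration, using a made-up captured output:

```python
raw = "'Yes'\n"  # hypothetical stdout captured from a generated program

whitespace_only = raw.strip()       # trailing newline removed, quotes kept
quotes_too = raw.strip("\n' ")      # newline, spaces, and single quotes removed from both ends

print(repr(whitespace_only))
print(repr(quotes_too))
```

Neither call touches characters in the middle of the string.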

AhmedMasryKU commented 1 month ago

I will clean up and share the code that you can use to reproduce the results by this weekend. Sorry, I am a bit busy today and tomorrow.

Coobiw commented 1 month ago

OK! Thanks for your help!!