tingxueronghua / ChartLlama-code

MIT License

Data release? #8

Closed echo840 closed 9 months ago

echo840 commented 9 months ago

Great job, thank you for sharing. I would like to know if you will open-source the complete dataset instead of 100 examples. If so, when can we expect it to be released?

tingxueronghua commented 9 months ago

Thanks for your interest in our work! These examples are used for evaluation (and in fact there are only around 100 examples for each category).

I am still cleaning the synthetic dataset, and I must admit that the font sizes are really hard to control, which makes some of the generated images look quite bad. I'm a bit worried that people might criticize the work because of these bad cases. Once the cleaning process finishes, I will upload the training dataset. But progress might be a little slow because I am also busy with my next project... I feel quite sorry about that.

dydxdt commented 9 months ago

> Thanks for your interest in our work! These examples are used for evaluation (and in fact there are only around 100 examples for each category).
>
> I am still cleaning the synthetic dataset, and I must admit that the font sizes are really hard to control, which makes some of the generated images look quite bad. I'm a bit worried that people might criticize the work because of these bad cases. Once the cleaning process finishes, I will upload the training dataset. But progress might be a little slow because I am also busy with my next project... I feel quite sorry about that.

So the ratio of bad cases is small and they don't affect the final training results, right? I'm also trying to use GPT-4 to generate data.

tingxueronghua commented 9 months ago

> > Thanks for your interest in our work! These examples are used for evaluation (and in fact there are only around 100 examples for each category). I am still cleaning the synthetic dataset, and I must admit that the font sizes are really hard to control, which makes some of the generated images look quite bad. I'm a bit worried that people might criticize the work because of these bad cases. Once the cleaning process finishes, I will upload the training dataset. But progress might be a little slow because I am also busy with my next project... I feel quite sorry about that.
>
> So the ratio of bad cases is small and they don't affect the final training results, right? I'm also trying to use GPT-4 to generate data.

For traditional bar charts, line charts, and pie charts, the ratio of bad cases is quite small (and the diversity mainly depends on the in-context examples you provide). But for new chart types, there are far more bad cases than for traditional charts.

The most annoying problems are the font sizes and the placement of legends, because GPT-4 does not have a good sense of how to avoid occlusion. This, too, can only be solved with good in-context examples.
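One way to make the legend-occlusion problem disappear from generated plotting code (e.g., as part of a good in-context example handed to GPT-4) is to anchor the legend outside the axes so it can never overlap the data. A minimal matplotlib sketch, with a hypothetical helper name:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

def plot_with_safe_legend(series, out_path="chart.png"):
    """Render a line chart with the legend anchored outside the axes,
    so generated labels cannot occlude the plotted data."""
    fig, ax = plt.subplots(figsize=(6, 4))
    for label, ys in series.items():
        ax.plot(range(len(ys)), ys, label=label)
    # Placing the legend outside the plot area sidesteps occlusion
    # entirely, regardless of where the data happens to fall.
    ax.legend(loc="center left", bbox_to_anchor=(1.02, 0.5))
    # bbox_inches="tight" expands the canvas so the legend is not clipped.
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)

plot_with_safe_legend({"A": [1, 3, 2], "B": [2, 1, 4]})
```

Including a pattern like this in the in-context examples biases the model toward layouts that are safe by construction, rather than hoping it reasons about overlap.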

I suggest adding a human filtering step after generation to ensure quality. This definitely improves sample quality and does not cost much, because humans only need to filter out the unreasonable cases.
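The human filtering step can be a very small script: walk the generated samples, ask a reviewer to keep or drop each one, and write the survivors out. This is only a sketch under assumed conventions (each sample as a JSON file with an `image` path); the file layout and field names are hypothetical, not from the repo:

```python
import json
from pathlib import Path

def filter_samples(sample_dir, keep_file="kept.jsonl", decide=None):
    """Keep only the generated samples a reviewer approves.

    `decide` is a callable taking a sample dict and returning True to
    keep it; by default it falls back to an interactive y/N prompt.
    """
    if decide is None:
        decide = lambda s: input(f"Keep {s['image']}? [y/N] ").strip().lower() == "y"
    samples = [json.loads(p.read_text())
               for p in sorted(Path(sample_dir).glob("*.json"))]
    kept = [s for s in samples if decide(s)]
    # Write approved samples as one JSON object per line.
    with open(keep_file, "w") as f:
        for s in kept:
            f.write(json.dumps(s) + "\n")
    return kept
```

Passing a `decide` callable also lets you chain a cheap automatic check (e.g., "does the chart code run at all?") before the human ever sees a sample.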

bill4689 commented 4 months ago

Any plans to open-source the code for generating the chart data? Thanks.

limaopeng1 commented 3 months ago

Will the training data still be open-sourced?