olivettigroup / sdata-data-plots

Plots for "Machine-learned and codified synthesis parameters of oxide materials" in the journal Scientific Data
MIT License

Help needed for codes for data compile in 10.1038/sdata.2017.127 #1

Open orange-grape opened 1 year ago

orange-grape commented 1 year ago

Dear Prof. Elsa Olivetti,

I have the honor to send this email to you. My name is SuYa Chen, and I am currently pursuing a master's degree at Nankai University, majoring in computational electrocatalysis. During my literature reading, I read your paper titled "Data Descriptor: Machine-learned and codified synthesis parameters of oxide materials". As a beginner, I have found your work to be of great guidance, especially because you provided the code for drawing the figures in the article.

However, preparing a similar JSON file (like the one you provided at https://figshare.com/s/5ff207b4c094d698ebc0) for the system I study is a big challenge for me. After several weeks of failed attempts, we decided to seek help from top-notch scientists. With that in mind, I was wondering whether you could provide us with the code that compiles the JSON file, so that I can apply it to my own system. I can assure you that this information will be used solely for research purposes and treated with the utmost confidentiality.

Although we are eager for help with this bottleneck, I completely understand if it is not convenient for you to share the files. Please accept my sincere apologies if our inquiry causes you any inconvenience. Finally, I would like to express my sincere gratitude for the time you have taken to read this letter. I would be very happy to receive your reply.

Sincerely,
SuYa Chen

eddotman commented 1 year ago

I can prolly fill in quickly here with some protips:

ismaelsleem commented 1 month ago

@orange-grape Take a look at this (https://github.com/olivettigroup/table_extractor). It is one of the models I found that may be useful.

@eddotman, can you recommend one of these modern LLM-based methods?

eddotman commented 1 month ago

Hmmm it's quite straightforward to do a structured data extraction task these days with any modern LLM -- there are open options like Llama 3, or commercial options like Command R+, Claude, or GPT-4.

Extracting text and structuring into JSON or tables can be done quite well by any of these models.

eddotman commented 1 month ago

More specifically, if you paste some data into a prompt and follow it with an instruction to format that data, that is a good starting point.

Then you can use an API to repeat that process at larger scale.
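A minimal sketch of that paste-then-instruct pattern, repeated over several passages. Everything named here is an assumption for illustration: the JSON keys (`target`, `precursors`, `temperature_c`, `time_h`) are hypothetical and are not the schema of the figshare file, and `fake_model` is a stand-in for whichever LLM API you actually call, so swap in your own function that takes a prompt string and returns the model's reply as a string.

```python
import json
from typing import Callable, Dict, List

# Hypothetical instruction and keys; adapt them to the fields you need.
INSTRUCTION = (
    "Extract the synthesis parameters from the text above and return only a "
    "JSON object with the keys: target, precursors, temperature_c, time_h. "
    "Use null for any value that is not stated."
)


def build_prompt(passage: str) -> str:
    """Paste the data first, then follow it with the formatting instruction."""
    return f"{passage.strip()}\n\n{INSTRUCTION}"


def parse_json_reply(reply: str) -> Dict:
    """Parse the text between the first '{' and the last '}' as JSON.

    Models often wrap the JSON in extra words, so the surrounding text is dropped.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


def extract_all(passages: List[str], call_model: Callable[[str], str]) -> List[Dict]:
    """Run the same prompt template over every passage and collect the records."""
    records = []
    for passage in passages:
        reply = call_model(build_prompt(passage))
        records.append(parse_json_reply(reply))
    return records


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call; it ignores the prompt and returns a canned reply."""
    return (
        "Here is the JSON:\n"
        '{"target": "BaTiO3", "precursors": ["BaCO3", "TiO2"], '
        '"temperature_c": 1100, "time_h": 12}'
    )


if __name__ == "__main__":
    passages = ["BaTiO3 was prepared from BaCO3 and TiO2 at 1100 C for 12 h."]
    print(json.dumps(extract_all(passages, fake_model), indent=2))
```

Passing the model call in as a function keeps the prompt building and JSON parsing testable offline, and the `ValueError` from `parse_json_reply` gives you a place to log or retry passages where the reply was not valid JSON.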

ismaelsleem commented 1 month ago

Dear Professor Kim,

Thank you for your reply; it is always nice to see GitHub projects maintained by their creators.

I apologize if my question seems like a beginner's question. I have been working on ML and NLP for a little over a year now. I am still pursuing my PhD, and my thesis project mainly concerns constructing a full ML framework for chemical reaction data. This is why I have been following your work.

The LLMs mentioned are great, but they still have some limitations and lack the accuracy of a research-level model. Again, thank you for your comments.

eddotman commented 1 month ago

Interesting -- I'm surprised to hear that the latest generation of LLMs is not accurate enough at this task when used zero-shot!

In that case you could try one or more of the following: