Help needed for codes for data compile in 10.1038/sdata.2017.127

orange-grape commented 1 year ago

Dear Prof. Elsa Olivetti, I have the honor to send this email to you. My name is SuYa Chen, currently doing my master's degree at Nankai University, majoring in computational electrocatalysis. During my literature reading, I read the paper you published titled “Data Descriptor: Machine-learned and codified synthesis parameters of oxide materials". As a beginner who has just started, your work has been of great guidance to me. Especially, you provided the code for drawing images in the article. However, preparing a similar JSON file (like you provided in https://figshare.com/s/5ff207b4c094d698ebc0) for my own system that I studied is a big challenge for me. After several weeks' failure, we decided to seek help from the top-notch scientists. With that in mind, I was wondering if it would be possible for you to provide us with the codes that could compile the JSON file for my own system. And I can assure you that this information will be used solely for research purposes and treated with the utmost confidentiality. Even though we are eagerly seeking help with the bottleneck issue, I can totally understand if it is not convenient for you to share the files. Please take my sincere apologies if our inquiry brings any inconvenience to you. It would help me a lot if you could help me with this question. At last, I would like to show my sincere gratitude and appreciation to you for your precious time to read this letter. I am sure that I will be very happy if I can get your reply. Sincerely, SuYa Chen

eddotman commented 1 year ago

I can prolly fill in quickly here with some protips:

The JSON is pretty post-processed, in the sense that it is not a single output from a single model. I wouldn't fixate on the KDEs since those are basically visualizations; the hard part is the entity extraction from text that leads to the data needed to create this aggregated JSON.
The code & the SOTA methods for extracting these kinds of entities has changed a lot since 2017. I would recommend looking into modern LLM-based methods.
You should not underestimate the challenge of getting programmatic access to the text from which you want to extract entities. I would recommend seeking approved API level access from relevant publishers, for academic text mining purposes.

ismaelsleem commented 1 month ago

@orange-grape Take a look at this (https://github.com/olivettigroup/table_extractor). It is one of the models I found that may be useful.

@eddotman, can you recommend one of these modern LLM-based methods?

eddotman commented 1 month ago

Hmmm it's quite straightforward to do a structured data extraction task these days with any modern LLM -- there are open options like Llama 3, or commercial options like Command R+, Claude, or GPT-4.

Extracting text and structuring into JSON or tables can be done quite well by any of these models.

eddotman commented 1 month ago

More specifically if you paste some data into a prompt and follow with an instruction to format that data, that is a good starting point.

Then you can use an API to repeat that process at larger scale.

ismaelsleem commented 1 month ago

Dear Professor Kim,

Thank you for your reply; it is always nice to see GitHub projects maintained by their creators.

I apologize if my question seems like a beginner's question. I have been working on ML and NLP for a little over a year now. I am still pursuing my PhD, and my thesis project mainly concerns constructing a full ML framework for chemical reaction data. This is why I have been following your work.

The LLMs mentioned are great, but they still have some limitations and lack the accuracy of a research-level model. Again, thank you for your comments.

eddotman commented 1 month ago

Interesting -- I'm surprised to hear that the latest generation of LLMs is not accurate enough at this task when used zero-shot!

In that case you could try one or more of the following:

use the chat / conversation history to manipulate the past conversation turns into few shot examples of your task
use guided generation modes to guarantee valid JSON schema output from LLMs
build an abstraction layer on top of the LLM layer, with multiple LLM types or multiple LLM calls abstracted away, such that for each piece of text, you extract N times via N calls to various LLMs, and take something like a majority / modal result (assuming there is a single "correct" extraction result)

olivettigroup / sdata-data-plots

Help needed for codes for data compile in 10.1038/sdata.2017.127 #1