Use LLMs to generate and read plots. Combine them into a teacher-student pair to improve performance.
Build a tool that can aggregate information from scientific figures into structured data. For example, extract the functional properties of different opsin variants, both natural and engineered.
Design an agent that can take in a paper and extract the quantitative results into structured data.
Design an agent that takes in a data scenario and paper/figure examples and then outputs structured data, the data plotted into a figure, and several quantitative questions about the figure with difficulty rankings.
For now, I'm prototyping independent plot readers and generators, but the original idea was to combine the reader and generator into an RL training loop with the reader as student and the generator as teacher.
Input Figure (from Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics):
Prompt:
prompt = "For each plot in each panel determine the experiment in terms of independent variables (IVs) and dependent variable (DV)."
Structured Output:
Panel: a
Plot: Current traces
independent_variables=['Time', 'ChR variant'] dependent_variables=['Current']
Panel: b
Plot: Photocurrent strength with different wavelength excitation
independent_variables=['ChR variant', 'Wavelength', 'Current type (Peak/Steady state)'] dependent_variables=['Photocurrent']
Panel: c
Plot: Off-kinetics decay time
independent_variables=['ChR variant'] dependent_variables=['Off-kinetics decay time (τoff)']
Plot: Current traces for select variants
independent_variables=['Time', 'ChR variant'] dependent_variables=['Current']
Panel: d
Plot: Wavelength sensitivity
independent_variables=['Wavelength', 'ChR variant'] dependent_variables=['Normalized photocurrent']
Panel: e
Plot: Peak photocurrent vs Intensity
independent_variables=['Light intensity', 'ChR variant'] dependent_variables=['Peak photocurrent']
Plot: Steady-state photocurrent vs Intensity
independent_variables=['Light intensity', 'ChR variant'] dependent_variables=['Steady-state photocurrent']
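The output above can be held in a small nested schema. The sketch below uses standard-library dataclasses. The class names (`Figure`, `Panel`, `Plot`) are assumptions, chosen to match the `figure.panels[panel_ind].name` access used in the prompts and the `independent_variables=... dependent_variables=...` fields printed above; they are not necessarily the project's actual models.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Plot:
    name: str
    independent_variables: list[str] = field(default_factory=list)
    dependent_variables: list[str] = field(default_factory=list)


@dataclass
class Panel:
    name: str
    plots: list[Plot] = field(default_factory=list)


@dataclass
class Figure:
    panels: list[Panel] = field(default_factory=list)


# Panels c and d from the output above, entered by hand for illustration.
figure = Figure(panels=[
    Panel(name="c", plots=[
        Plot("Off-kinetics decay time", ["ChR variant"], ["Off-kinetics decay time (τoff)"]),
        Plot("Current traces for select variants", ["Time", "ChR variant"], ["Current"]),
    ]),
    Panel(name="d", plots=[
        Plot("Wavelength sensitivity", ["Wavelength", "ChR variant"], ["Normalized photocurrent"]),
    ]),
])

for panel in figure.panels:
    for plot in panel.plots:
        print(panel.name, plot.name, plot.independent_variables, plot.dependent_variables)
```

Keeping plots nested under panels makes it possible to address a plot by panel name and plot index, which is how the follow-up prompts refer to them.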
Prompts:
prompt = """In Panel {panel_name}, plot number {plot_ind}, what values are taken by the independent variable {ind_var_name}?
If the variable is not quantitative (like an image), only set the name field of IndependentVariable.
Return your answer as structured data.
""".format(
panel_name = figure.panels[panel_ind].name,
plot_ind=plot_ind+1,
ind_var_name = iv
)
prompt = """In Panel {panel_name}, plot number {plot_ind}, what statistics are used to quantify the dependent variable {dep_var_name}, such as mean or SEM?
If the variable is not quantitative (like an image), only set the name field of DependentVariable.
Return your answer as structured data.
""".format(
panel_name = figure.panels[panel_ind].name,
plot_ind=plot_ind+1,
dep_var_name = dv
)
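The two prompts above are filled in once per panel, plot, and variable. A minimal sketch of that loop for the independent-variable prompt is shown here; the `iv_prompts` helper and the plain-dict input format are hypothetical stand-ins for the project's own figure objects.

```python
IV_TEMPLATE = (
    "In Panel {panel_name}, plot number {plot_ind}, what values are taken by "
    "the independent variable {ind_var_name}?\n"
    "If the variable is not quantitative (like an image), only set the name "
    "field of IndependentVariable.\n"
    "Return your answer as structured data.\n"
)


def iv_prompts(panels):
    """Yield (panel_name, plot_number, variable_name, prompt) for every IV."""
    for panel_name, plots in panels.items():
        for plot_ind, ind_vars in enumerate(plots):
            for iv in ind_vars:
                prompt = IV_TEMPLATE.format(
                    panel_name=panel_name,
                    plot_ind=plot_ind + 1,  # plots are numbered from 1 in the prompt
                    ind_var_name=iv,
                )
                yield panel_name, plot_ind + 1, iv, prompt


# Hypothetical input: panel name -> list of plots -> list of IV names,
# taken from the first structured output above.
panels = {
    "d": [["Wavelength", "ChR variant"]],
    "e": [["Light intensity", "ChR variant"], ["Light intensity", "ChR variant"]],
}
prompts = list(iv_prompts(panels))
print(len(prompts))  # 6
```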
Structured Output:
The values for Wavelength are not correct. I hope to resolve this by allowing RAG over the full paper.
Panel d, Independent Vars:
name='Wavelength' values=[400, 450, 500, 550, 600, 650] unit='nm'
name='ChR variant' values=['ChRger1', 'ChRger2', 'ChRger3', 'ChR_25_9', 'ChR_15_10', 'CsChrimR', 'ChRiff'] unit='None'
Panel d, Dependent Vars:
name='Normalized photocurrent' statistics=['mean', 's.e.m.'] unit='None'
Step 3: Extract dependent variable statistics for each condition
Prompt:
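import pandas as pd

# One empty column per independent variable, plus one for the dependent variable.
columns = {iv.name: pd.Series(dtype=object) for iv in ind_vars}
columns[dv.name] = pd.Series(dtype=float)
df = pd.DataFrame(columns)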
prompt = """In Panel {panel_name}, what are the values of the {dep_var_stat} for the dependent variable {dep_var_name}?
If the variable is not quantitative (like an image), only set the name field of DependentVariable.
Get the value for each condition. To help, here are the values the independent variable takes:
{ind_vars}
IMPORTANT:
THE EXPECTED VALUES FOR THE INDEPENDENT VARIABLES MAY BE INCORRECT. DO YOUR BEST TO ESTIMATE THE VALUE FOR THE INPUTS FROM THE PLOT.
The column schema is the following: {schema}.
""".format(
panel_name = figure.panels[panel_ind].name,
dep_var_stat = dv.statistics[0],
dep_var_name = dv.name,
ind_vars = ind_vars,
schema = ", ".join(df.columns),
)
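"Get the value for each condition" implies one row per combination of independent-variable values. A minimal sketch of enumerating those conditions with `itertools.product` follows; the `condition_grid` helper is hypothetical, and the values are the ones reported for Panel d above (which, as noted, may be wrong).

```python
from itertools import product


def condition_grid(ind_vars):
    """Return one dict per combination of independent-variable values."""
    names = list(ind_vars)
    return [
        dict(zip(names, combo))
        for combo in product(*(ind_vars[name] for name in names))
    ]


ind_vars = {
    "Wavelength": [400, 450, 500, 550, 600, 650],
    "ChR variant": ["ChRger1", "ChRger2", "ChRger3", "ChR_25_9",
                    "ChR_15_10", "CsChrimR", "ChRiff"],
}
grid = condition_grid(ind_vars)
print(len(grid))  # 6 wavelengths x 7 variants = 42
print(grid[0])    # {'Wavelength': 400, 'ChR variant': 'ChRger1'}
```

Comparing the expected grid size with the number of rows the model returns is a cheap way to notice when the model has overridden the suggested values.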
Structured Output:
Index | Wavelength | ChR variant | Normalized photocurrent |
---|---|---|---|
0 | 400 | ChRiff | 0.65 |
1 | 450 | ChRiff | 0.95 |
2 | 500 | ChRiff | 1.0 |
3 | 550 | ChRiff | 0.95 |
4 | 600 | ChRiff | 0.65 |
5 | 650 | ChRiff | 0.55 |
6 | 400 | CsChrimR | 0.5 |
7 | 450 | CsChrimR | 0.75 |
8 | 500 | CsChrimR | 1.0 |
9 | 550 | CsChrimR | 0.95 |
10 | 600 | CsChrimR | 0.6 |
11 | 650 | CsChrimR | 0.5 |
12 | 400 | C1C2 | 0.45 |
13 | 450 | C1C2 | 0.7 |
14 | 500 | C1C2 | 0.95 |
15 | 550 | C1C2 | 1.0 |
16 | 600 | C1C2 | 0.65 |
17 | 650 | C1C2 | 0.55 |
18 | 400 | 28_10 | 0.9 |
19 | 450 | 28_10 | 1.0 |
20 | 500 | 28_10 | 0.85 |
21 | 550 | 28_10 | 0.45 |
22 | 600 | 28_10 | 0.05 |
23 | 650 | 28_10 | 0.0 |
24 | 400 | 11_10 | 0.7 |
25 | 450 | 11_10 | 1.0 |
26 | 500 | 11_10 | 0.8 |
27 | 550 | 11_10 | 0.2 |
28 | 600 | 11_10 | 0.05 |
29 | 650 | 11_10 | 0.0 |
30 | 400 | 25_9 | 0.95 |
31 | 450 | 25_9 | 1.0 |
32 | 500 | 25_9 | 0.8 |
33 | 550 | 25_9 | 0.5 |
34 | 600 | 25_9 | 0.05 |
35 | 650 | 25_9 | 0.0 |
Plotting extracted data:
Note: colors were not extracted, so they are not matched. The result is not perfect; some of the error may come from incorrectly extracted Wavelength values.
Extracted | Source |
---|---|
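Because the dependent variable is a normalized photocurrent, the extracted table can be sanity-checked before plotting. The sketch below assumes each variant's curve is normalized to its own maximum (an assumption about the figure, not something the extraction reports) and flags values outside [0, 1] or curves that never reach about 1.0. The `check_normalized` helper and the tolerance are hypothetical.

```python
from collections import defaultdict

# (wavelength, variant, normalized photocurrent): a subset of the extracted rows.
rows = [
    (400, "ChRiff", 0.65), (450, "ChRiff", 0.95), (500, "ChRiff", 1.0),
    (550, "ChRiff", 0.95), (600, "ChRiff", 0.65), (650, "ChRiff", 0.55),
    (400, "28_10", 0.9), (450, "28_10", 1.0), (500, "28_10", 0.85),
    (550, "28_10", 0.45), (600, "28_10", 0.05), (650, "28_10", 0.0),
]


def check_normalized(rows, tol=0.05):
    """Return a list of human-readable problems; empty means the checks passed."""
    by_variant = defaultdict(list)
    problems = []
    for wavelength, variant, value in rows:
        if not 0.0 <= value <= 1.0 + tol:
            problems.append(f"{variant} @ {wavelength}: {value} outside [0, 1]")
        by_variant[variant].append(value)
    for variant, values in by_variant.items():
        if abs(max(values) - 1.0) > tol:
            problems.append(f"{variant}: peak {max(values)} is not ~1.0")
    return problems


print(check_normalized(rows))  # []
```

A check like this cannot catch wrong wavelength values, but it does catch rows where the reader misjudged the y-axis scale.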
Prompt with both text and example figures.
generator = PlotGenerator(
storage_dir = "~/dev/plotreader/storage",
)
generator.generate(
data_scenario = (
"Scientists are designing new opsins by mutating existing ones. " +
"They then measure the currents produced when exciting with different wavelengths of light. " +
"They plot comparisons of the different mutants and wild-type as a function of these wavelengths."
),
examples_dir='~/dev/plotreader/real_figures/opsins'
)
Two of the example figures it was given (from Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics).
Example 1 | Example 2 |
---|---|
The questions were saved into a .json file:
[
{
"difficulty": "Easy",
"question": "Which opsin variant shows the highest peak photocurrent at the higher light intensity (0.8 mW mm^-2)?",
"answer": "ChRger1 shows the highest peak photocurrent at the higher light intensity, with a value of approximately 1500 pA."
},
{
"difficulty": "Medium",
"question": "Comparing ChR2 and ChRger3, what is the approximate percentage increase in peak photocurrent at the lower light intensity (8 \u00d7 10^-3 mW mm^-2)?",
"answer": "At the lower light intensity, ChR2 has a peak photocurrent of about 100 pA, while ChRger3 has a peak photocurrent of about 600 pA. The percentage increase is approximately (600 - 100) / 100 * 100 = 500%. ChRger3 shows a 500% increase in peak photocurrent compared to ChR2 at the lower light intensity."
},
{
"difficulty": "Hard",
"question": "Based on the spectral sensitivity plot, at which wavelength does the difference in normalized photocurrent between ChRger3 and ChR2 appear to be the greatest, and what is the approximate magnitude of this difference?",
"answer": "The difference in normalized photocurrent between ChRger3 and ChR2 appears to be greatest at around 480-490 nm. At this point, ChRger3 has a normalized photocurrent of about 0.75, while ChR2 has a normalized photocurrent of about 0.55. The magnitude of the difference is approximately 0.75 - 0.55 = 0.2, or a 20% difference in normalized photocurrent."
}
]
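Since the questions and answers are generated by a model, it can help to validate the file's structure and re-check any arithmetic quoted in an answer. The following is a minimal sketch; the `validate_questions` and `percent_increase` helpers are hypothetical, and the set of allowed difficulty labels is assumed from the three entries above.

```python
import json

ALLOWED_DIFFICULTIES = {"Easy", "Medium", "Hard"}  # assumed from the file above
REQUIRED_FIELDS = {"difficulty", "question", "answer"}


def validate_questions(text):
    """Parse the questions JSON and check each entry has the expected fields."""
    questions = json.loads(text)
    for entry in questions:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        if entry["difficulty"] not in ALLOWED_DIFFICULTIES:
            raise ValueError(f"unexpected difficulty: {entry['difficulty']}")
    return questions


def percent_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100


# Placeholder entry with the same shape as the file above.
sample = '[{"difficulty": "Medium", "question": "placeholder", "answer": "placeholder"}]'
print(validate_questions(sample)[0]["difficulty"])  # Medium

# Re-check the "Medium" answer: ChR2 at about 100 pA, ChRger3 at about 600 pA.
print(percent_increase(100, 600))  # 500.0
```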
Three .csv files were generated that contain the data used to create the figure.
Light intensity response:
Opsin | Light Intensity (mW mm^-2) | Peak Photocurrent (pA) |
---|---|---|
ChR2 | 0.008 | 100 |
ChR2 | 0.8 | 400 |
CoChR | 0.008 | 200 |
CoChR | 0.8 | 900 |
ChRger1 | 0.008 | 850 |
ChRger1 | 0.8 | 1500 |
ChRger2 | 0.008 | 600 |
ChRger2 | 0.8 | 1400 |
ChRger3 | 0.008 | 600 |
ChRger3 | 0.8 | 1300 |
Kinetic estimates:
Opsin | τoff (ms) |
---|---|
ChR2 | 30 |
CheRiff | 40 |
C1C2 | 60 |
ChRger1 | 25 |
ChRger2 | 35 |
ChRger3 | 45 |
Spectral Sensitivity:
Wavelength (nm) | Opsin | Normalized Photocurrent |
---|---|---|
400 | ChR2 | 0.036150690739091766 |
405 | ChR2 | 0.05259894465789624 |
410 | ChR2 | 0.07443440578013699 |
415 | ChR2 | 0.1024487550337355 |
420 | ChR2 | 0.13714371482751292 |
425 | ChR2 | 0.17855885704709237 |
430 | ChR2 | 0.22611175977895312 |
435 | ChR2 | 0.27848458915645535 |
440 | ChR2 | 0.3335918628419484 |
445 | ChR2 | 0.38865655282174394 |
450 | ChR2 | 0.4404055716042445 |
455 | ChR2 | 0.48537329642152754 |
460 | ChR2 | 0.520277707898721 |
465 | ChR2 | 0.542413914209154 |
470 | ChR2 | 0.55 |
475 | ChR2 | 0.542413914209154 |
480 | ChR2 | 0.520277707898721 |
485 | ChR2 | 0.48537329642152754 |
490 | ChR2 | 0.4404055716042445 |
495 | ChR2 | 0.38865655282174394 |
500 | ChR2 | 0.3335918628419484 |
505 | ChR2 | 0.27848458915645535 |
510 | ChR2 | 0.22611175977895312 |
515 | ChR2 | 0.17855885704709237 |
520 | ChR2 | 0.13714371482751292 |
525 | ChR2 | 0.1024487550337355 |
530 | ChR2 | 0.07443440578013699 |
535 | ChR2 | 0.05259894465789624 |
540 | ChR2 | 0.036150690739091766 |
545 | ChR2 | 0.024165313492874083 |
550 | ChR2 | 0.01571102543150271 |
555 | ChR2 | 0.009934657380879702 |
560 | ChR2 | 0.006109948096033269 |
565 | ChR2 | 0.003654756205507769 |
570 | ChR2 | 0.002126256076710044 |
575 | ChR2 | 0.001203120115000587 |
580 | ChR2 | 0.0006621229971555117 |
585 | ChR2 | 0.0003544089013127345 |
590 | ChR2 | 0.00018450444534638154 |
595 | ChR2 | 9.342117210877588e-05 |
600 | ChR2 | 4.6006590976350145e-05 |
400 | CoChR | 0.03668482568419145 |
405 | CoChR | 0.05033444988644888 |
410 | CoChR | 0.06766764161830635 |
415 | CoChR | 0.08913198979252396 |
420 | CoChR | 0.11503314949690453 |
425 | CoChR | 0.1454619035230741 |
430 | CoChR | 0.1802238942989105 |
435 | CoChR | 0.2187823688501346 |
440 | CoChR | 0.26022506051035105 |
445 | CoChR | 0.3032653298563167 |
450 | CoChR | 0.34628466210259884 |
455 | CoChR | 0.3874187144416247 |
460 | CoChR | 0.4246829082841562 |
465 | CoChR | 0.4561270384142726 |
470 | CoChR | 0.48000272064273886 |
475 | CoChR | 0.4949239016893669 |
480 | CoChR | 0.5 |
485 | CoChR | 0.4949239016893669 |
490 | CoChR | 0.48000272064273886 |
495 | CoChR | 0.4561270384142726 |
500 | CoChR | 0.4246829082841562 |
505 | CoChR | 0.3874187144416247 |
510 | CoChR | 0.34628466210259884 |
515 | CoChR | 0.3032653298563167 |
520 | CoChR | 0.26022506051035105 |
525 | CoChR | 0.2187823688501346 |
530 | CoChR | 0.1802238942989105 |
535 | CoChR | 0.1454619035230741 |
540 | CoChR | 0.11503314949690453 |
545 | CoChR | 0.08913198979252396 |
550 | CoChR | 0.06766764161830635 |
555 | CoChR | 0.05033444988644888 |
560 | CoChR | 0.03668482568419145 |
565 | CoChR | 0.02619657053491279 |
570 | CoChR | 0.018329020976689005 |
575 | CoChR | 0.012565244492637209 |
580 | CoChR | 0.008439942074394954 |
585 | CoChR | 0.005554498269121153 |
590 | CoChR | 0.0035816822354596085 |
595 | CoChR | 0.0022629039333833595 |
600 | CoChR | 0.0014008190174242421 |
400 | ChRger3 | 0.10150146242745953 |
405 | ChRger3 | 0.1293162179203146 |
410 | ChRger3 | 0.1621988751224155 |
415 | ChRger3 | 0.2002888764197575 |
420 | ChRger3 | 0.2434893505187623 |
425 | ChRger3 | 0.2914185956342731 |
430 | ChRger3 | 0.3433750213287107 |
435 | ChRger3 | 0.3983219932765089 |
440 | ChRger3 | 0.45489799478447507 |
445 | ChRger3 | 0.5114555633927611 |
450 | ChRger3 | 0.5661297014917555 |
455 | ChRger3 | 0.6169331717989985 |
460 | ChRger3 | 0.6618726769384466 |
465 | ChRger3 | 0.6990768692696456 |
470 | ChRger3 | 0.7269249258572581 |
475 | ChRger3 | 0.7441634536951827 |
480 | ChRger3 | 0.75 |
485 | ChRger3 | 0.7441634536951827 |
490 | ChRger3 | 0.7269249258572581 |
495 | ChRger3 | 0.6990768692696456 |
500 | ChRger3 | 0.6618726769384466 |
505 | ChRger3 | 0.6169331717989985 |
510 | ChRger3 | 0.5661297014917555 |
515 | ChRger3 | 0.5114555633927611 |
520 | ChRger3 | 0.45489799478447507 |
525 | ChRger3 | 0.3983219932765089 |
530 | ChRger3 | 0.3433750213287107 |
535 | ChRger3 | 0.2914185956342731 |
540 | ChRger3 | 0.2434893505187623 |
545 | ChRger3 | 0.2002888764197575 |
550 | ChRger3 | 0.1621988751224155 |
555 | ChRger3 | 0.1293162179203146 |
560 | ChRger3 | 0.10150146242745953 |
565 | ChRger3 | 0.07843425096675109 |
570 | ChRger3 | 0.05966963153867076 |
575 | ChRger3 | 0.044690489071489564 |
580 | ChRger3 | 0.032952700217555565 |
585 | ChRger3 | 0.023921095021617825 |
590 | ChRger3 | 0.017095635662709257 |
595 | ChRger3 | 0.012028281956512912 |
600 | ChRger3 | 0.008331747403681729 |
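The spectral values above agree closely with Gaussian curves of the form amplitude × exp(−(λ − peak)² / (2σ²)). The sketch below uses parameters inferred from the table (not taken from the generator's code, so treat them as assumptions) and uses the curves to re-check a generated question, here the largest gap between ChRger3 and ChR2.

```python
import math

# (peak wavelength in nm, amplitude, sigma in nm).
# Inferred from the table above; not taken from the generator's code.
PARAMS = {
    "ChR2": (470, 0.55, 30),
    "CoChR": (480, 0.50, 35),
    "ChRger3": (480, 0.75, 40),
}


def normalized_photocurrent(opsin, wavelength):
    """Gaussian spectral response for one opsin at one wavelength (nm)."""
    peak, amplitude, sigma = PARAMS[opsin]
    return amplitude * math.exp(-((wavelength - peak) ** 2) / (2 * sigma ** 2))


def largest_gap(opsin_a, opsin_b, wavelengths):
    """Return (wavelength, gap) where opsin_a exceeds opsin_b by the most."""
    return max(
        (
            (w, normalized_photocurrent(opsin_a, w) - normalized_photocurrent(opsin_b, w))
            for w in wavelengths
        ),
        key=lambda pair: pair[1],
    )


wavelengths = range(400, 601, 5)  # same 5 nm grid as the table
wavelength, gap = largest_gap("ChRger3", "ChR2", wavelengths)
print(wavelength, round(gap, 2))  # 510 0.34
```

On this grid the gap is largest near 510 nm at about 0.34, which is worth comparing with the "Hard" answer stored in the .json file (480-490 nm, about 0.2).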