mitko / readable_climate_reports

Make climate reports machine readable, so they can be rendered in various inclusive ways
MIT License

Discuss Readme #2

Open mitko opened 2 years ago

mitko commented 2 years ago

From @petermr 's branch:

readable_climate_reports

(PMR: editing in petermr branch and adding comments)

Purpose

Make climate reports machine readable, so they can be rendered in various inclusive ways.

Yes - also searchable by text and data content

We aim to create software tools to enable the parsing, understanding and rendering of climate reports by globally significant organizations like IPCC, UNEP, and UNFCCC.

Yes - and maybe also preprints and other authorities

We hope this enables people around the world to more easily digest these reports directly, instead of always having to rely on media or influencers explaining them.

YES this is a key point. We can add explanations of terms via our dictionaries, based on Wikidata

Scope

This is a new project. We start with the latest report as of April 8 2022, which is the Mitigation section of the Sixth Assessment Report and can be found at https://www.ipcc.ch/report/ar6/wg3/

We will cover the executive summary, the technical summary, and eventually the full report.

PMR I think the full report covers all sections

Approach

1. Parsing

First, we have to parse the report into a lossless representation. The output is a data structure which contains each page, each line, and each figure.

PMR. pages and lines are artefacts of PDF. We can move beyond them rapidly I hope. Figures contain text-captions and bitmap/pixel images (very few vector graphics)

Yes I am working on 10 pages

Yes - characters+coordinates+style , images, and a few vector graphics
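To make the "characters+coordinates+style" idea concrete, here is a minimal sketch of how positioned characters could be grouped back into text lines. The `Glyph` record and the `y_tolerance` parameter are assumptions made for illustration; they are not the output format of any particular PDF tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One character with position and style (hypothetical record shape)."""
    char: str
    x: float    # horizontal position on the page
    y: float    # vertical position of the baseline
    font: str
    size: float


def group_into_lines(glyphs, y_tolerance=2.0):
    """Group glyphs with nearly equal y into lines, then read each line left to right."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        # Start a new line when the baseline jumps by more than the tolerance.
        if lines and abs(lines[-1][-1].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return ["".join(g.char for g in sorted(line, key=lambda g: g.x)) for line in lines]
```

Here smaller `y` is assumed to mean "higher on the page"; some extractors use the opposite convention, in which case the sort order would need flipping.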

2. Logical rendering

Once we have this basic data structure, we can convert it to a logical representation having paragraphs, nested paragraphs, sections, subsections, figures, references, datasets, etc.

Fully agreed. Am working on this - I have a lot of experience. There may be some tools I don't know which add functionality

Again, this can be saved into a SQLite DB, but the types of objects would now match the way people may explain these when talking to one another.

do we need a DB? I have used Elastic in the past but I "lost touch" with the text. XML is good for this

At this stage we can also determine confusing acronyms, create glossary, link sections based on references, etc.

our dictionary structures are designed to support this. We create many different dictionaries for different purposes.
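As a rough illustration of dictionary-backed acronym detection, the sketch below scans text for runs of capital letters and looks them up in a small glossary. The `GLOSSARY` contents and the `find_acronyms` helper are made up for this example; the project's real dictionaries are not shown in this thread and would presumably be far richer.

```python
import re

# Tiny illustrative glossary; a real dictionary would be much larger.
GLOSSARY = {
    "IPCC": "Intergovernmental Panel on Climate Change",
    "UNEP": "United Nations Environment Programme",
    "UNFCCC": "United Nations Framework Convention on Climate Change",
}

# Two or more consecutive capital letters standing as a whole word.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text):
    """Return (known, unknown): glossary hits, and acronyms missing from the glossary."""
    known, unknown = {}, set()
    for match in ACRONYM.finditer(text):
        term = match.group()
        if term in GLOSSARY:
            known[term] = GLOSSARY[term]
        else:
            unknown.add(term)
    return known, sorted(unknown)
```

The `unknown` list is the useful part for building a glossary: it flags candidate terms that still need a definition.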

3. Visual rendering

Having parsed and understood the data, we can export it to Markdown, HTML, Roam/Notion or other formats.

Yes

At the very least, we'd like to create a lightweight version that everyone can access. Over time, there can be more and more rendering environments.

Yes. Some people use XSLT stylesheets for this, others use CSS. Main thing is to have a base of XML/HTML

We can also consider exposing APIs, to make it easy for others to integrate with this work.

Key thing is to have all sections identified by uniqueIds
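One possible way to get unique IDs on every section is to derive them from the position in the section tree while building the XML/HTML. This is only a sketch: the `(title, paragraphs, subsections)` tuple shape, the `sec` prefix and the `build_html` name are assumptions for the example, not an agreed scheme.

```python
import xml.etree.ElementTree as ET


def build_html(sections, prefix="sec"):
    """Build a <body> tree where every section and paragraph carries a unique id.

    `sections` is a list of (title, paragraphs, subsections) tuples,
    where `subsections` has the same shape recursively.
    """
    body = ET.Element("body")

    def add(parent, items, path):
        for i, (title, paragraphs, children) in enumerate(items, start=1):
            sec_id = f"{path}-{i}"  # e.g. sec-1, sec-1-2
            div = ET.SubElement(parent, "div", {"id": sec_id, "class": "section"})
            ET.SubElement(div, "h2").text = title
            for j, text in enumerate(paragraphs, start=1):
                ET.SubElement(div, "p", {"id": f"{sec_id}-p{j}"}).text = text
            add(div, children, sec_id)

    add(body, sections, prefix)
    return body
```

Calling `ET.tostring(build_html(sections), encoding="unicode")` gives a string that XSLT or CSS can then style. Position-based IDs change if sections are reordered, so IDs taken from the report's own section numbering may be a better long-term choice.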

4. Searching and annotation

Each section has various implied semantics and ontology. We can use supervised classifiers (with dictionaries) to determine the classes.

We can also build indexes based on text. Good candidates include:

mitko commented 2 years ago

Yes - also searchable by text and data content

💯. Did you see the tweet about Weaviate? If we are able to do semantic indexing, that would open up some really interesting queries, where people won't need to know jargon.

PMR I think the full report covers all sections

Correct, though it is not the same I think as executive vs technical. Executive is heavily edited.

PMR. pages and lines are artefacts of PDF. We can move beyond them rapidly I hope. Figures contain text-captions and bitmap/pixel images (very few vector graphics)

Yes, agreed. We want to work on the level of sections/subsections/titles, etc :) I was thinking parsing the PDF would be hard, but it seems like there's some good tooling, so we may not need another intermediate representation - just go straight to logical representation.

I think XML/SVG is better for text as it is designed for styling and hyperlinks. Will explain why.

I haven't used XML much :)

do we need a DB? I have used Elastic in the past but I "lost touch" with the text. XML is good for this

SQLite is just a file that is formatted like a database
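To show what "just a file" means in practice, here is a small sketch using Python's built-in `sqlite3` module. The `section` table layout and the `save_sections` / `search` helpers are assumptions for illustration only.

```python
import sqlite3


def save_sections(db_path, sections):
    """Write (id, title, body) rows into a single SQLite file at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS section "
                "(id TEXT PRIMARY KEY, title TEXT, body TEXT)"
            )
            conn.executemany("INSERT OR REPLACE INTO section VALUES (?, ?, ?)", sections)
    finally:
        conn.close()


def search(db_path, word):
    """Return ids of sections whose body contains `word` (simple substring match)."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id FROM section WHERE body LIKE ? ORDER BY id", (f"%{word}%",)
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

The whole database lives in the one file at `db_path`, so it can be committed, copied or shipped alongside the XML rather than replacing it.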