neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Use figure index rather than xml:id attribute, as this is not always present #51

Closed elshimone closed 11 months ago

elshimone commented 11 months ago

The xml:id attribute is not always present on figures. This is particularly true when loading papers that are parsed from PDFs, e.g. #46.

If this just needs to be unique within the scope of the document, then perhaps we can use an index instead.

davidmezzetti commented 11 months ago

Thank you for the PR! I know that issue has been open a while.

The xml:id has some value in that it shows whether it's a figure or a table. The section name doesn't matter all that much; in fact, in some cases it's empty.

How about going with this?

        # Extract text from tables
        for i, figure in enumerate(soup.find("text").find_all("figure")):
            # Use XML Id (if available) as figure name to ensure figures are uniquely named
            name = figure.get("xml:id")
            name = name.upper() if name else f"FIGURE_{i}"

            # Search for table
            table = figure.find("table")
            if table:
                sections.extend([(name, x) for x in Table.extract(table)])

elshimone commented 11 months ago

Yes, sounds good. FYI, I did see a mixture of figures with and without ids (or rather, figure-like ids, e.g. fig_1, fig_2, , fig_3, fig_4, ...) within the same XML document, so I decided to bin it. It makes sense to preserve it where possible, though.
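
For anyone who wants to try the fallback naming discussed above in isolation, here is a minimal sketch. It is not the paperetl code: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so it runs without extra dependencies, and the `SAMPLE` document and `figure_names` helper are hypothetical names made up for illustration. The sample mixes figures with and without an `xml:id`, as described in the thread.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI,
# so the xml:id attribute is looked up under the qualified key below.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI-like fragment: the second figure has no xml:id
SAMPLE = """<TEI>
  <text>
    <figure xml:id="fig_0"><head>Figure 1</head></figure>
    <figure><table><row><cell>a</cell></row></table></figure>
    <figure xml:id="tab_0"><table><row><cell>b</cell></row></table></figure>
  </text>
</TEI>"""


def figure_names(xml):
    """Return one name per figure: the upper-cased xml:id, else FIGURE_<index>."""
    root = ET.fromstring(xml)
    names = []
    for i, figure in enumerate(root.find("text").iter("figure")):
        # Keep the XML id when present, since it hints at figure vs. table
        name = figure.get(XML_ID)
        # Fall back to the position in the document, which is always unique
        names.append(name.upper() if name else f"FIGURE_{i}")
    return names


print(figure_names(SAMPLE))  # ['FIG_0', 'FIGURE_1', 'TAB_0']
```

Because the fallback uses the enumeration index rather than a separate counter, names stay unique within a document even when ids are present on only some figures.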