Closed by elshimone 11 months ago
Thank you for the PR! I know that issue has been open a while.
The xml:id has some value in that it shows whether it's a figure or a table. The section name doesn't matter all that much; in fact, in some cases it's empty.
How about going with this?
# Extract text from tables
for i, figure in enumerate(soup.find("text").find_all("figure")):
    # Use XML Id (if available) as figure name to ensure figures are uniquely named
    name = figure.get("xml:id")
    name = name.upper() if name else f"FIGURE_{i}"

    # Search for table
    table = figure.find("table")
    if table:
        sections.extend([(name, x) for x in Table.extract(table)])
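To make the naming rule easy to check in isolation, here is a minimal sketch that pulls it into a standalone helper. The name `figure_name` and the sample ids are hypothetical, for illustration only; they are not part of the PR.

```python
from typing import Optional


def figure_name(xml_id: Optional[str], index: int) -> str:
    """Return the upper-cased xml:id when present, else an index-based fallback."""
    # A missing attribute (None) and an empty string both take the fallback.
    return xml_id.upper() if xml_id else f"FIGURE_{index}"


# Hypothetical values, as figure.get("xml:id") might return them
ids = ["fig_0", None, "tab_1", ""]
names = [figure_name(x, i) for i, x in enumerate(ids)]
print(names)  # ['FIG_0', 'FIGURE_1', 'TAB_1', 'FIGURE_3']
```

Keeping the id where it exists preserves the figure/table hint, while the fallback covers figures that have no id at all.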
Yes, sounds good. FYI, I did see a mixture of figures with and without ids (or rather, figure-like ids, e.g. fig_1, fig_2,
This is particularly true when loading papers that are parsed from PDFs, e.g. #46.
If this just needs to be unique within the scope of the document, then perhaps we can use an index instead.
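As a rough comparison, the sketch below names the same figures two ways: by id with an index fallback, and by index only. The function names and the sample ids are hypothetical, chosen to show one edge case where a figure-like id happens to match the fallback pattern. Whether such ids actually occur in parsed papers is an assumption.

```python
from typing import List, Optional


def names_with_fallback(xml_ids: List[Optional[str]]) -> List[str]:
    """Use the upper-cased xml:id when present, else FIGURE_<index>."""
    return [x.upper() if x else f"FIGURE_{i}" for i, x in enumerate(xml_ids)]


def names_by_index(xml_ids: List[Optional[str]]) -> List[str]:
    """Name every figure by position only, ignoring any xml:id."""
    return [f"FIGURE_{i}" for i, _ in enumerate(xml_ids)]


# Hypothetical ids: the first one looks like the fallback name for index 1
ids = ["figure_1", None, "fig_2"]
print(names_with_fallback(ids))  # ['FIGURE_1', 'FIGURE_1', 'FIG_2']
print(names_by_index(ids))       # ['FIGURE_0', 'FIGURE_1', 'FIGURE_2']
```

Index-only names are unique by construction within a document, at the cost of losing the figure/table hint that the id carries.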