Closed (fdq09eca closed this 3 years ago)
Hi @fdq09eca, it appears that the character highlighted in blue in the screenshot below is causing the issue. It is positioned just a little bit to the left of those cut-off characters, enough to confuse the text-strategy algorithm.
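To illustrate why such a small horizontal offset can matter, here is a minimal sketch with made-up coordinates. It assumes, as a simplification, that a character is assigned to a table cell only when its horizontal midpoint falls inside that cell. The helper `chars_in_cell` and every number below are hypothetical, and this is not pdfplumber's actual code.

```python
# Toy model: keep a character only if its horizontal midpoint lies inside the cell.
# All coordinates are invented for illustration.

def chars_in_cell(chars, x0, x1):
    """Return the text of the characters whose midpoint is within [x0, x1)."""
    kept = [c for c in chars if x0 <= (c["x0"] + c["x1"]) / 2 < x1]
    return "".join(c["text"] for c in kept)

# "1,250" laid out left to right, ending at x = 525.
number = [
    {"text": "1", "x0": 500, "x1": 505},
    {"text": ",", "x0": 505, "x1": 510},
    {"text": "2", "x0": 510, "x1": 515},
    {"text": "5", "x0": 515, "x1": 520},
    {"text": "0", "x0": 520, "x1": 525},
]

# Right edge taken from the column's own characters: everything is kept.
full = chars_in_cell(number, 495, 525)

# Right edge pulled slightly left (say, by a stray character ending at x = 522):
# the midpoint of the final "0" (522.5) now falls outside the cell.
cut = chars_in_cell(number, 495, 522)

print(full)
print(cut)
```

Under this assumption, a right edge that lands only a few points too far left is enough to drop the last digit while the rest of the number survives.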
The text-strategy algorithm could certainly use some improvement. In the meantime, one approach would be to use a crop that is more focused on the table itself, excluding the problematic character. This code, modified from what you provided, seems to work (although I don't know if it achieves the ultimate goal you are seeking):
import re

import pandas as pd

# `page` is the pdfplumber page object from your original code.
df = pd.DataFrame(page.chars)

# Section headings: 12 pt characters in the medium-weight font, joined per line.
headings = (
    df
    .loc[lambda df: (
        df["fontname"].str.contains("HelveticaNeueLTStd-Md") &
        (df["size"] == 12)
    )]
    .groupby(["top", "bottom"])
    ["text"]
    .apply("".join)
    .reset_index()
)

# Subheadings: same font at 9 pt.
subheadings = (
    df
    .loc[lambda df: (
        df["fontname"].str.contains("HelveticaNeueLTStd-Md") &
        (df["size"] == 9)
    )]
    .groupby(["top", "bottom"])
    ["text"]
    .apply("".join)
    .reset_index()
)

# The crop starts at the "Services rendered" subheading ...
target_top = (
    subheadings
    .loc[lambda df: df["text"].str.contains(r"Services rendered")]
    ["top"]
    .values[0]
)

# ... and ends just above the next section heading.
target_bottom = (
    headings
    .loc[lambda df: df["top"] > target_top]
    ["top"]
    .values[0]
    - 1
)

cropped = page.crop((0, target_top, page.width, target_bottom))
print(cropped.extract_text())
print("====")

setting = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

# Print only the rows whose last cell looks like a value or a unit label.
for li in cropped.extract_table(setting):
    if re.match(r".*?\d+|^.{3,4}(\d{3})?$", li[-1]):
        print("".join(li[:-1]), "|", li[-1])
Output:
Services rendered Fee paid/payable 提供服務 已付╱應付費用
HK$’000 千港元
Audit services 1,250 核數服務 1,250
Non audit services 160 非核數服務 160
====
HK$’000 | 千港元
Audit services1,250核數服務 | 1,250
Non audit services160非核數服務 | 160
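For readers without pandas, the grouping and crop-bound logic in the code above can be sketched with the standard library alone. The character dictionaries here are invented stand-ins for `page.chars` (only the keys used are included), and `make_chars`, `group_lines` and `crop_bounds` are hypothetical helper names, not part of pdfplumber.

```python
from collections import defaultdict

def make_chars(text, top, size, fontname="ABCDEF+HelveticaNeueLTStd-Md"):
    """Build fake character dicts for one line of text (illustration only)."""
    return [
        {"text": ch, "top": top, "bottom": top + size, "size": size, "fontname": fontname}
        for ch in text
    ]

def group_lines(chars, fontname, size):
    """Join characters of one font and size into lines keyed by (top, bottom)."""
    lines = defaultdict(str)
    for c in chars:
        if fontname in c["fontname"] and c["size"] == size:
            lines[(c["top"], c["bottom"])] += c["text"]
    # Sort by position so the lines run from the top of the page downwards.
    return [{"top": t, "bottom": b, "text": s} for (t, b), s in sorted(lines.items())]

def crop_bounds(headings, subheadings, marker):
    """Top of the subheading containing `marker`, and 1 pt above the next heading."""
    top = next(s["top"] for s in subheadings if marker in s["text"])
    bottom = next(h["top"] for h in headings if h["top"] > top) - 1
    return top, bottom

chars = (
    make_chars("Auditor", 100, 12)
    + make_chars("Services rendered", 130, 9)
    + make_chars("Next section", 200, 12)
)
headings = group_lines(chars, "HelveticaNeueLTStd-Md", 12)
subheadings = group_lines(chars, "HelveticaNeueLTStd-Md", 9)
top, bottom = crop_bounds(headings, subheadings, "Services rendered")
print(top, bottom)  # 130 199
```

With a real page, the resulting pair would be passed on as `page.crop((0, top, page.width, bottom))`, as in the code above.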
@jsvine Thank you. Is it possible to extend the edge allowance to include that 0?
I don't quite understand what you mean by "edge allowance". Can you provide more details about the question?
Just checking back on this, @fdq09eca.
I have the following block of code, forgive me.
It gives me this.
As you can see, after extracting the table the trailing zero has disappeared (before -> after):
1,250 -> 1,25
160 -> 16
My question is: how do I get the trailing zero back? I tried debug_tablefinder, but I couldn't find the reason.