AmrapalliKaran commented 4 years ago

I need to pull data from a PDF published at a URL. The PDF is a scanned image (effectively a .png) rather than text, so I ran it through the tesseract package, but a few of the lines were not recognized.

The code:

library(rvest)
library(dplyr)
library(stringr)   # provides str_replace()
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"

links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and extract its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>%
  html_attr("href") %>%
  # strip the relative-path prefix (matched literally, not as a regex)
  str_replace(fixed("../.."), "")
str(links)

# build the absolute URL of the PDF
base_url <- "https://www.hindustancopper.com"
event_url <- paste0(base_url, links)
event_url

# the link is a scanned copy rather than a text PDF, so render page 1 to PNG and run OCR with tesseract
pdf_convert(event_url, pages = 1, dpi = 850, filenames = "page1.png")
text <- ocr("page1.png")
cat(text)

The actual output lists the products and their prices as:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
...etc.

The expected output is:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
...etc.

I have tried changing the value of the dpi argument several times, but that did not help much. Is there another argument to these functions that I might be missing? Thanks in advance!

jeroen commented 4 years ago

Which OS do you have? What does tesseract::tesseract_info() return?

In the example I tried, the image was a bit skewed. You could improve results by rotating it:

library(magick)
url <- 'https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf'
image_read_pdf(url) %>%
  image_rotate(3) %>%
  image_ocr() %>%
  cat()

The docs have some more ideas on how to preprocess the images to improve the OCR performance:

https://docs.ropensci.org/tesseract/articles/intro.html#preprocessing-with-magick
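
As a rough sketch of what such preprocessing might look like, the helper below chains a few magick operations before OCR. The function name `preprocess_for_ocr` is hypothetical, and the particular steps and values (grayscale conversion, a deskew threshold of 40, resizing to 2000 pixels wide) are illustrative assumptions rather than settings verified against this scan. Note that `image_deskew()` is used here in place of the manual `image_rotate(3)` shown earlier, so the result may differ.

```r
library(magick)

# Hypothetical helper: the steps and values below are illustrative assumptions,
# not settings tuned for this particular document.
preprocess_for_ocr <- function(img) {
  img %>%
    image_convert(type = "Grayscale") %>%  # drop colour information
    image_deskew(threshold = 40) %>%       # try to straighten a slightly rotated scan
    image_resize("2000x")                  # scale the image to 2000 pixels wide
}

# Possible usage (needs the tesseract package and network access):
# image_read_pdf(url) %>% preprocess_for_ocr() %>% image_ocr() %>% cat()
```

Which steps help depends on the scan, so it is probably worth adding or removing them one at a time and comparing the OCR output.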

AmrapalliKaran commented 4 years ago

OS: Windows 10 Pro

tesseract::tesseract_info()

$datapath
[1] "C:\Users\xyz\AppData\Local\tesseract4\tesseract4\tessdata/"

$available
[1] "eng" "osd"

$version
[1] "4.1.0"

$configs
 [1] "alto" "ambigs.train" "api_config" "bigram" "box.train" "box.train.stderr"
 [7] "digits" "get.images" "hocr" "inter" "kannada" "linebox"
[13] "logfile" "lstm.train" "lstmbox" "lstmdebug" "makebox" "pdf"
[19] "quiet" "rebox" "strokewidth" "tsv" "txt" "unlv"
[25] "wordstrbox"

AmrapalliKaran commented 4 years ago

Thanks, that solved most of the issue, but a stray '|' or '[' still appears in front of 'CATHODEFULL'. How can I get rid of it?
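
One possible approach, sketched here under assumptions, is to clean the text after OCR rather than change the OCR itself. The helper `strip_leading_artifacts` is hypothetical (it is not part of tesseract or magick) and assumes the stray characters only ever appear at the start of a line, as described above. The sample string is illustrative input, not real OCR output.

```r
# Hypothetical helper (not part of any package): remove stray '|' or '['
# characters, and the spaces or tabs around them, from the start of each line.
strip_leading_artifacts <- function(text) {
  # (?m) makes ^ match at the start of every line, not only the start of the string
  gsub("(?m)^[ \\t]*[|\\[]+[ \\t]*", "", text, perl = TRUE)
}

# Illustrative input only; the real text would come from ocr() or image_ocr().
sample_text <- "CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567\n|CATHODEFULL 434122"
cat(strip_leading_artifacts(sample_text))
```

This only removes the artifacts at line starts. A '|' or '[' that legitimately belongs at the start of a line would be removed as well, so the pattern may need narrowing for other documents.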

ropensci / tesseract

The text is not recognized from png #48
