Open awtkns opened 11 months ago
I think we need this ASAP, because Google Vision is not working as expected for any complex website. I am working on this.
@shubhamofbce let me know if you need support!
bump; very interested in testing this library out using textract output
@plamb-viso happy to take a PR! It should be fairly straightforward as we have this somewhat abstracted.
We'd also really like to test out Azure OCR, as we've heard it's the most performant. (Will make a separate issue for this.)
Any luck, @shubhamofbce?
@asim-shrestha Sorry, I have no update. I looked into it a long time ago and it was straightforward, but I didn't get a chance to complete it and create a PR, and I no longer have that work with me.
No worries @shubhamofbce, did you still want to tackle this?
Sorry, but I will not be able to work on it due to time constraints. @asim-shrestha
I think I should be able to tackle this next week
Hey @Loeing, let me know if you need any support on this one.
@awtkns sorry this past week has been busier than anticipated. Have been playing around with Tarsier. Should be able to make some progress by the end of next week
@Loeing I'm super interested in the ability to integrate with Amazon Textract. Have you made any progress on this? Is there any chance I can be of some assistance?
Howdy! I pulled down the code and tried my hand at integrating with AWS Textract. I ran into a small problem: Textract only returns normalized geometry data (values between 0 and 1), which differs from GCP and Azure. This seems to cause an issue with this line of the format_text method, which checks spacing between annotations using 10 pixels as its baseline. Since the data is normalized, everything gets squished onto one line in the output. De-normalizing the data (multiplying the normalized values by the height/width of the image) fixed the issue and produced correct-looking output. The question I have is: would you rather I just de-normalize the Textract response data, or should the format_text function be updated to operate only on normalized values?
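For anyone following along, here is a minimal sketch of the de-normalization step described above. It assumes the usual Textract response shape (a "Blocks" list where each WORD block has a Geometry.BoundingBox with Left, Top, Width, and Height as ratios of the image size). The function name denormalize_textract_words and the output dict keys are hypothetical and are not Tarsier's actual annotation type, so treat this as an illustration rather than the final integration.

```python
from typing import Any, Dict, List


def denormalize_textract_words(
    response: Dict[str, Any], image_width: int, image_height: int
) -> List[Dict[str, Any]]:
    """Convert Textract WORD blocks from normalized (0-1) boxes to pixel boxes.

    The returned dict layout is hypothetical; adapt it to whatever
    annotation structure the formatting code expects.
    """
    words = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and LINE blocks; keep only single words.
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        words.append(
            {
                "text": block["Text"],
                # Horizontal values scale with width, vertical with height.
                "x": box["Left"] * image_width,
                "y": box["Top"] * image_height,
                "width": box["Width"] * image_width,
                "height": box["Height"] * image_height,
            }
        )
    return words


# Hand-written sample in the Textract response shape (not real service output).
sample_response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {
                "BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}
            },
        },
        {
            "BlockType": "WORD",
            "Text": "Hello",
            "Geometry": {
                "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.125, "Height": 0.0625}
            },
        },
        {
            "BlockType": "WORD",
            "Text": "World",
            "Geometry": {
                "BoundingBox": {"Left": 0.25, "Top": 0.75, "Width": 0.125, "Height": 0.0625}
            },
        },
    ]
}

pixel_words = denormalize_textract_words(
    sample_response, image_width=800, image_height=600
)
```

With the sample above, the two words sit 0.25 apart vertically in normalized units, which any pixel-based threshold such as 10 would treat as the same line. After scaling to a 600-pixel-high image, they are 150 pixels apart and land on separate lines.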
Does anyone have any published WIP branches available to look at? Thanks
Currently the only OCR service Tarsier supports is Google Vision OCR. It would be good to provide another OCR service that allows Textract to be used.