paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com

Parsing data from "KEY_VALUE_PAIR" Blocks #469

Open sai-sahitya opened 2 years ago

sai-sahitya commented 2 years ago

Thanks for this useful package and for taking the time to provide an example of extracting data from "TABLE" type blocks. I worked through it to understand how you parse the data out of an otherwise impenetrable, deeply nested list.

It would complete the example and be doubly helpful if you could also show how to extract data from the "KEY_VALUE_PAIR" block type. If that would take you a while, could you instead point us to resources that would help us do it ourselves?

I'd also like to know why you don't simply convert the "TABLE", "CELL", and "WORD" block types to a tibble first and then use the tidyr package's "unnest" or "hoist" functions. Were you looking for a generic function to extract tables from any type of document?

Thanks.

davidkretch commented 2 years ago

Hello, thank you! To understand how to parse TABLE, CELL, and WORD, I used the Textract developer guide. I haven't tried to use KEY_VALUE_SET, but its documentation is here: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html.

The writeup that example was made for is here: https://aws.amazon.com/blogs/opensource/using-r-with-amazon-web-services-for-document-analysis/ and that may help with some of the reasoning behind it.

With respect to unnest and hoist, I can't say since I'm not familiar with them, sorry. But yeah, the idea was to be able to make a table out of whatever Textract returned. I'm sure there are many ways of doing it.
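
For anyone landing here, the following is a minimal, untested-against-real-output sketch of how KEY_VALUE_SET blocks could be turned into a key/value table, based on the linkage described in the Textract documentation linked above: a KEY block points to its VALUE block(s) through a "VALUE" relationship, and both KEY and VALUE blocks point to their WORD blocks through "CHILD" relationships. The `blocks` list is a hand-made stand-in for `resp$Blocks` as returned by an `analyze_document` call with the FORMS feature enabled; the IDs, the text, and the helper names `related_ids` and `block_text` are all made up for illustration and are not part of paws.

```r
# Hypothetical stand-in for resp$Blocks; IDs and text are invented.
blocks <- list(
  list(Id = "k1", BlockType = "KEY_VALUE_SET", EntityTypes = list("KEY"),
       Relationships = list(
         list(Type = "VALUE", Ids = list("v1")),
         list(Type = "CHILD", Ids = list("w1", "w2"))
       )),
  list(Id = "v1", BlockType = "KEY_VALUE_SET", EntityTypes = list("VALUE"),
       Relationships = list(list(Type = "CHILD", Ids = list("w3")))),
  list(Id = "k2", BlockType = "KEY_VALUE_SET", EntityTypes = list("KEY"),
       Relationships = list(
         list(Type = "VALUE", Ids = list("v2")),
         list(Type = "CHILD", Ids = list("w4"))
       )),
  list(Id = "v2", BlockType = "KEY_VALUE_SET", EntityTypes = list("VALUE"),
       Relationships = list(list(Type = "CHILD", Ids = list("w5")))),
  list(Id = "w1", BlockType = "WORD", Text = "Full"),
  list(Id = "w2", BlockType = "WORD", Text = "Name:"),
  list(Id = "w3", BlockType = "WORD", Text = "Jane"),
  list(Id = "w4", BlockType = "WORD", Text = "City:"),
  list(Id = "w5", BlockType = "WORD", Text = "Seattle")
)

# Index blocks by Id so relationships can be followed by name.
by_id <- setNames(blocks, vapply(blocks, function(b) b$Id, character(1)))

# IDs a block points to through relationships of the given type.
related_ids <- function(block, type) {
  rels <- Filter(function(r) identical(r$Type, type), block$Relationships)
  unlist(lapply(rels, function(r) unlist(r$Ids)))
}

# Text of a KEY or VALUE block: its CHILD WORD blocks joined by spaces.
# Other child types (e.g. SELECTION_ELEMENT) are skipped in this sketch.
block_text <- function(block) {
  children <- by_id[related_ids(block, "CHILD")]
  words <- Filter(function(b) identical(b$BlockType, "WORD"), children)
  paste(vapply(words, function(b) b$Text, character(1)), collapse = " ")
}

# Keep only the KEY side of each KEY_VALUE_SET pair.
is_key <- function(b) {
  identical(b$BlockType, "KEY_VALUE_SET") && "KEY" %in% unlist(b$EntityTypes)
}
key_blocks <- Filter(is_key, blocks)

# One row per key: key text plus the text of the VALUE block(s) it links to.
pairs <- data.frame(
  key = vapply(key_blocks, block_text, character(1)),
  value = vapply(key_blocks, function(k) {
    values <- by_id[related_ids(k, "VALUE")]
    paste(vapply(values, block_text, character(1)), collapse = " ")
  }, character(1)),
  stringsAsFactors = FALSE
)

print(pairs)
```

With a real response, replacing the hand-made `blocks` with `resp$Blocks` should be the only change needed, assuming the response has the same nested-list shape as above. The same ID lookups could presumably be done with tidyr's `hoist` and `unnest` on a tibble of blocks, but this sketch sticks to base R to stay dependency-free.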