kota7 / striprtf

R Package for Extracting Text from RTF (Rich Text Format) File

Table entries merging #2

Closed micahjames closed 7 years ago

micahjames commented 7 years ago

Greetings!

I just found striprtf today, and it very nearly solves a problem I've been trying to solve on my own for the last several days. There is one thing getting in my way, though: I need to be able to extract data from tables in these rtf files. Right now, the output of read_rtf is simply the concatenation of all the cells of a given table, with no indication of where one entry ends and the next begins. For instance, if the following table were in an rtf document:

A B C
1.01 2.02 3.03

then the output of read_rtf would be:

[1]  "ABC1.012.023.03"  ""

However, if you added cell and row to the keys vector in .specialchars in global.R with corresponding values x0009 and x000A in the hexstr vector, and \t and \n in the str vector, then the same table would output as

  [1] "A\tB\tC" "1.01\t2.02\t3.03" ""
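To illustrate why the delimiters matter, here is a minimal base R sketch that turns output of that shape into a data frame. The vector `lines` is hard-coded from the example above rather than produced by read_rtf, and the helper name `parse_rows` is hypothetical, not part of striprtf.

```r
# Hard-coded stand-in for the proposed output (not produced by the package here).
lines <- c("A\tB\tC", "1.01\t2.02\t3.03", "")

# Hypothetical helper: split tab-delimited rows into a data frame,
# treating the first non-empty line as the header row.
parse_rows <- function(x) {
  x <- x[nzchar(x)]                                  # drop empty lines
  cells <- strsplit(x, "\t", fixed = TRUE)           # one character vector per row
  body <- do.call(rbind, cells[-1])                  # character matrix of data rows
  out <- as.data.frame(body, stringsAsFactors = FALSE)
  names(out) <- cells[[1]]
  out[] <- lapply(out, type.convert, as.is = TRUE)   # convert to numeric where possible
  out
}

tbl <- parse_rows(lines)
```

This assumes every row has the same number of cells; a table with merged cells would need extra handling.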

In case it is helpful, I'm attaching an rtf with this table.

Thanks so much for writing this package. I couldn't believe my luck today!

-Micah

table.rtf.zip

kota7 commented 7 years ago

Hi Micah, thank you for using the package, and for the helpful suggestion. I was not even aware of this aspect of the table format.

I will have a look and add this feature in a few days. I guess your suggestion would work well.

kota7 commented 7 years ago

Hi, an updated version (v.0.4.1) has been pushed to the master branch. I would like you to try it. Since it is not on CRAN yet, please install it with:

devtools::install_github("kota7/striprtf")

Let me know if you have any trouble with the installation.

This version handles tables specially, with user-defined row_start, row_end, and cell_end strings. For example, you can set cell_end="\t" to make tables tab-separated (with an extra tab at the end of each row). The row_start option helps identify where the tables are, which may be convenient for a large document or one with many tables. There is no need to set row_end="\n"; rows are separated automatically.
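As a rough sketch of how these options might be combined: the call below marks table rows with a row_start prefix and separates cells with tabs, then keeps only the marked lines and splits them into cells. The file name "table.rtf" (the unzipped attachment from this thread) and the marker string "*| " are assumptions, and the fallback vector is only a guess at the shape of the output, not captured output.

```r
marker <- "*| "  # assumed row_start string, chosen for illustration

if (requireNamespace("striprtf", quietly = TRUE) && file.exists("table.rtf")) {
  # Mark each table row with `marker` and end each cell with a tab.
  out <- striprtf::read_rtf("table.rtf", row_start = marker, cell_end = "\t")
} else {
  # Assumed stand-in for the output shape when the package or file is unavailable.
  out <- c("*| A\tB\tC\t", "*| 1.01\t2.02\t3.03\t", "")
}

is_row <- startsWith(out, marker)                  # lines that belong to a table
rows <- substring(out[is_row], nchar(marker) + 1)  # strip the row_start prefix
rows <- sub("\t$", "", rows)                       # drop the extra trailing tab
cells <- strsplit(rows, "\t", fixed = TRUE)        # list with one vector per table row
```

Picking a row_start string that is unlikely to occur in ordinary text keeps the filter from matching non-table lines.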

I hope you like this. Please let me know your questions, comments or complaints, if any.

micahjames commented 7 years ago

I was able to install the new version without problem and it works perfectly for my purposes. Thank you so much for making the changes!

kota7 commented 7 years ago

Not at all!

I will post a notice here when the new version is uploaded to CRAN, and then close this thread. It may take a few days since I need to fix some bugs related to other issues.

Thank you again for your suggestion.

kota7 commented 7 years ago

Now on CRAN (v.0.4.3)!