kota7 / striprtf

R Package for Extracting Text from RTF (Rich Text Format) File

Hangs on embedded BMPs #16

Open strazto opened 4 years ago

strazto commented 4 years ago

I've noticed that this tool tends to hang when attempting to process an embedded BMP.

For the time being, it would be nice if it could simply remove images when the destination is a pict. Without parsing the image and somehow emitting it to a file, I don't really see the value in including it.
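As an illustration of the request above, here is a minimal C++ sketch that drops `{\pict ...}` groups from raw RTF text by counting brace depth. It is not striprtf's actual parser; the function names `skip_group` and `strip_pict_groups` are hypothetical. It assumes the picture data is hex-encoded (it does not handle raw `\bin` data, which may contain unescaped braces). Every loop iteration advances the index, so an unterminated group ends at the end of the input rather than hanging. The match is case sensitive, so a mangled `\pIct` is left alone.

```cpp
#include <cstddef>
#include <string>

// Returns the index one past the '}' that closes the group opened at `open`
// (rtf[open] must be '{'). If the group is never closed, returns rtf.size().
std::size_t skip_group(const std::string& rtf, std::size_t open) {
    std::size_t depth = 0;
    std::size_t i = open;
    while (i < rtf.size()) {
        const char c = rtf[i];
        if (c == '\\' && i + 1 < rtf.size()) {
            i += 2;  // backslash plus next char (e.g. \{ or \}): not a real brace
            continue;
        }
        if (c == '{') {
            ++depth;
        } else if (c == '}') {
            --depth;
            if (depth == 0) return i + 1;
        }
        ++i;
    }
    return rtf.size();  // unterminated group: stop at end of input
}

// Copies `rtf` to the result, omitting every group that starts with "{\pict".
std::string strip_pict_groups(const std::string& rtf) {
    const std::string marker = "{\\pict";
    std::string out;
    std::size_t i = 0;
    while (i < rtf.size()) {
        if (rtf[i] == '\\' && i + 1 < rtf.size()) {
            out.append(rtf, i, 2);  // keep escapes and control-word starts intact
            i += 2;
            continue;
        }
        if (rtf.compare(i, marker.size(), marker) == 0) {
            const std::size_t next = i + marker.size();
            // Reject longer control words that merely begin with "pict".
            const bool longer_word =
                next < rtf.size() && rtf[next] >= 'a' && rtf[next] <= 'z';
            if (!longer_word) {
                i = skip_group(rtf, i);
                continue;
            }
        }
        out.push_back(rtf[i]);
        ++i;
    }
    return out;
}
```

For example, `strip_pict_groups("{\\rtf1 a{\\pict\\wbitmap0 ffd8}b}")` yields `{\rtf1 ab}`, while input containing `{\pIct ...}` is returned unchanged.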

strazto commented 4 years ago

My mistake. For some reason, in this particular file, pict was written as pIct.

(Actually, for this particular entry, the whole cell seems to have had its i's changed to uppercase... bizarre.)

However, striprtf hung without any informative message until I resolved this.

kota7 commented 4 years ago

@mstr3336 Thanks for letting me know. If possible, can you post a file for which this library hangs? I would be happy to look into it.

strazto commented 4 years ago

So it's definitely a problem with the file itself (every single lowercase i was changed to uppercase).

I'd like to be able to share it with you, but unfortunately the file contains medical notes, so I'm unable to share it for ethics reasons.

kota7 commented 4 years ago

@mstr3336 Understood. I will take a look at this anyway, since hanging without any clue is not good behavior. Thanks for the info.

kota7 commented 4 years ago

@mstr3336 I tried creating an RTF file with a pasted BMP image. I also tried replacing (1) every 'i' with 'I' and (2) 'pict' with 'pIct'. In both cases, neither LibreOffice Writer nor Google Docs recognizes the image. Please find the files attached for reference: with-bmp.zip

Still, the read_rtf function works on these files and ignores the 'pict' part (on my computer). Given this, I think the hanging behavior may be due to some other cause. One possibility I can think of is that your document contains a very high-resolution image, which exceeds a size limit or causes an integer overflow during processing.

Since you cannot share your example, if you are willing to explore this issue, you can fork this repo and uncomment several debug lines in the source code to see where the process gets stuck. https://github.com/kota7/striprtf/blob/0663b59a25f8597baf5a87a2a593aeda8ea839c5/src/strip_helper.cpp#L211-L212

I think we should keep this issue open, and I would be more than happy to hear any further clues.