Parse RTF attachments/bodies

koodaamo / tnefparse

a TNEF decoding library written in python, without external dependencies

GNU Lesser General Public License v3.0


Parse RTF attachments/bodies #26

Closed jrideout closed 5 years ago

jrideout commented 5 years ago

[x] Extract rtfbody (#30)
[x] Add CLI support (#39)
[ ] ~~Parse the rtf in some way, perhaps via an optional dependency.~~

petri commented 5 years ago

What does this mean exactly? Same as tnefparse --htmlbody but for RTF bodies? I seem to remember from a long time ago that RTF is indeed embedded/wrapped in some funky way in TNEF...

jrideout commented 5 years ago

Exactly, we'll want to support tnefparse --rtfbody

jrideout commented 5 years ago

The one thing I'm not certain about is whether we need to decompress the RTF, or whether the RTF data is valid even when compressed. https://github.com/delimitry/compressed_rtf seems to do what we need: it decompresses the data without fully parsing it.

petri commented 5 years ago

Nice! That's a small dependency well worth it I'd think.
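
On the question of whether the RTF needs decompressing: as far as I know, the compressed RTF container (described in Microsoft's MS-OXRTFCP document) starts with a 16-byte little-endian header whose third field says whether the payload is LZ-compressed (`LZFu`) or stored as-is (`MELA`). The sketch below only inspects that header; it does not implement decompression, and the function names are made up for illustration, not part of tnefparse or compressed_rtf.

```python
import struct

# Magic values of the COMPTYPE header field, as they appear on the wire.
COMPRESSED = b"LZFu"    # payload is LZ-compressed and needs a decompressor
UNCOMPRESSED = b"MELA"  # payload is plain RTF stored after the header

HEADER_SIZE = 16


def inspect_rtf_header(data: bytes) -> dict:
    """Read the 16-byte header: COMPSIZE, RAWSIZE, COMPTYPE, CRC."""
    if len(data) < HEADER_SIZE:
        raise ValueError("data is shorter than the 16-byte header")
    comp_size, raw_size = struct.unpack_from("<II", data, 0)
    comp_type = data[8:12]
    (crc,) = struct.unpack_from("<I", data, 12)
    return {
        "comp_size": comp_size,
        "raw_size": raw_size,
        "comp_type": comp_type,
        "crc": crc,
    }


def rtf_if_uncompressed(data: bytes):
    """Return the raw RTF bytes if stored uncompressed, else None."""
    info = inspect_rtf_header(data)
    if info["comp_type"] == UNCOMPRESSED:
        return data[HEADER_SIZE:HEADER_SIZE + info["raw_size"]]
    return None  # LZFu (or unknown): hand the data to a real decompressor
```

If `rtf_if_uncompressed` returns `None`, the body is not usable as RTF until it has been run through a decompressor such as the compressed_rtf package linked above.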

jrideout commented 5 years ago

This does what I want for RTF parsing: https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676

I'd rather not add a dependency for a full rtf parser. Should we include this file in our source, or just leave it outside the scope of the project?

petri commented 5 years ago

Hm. I consider document format conversions to be outside the scope, but some limited use cases might fall on the borderline. If I may ask, what's the goal - just support extraction of plaintext words for indexing, or something else?

jrideout commented 5 years ago

> what's the goal - just support extraction of plaintext words for indexing

just that

> I consider document format conversions to be outside the scope

I agree. Let's stop here. Users can do their own RTF parsing if desired.

petri commented 5 years ago

I gave this some more thought. Conversions of tnef body content in general are out of scope of tnefparse.

But I am pretty sure I remember the RTF/HTML bodies are to some extent specific to the MS TNEF implementations, with some quirks and deviations. That makes me think extraction of plaintext is something that's within the scope here.

So that can be revisited.
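
For anyone who wants plaintext words for indexing in the meantime, here is a rough sketch of what user-side stripping of already-decompressed RTF could look like. It is not the linked gist and not anything tnefparse ships; `rtf_to_text` and the list of skipped destinations are assumptions chosen for illustration. It assumes cp1252 for `\'hh` escapes and does not handle `\uN` Unicode escapes or the Outlook-specific HTML-in-RTF wrapping mentioned above.

```python
import re

# One token per match: control word, \'hh escape, control symbol,
# brace, line break (ignored in RTF), or a single literal character.
TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)",
    re.I | re.S,
)

# Groups whose content is metadata rather than body text (not exhaustive).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}


def rtf_to_text(rtf: str) -> str:
    out = []
    stack = []        # saved "ignoring" state of each enclosing group
    ignoring = False  # True while inside a skipped destination
    for m in TOKEN.finditer(rtf):
        word, _arg, hexcode, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            if stack:
                ignoring = stack.pop()
        elif symbol is not None:
            if symbol == "*":          # \* marks an optional destination
                ignoring = True
            elif not ignoring and symbol in "\\{}":
                out.append(symbol)     # escaped literal character
        elif word:
            if word in SKIP_DESTINATIONS:
                ignoring = True
            elif not ignoring and word in ("par", "line"):
                out.append("\n")
            elif not ignoring and word == "tab":
                out.append("\t")
        elif hexcode:
            if not ignoring:
                out.append(bytes.fromhex(hexcode).decode("cp1252", "replace"))
        elif char is not None and not ignoring:
            out.append(char)
    return "".join(out)
```

The result is good enough to feed a word indexer, but it is nowhere near a document conversion, which matches the scope line drawn in this thread.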