joshy / striprtf

Stripping rtf to plain old text
http://striprtf.dev
BSD 3-Clause "New" or "Revised" License

Not reading in a link(?) #10

Closed azade-k closed 2 years ago

azade-k commented 4 years ago

Hi,

I have roughly 2000 RTF files containing articles, and I noticed that striprtf does not read in the title, which is formatted in blue and underlined (like a link, although I cannot find an actual link behind it). If I manually set the title to "normal" formatting, it is read in. Do you have an idea where that might come from?

Thanks for your help!

joshy commented 4 years ago

Hi,

If you can provide a sample RTF file, I can check.

Best regards, Joshy


On 4 Jun 2020, at 13:47, azade-k notifications@github.com wrote:


azade-k commented 4 years ago

Hi,

Since they are news articles, copyright prevents me from attaching one of the files. I am trying to reproduce the problem in a test file without actual data.

Best Azade

azade-k commented 4 years ago

I created a sample file, example.zip, containing the RTF and the resulting TXT output.

Also, here is the code I use:

```python
# Script to convert RTF files to plain text.
# Imports (might need `pip install striprtf` first).
import glob

from striprtf.striprtf import rtf_to_text

# Replace the following line with the location of your .rtf files.
inputfiles = 'example/*.rtf'

# List all RTF files in the directory.
input_list = sorted(glob.glob(inputfiles))

# Load and convert each file, writing the result next to the input.
for file in input_list:
    with open(file) as rtf_file:
        rtfastext = rtf_to_text(rtf_file.read())
    output = file.replace('.rtf', '.txt')
    with open(output, "w") as text_file:
        print(rtfastext, file=text_file)
```

I checked again and there actually is a hyperlink in the text. The code works with files without hyperlinks but not with them.

Sorry for the terrible formatting. I did not quite figure out how to insert code correctly :/

joshy commented 4 years ago

Hi Azade,

it seems the links are completely removed as they are inside a block which is ignored. I need to check the effort to implement a solution.
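To picture what "inside a block which is ignored" means, here is a toy sketch. It is not striprtf's parser: the function name `drop_groups`, the choice of `field` as the ignored group, and the sample string are all assumptions made for illustration.

```python
def drop_groups(rtf, ignored=("field",)):
    """Toy stripper: remove every brace group that starts with an ignored control word."""
    out = []
    i = 0
    while i < len(rtf):
        starts_ignored = rtf[i] == "{" and any(
            rtf.startswith("\\" + name, i + 1) for name in ignored
        )
        if not starts_ignored:
            out.append(rtf[i])
            i += 1
            continue
        # Skip ahead to the brace that closes this group, tracking nesting depth.
        depth = 0
        while i < len(rtf):
            if rtf[i] == "{":
                depth += 1
            elif rtf[i] == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        i += 1  # step past the closing brace
    return "".join(out)


# Made-up sample: a hyperlink field followed by ordinary text.
sample = r'{\rtf1 {\field{\*\fldinst{HYPERLINK "http://striprtf.dev"}}{\fldrslt Title}} Body text}'
stripped = drop_groups(sample)
print(stripped)
```

The visible link text ("Title") sits in the `\fldrslt` group nested inside the dropped group, so in this sketch it disappears together with the link target.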

Best, Joshy

joshy commented 4 years ago

It is trickier than expected...

carlafdzzz commented 3 years ago

Hey, were you able to solve the hyperlink issue? It happens to me as well: the first line, which contains the link, is completely removed.

joshy commented 3 years ago

There is some progress on this task, see branch hyperlinks. The idea was to extract the link destination and the link text with a separate regex pattern. That seems to work for some examples but not for the one provided by @azade-k.

HaddadJoe commented 3 years ago

Hey @joshy, thanks for your work! I gave your branch a quick spin with this example. ~It worked fine~. The uploaded document is a small excerpt of a larger file containing the full document. Parsing works fine on the small file, but on the larger document the hyperlink detection fails.

Any ideas on how we can help?

HyperlinksTableTest.rtf.zip

joshy commented 3 years ago

hey @HaddadJoe thanks for checking it out. The main problem is to find a regex pattern for hyperlinks that works across all different examples. This is the regex pattern in the hyperlinks branch:

HYPERLINKS = re.compile(
    r"(\{\\field\{\n?\\\*\\fldinst\{.*HYPERLINK\s(\".*\")\}{2}\s?\{.*\s+(.*)\}{2})",
    re.IGNORECASE
)

If you have time and energy you can use some online regex tester (e.g. https://regex101.com, make sure you set flavor to python) and test the pattern with the rtf files and find out if it works. That could be one way to help.
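As an alternative to an online tester, the pattern can also be checked locally. The sketch below is only an illustration: the helper name `find_hyperlinks` and the sample RTF string are made up, and only the pattern itself comes from the hyperlinks branch.

```python
import re

# Pattern from the hyperlinks branch, as quoted above.
HYPERLINKS = re.compile(
    r"(\{\\field\{\n?\\\*\\fldinst\{.*HYPERLINK\s(\".*\")\}{2}\s?\{.*\s+(.*)\}{2})",
    re.IGNORECASE,
)


def find_hyperlinks(rtf):
    """Return (target, text) pairs for every hyperlink field the pattern matches."""
    return [(m.group(2).strip('"'), m.group(3)) for m in HYPERLINKS.finditer(rtf)]


# Made-up one-line sample; real files may nest formatting groups or wrap lines.
sample = r'{\field{\*\fldinst{HYPERLINK "http://striprtf.dev"}}{\fldrslt striprtf}}'
print(find_hyperlinks(sample))
```

Running the helper over each shared RTF file and looking for empty results is one way to find the inputs where the pattern breaks down. Since `.` does not cross line breaks unless `re.DOTALL` is set, fields that wrap across several lines are one case worth checking.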

Other ideas to tackle hyperlinks are also welcome.

Arzemn commented 2 years ago

I think I ran into the same issue. @joshy, would it be possible to just capture the text in the {\fldrslt Service:}} portion of the link? In my case I am not interested in the hyperlink itself, but the link text is part of the text body.
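A minimal sketch of that idea, assuming the field result holds plain text with no nested formatting groups. The pattern `FLDRSLT` and the helper `link_texts` are hypothetical and are not part of striprtf.

```python
import re

# Hypothetical pattern: captures the visible text of a \fldrslt group.
# Assumes the text directly follows the control word and contains no braces.
FLDRSLT = re.compile(r"\{\\fldrslt\s+([^{}]*)\}")


def link_texts(rtf):
    """Return the visible text of every simple field result found in rtf."""
    return [text.strip() for text in FLDRSLT.findall(rtf)]


# Made-up sample built around the fragment quoted above.
sample = r'{\field{\*\fldinst{HYPERLINK "http://striprtf.dev"}}{\fldrslt Service:}}'
print(link_texts(sample))
```

Field results that wrap the text in a further group (for example, for colour or underline) would not match this pattern and would need real brace tracking.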

joshy commented 2 years ago

@carlafdzzz @HaddadJoe @Arzemn If I have time I will look into just keeping the link description. Maybe that is easier.

joshy commented 2 years ago

Closed with https://github.com/joshy/striprtf/releases/tag/v0.0.21