Closed azade-k closed 2 years ago
Hi,
If you can provide a sample RTF file, I can check.
Best regards, Joshy
On 4 Jun 2020, at 13:47, azade-k notifications@github.com wrote:
Hi,
I have roughly 2000 rtf files containing articles and I realized that it does not read in the title which is formatted in blue and underlined (like a link but I cannot find an actual link behind it). If I manually set the title to "normal" formatting, it does read it in though. Do you have an idea where that might come from?
Thanks for your help!
Hi,
Since they are news articles, copyright restricts me from attaching one of the files. I am trying to replicate the problem in a test file without actual data.
Best Azade
I created a sample file, example.zip, containing the RTF and the TXT that it produces.
Also, here is the code I use:
```python
# Script to transform RTF files into TXT files
import glob

from striprtf.striprtf import rtf_to_text

inputfiles = 'example/*.rtf'
input_list = sorted(glob.glob(inputfiles))

for file in input_list:
    # Read the raw RTF source and close the file afterwards
    with open(file) as rtf_file:
        f = rtf_file.read()
    rtfastext = rtf_to_text(f)
    # Write the extracted text next to the input file
    output = file.replace('.rtf', '.txt')
    with open(output, "w") as text_file:
        print(rtfastext, file=text_file)
```
I checked again and there actually is a hyperlink in the text. The code works with files without hyperlinks but not with them.
Sorry for the terrible formatting. I did not quite figure out how to insert code correctly :/
Hi Azade,
It seems the links are completely removed because they sit inside a block that is ignored. I need to check how much effort a solution would take.
Best, Joshy
It is trickier than expected...
Hey, were you able to solve the hyperlink issue? It happens to me as well: the first line, which contains the link, is completely removed.
There is some progress on this task, see the branch hyperlinks. The idea was to extract the link destination and the link text with a separate regex pattern. That seems to work for some examples but not for the one provided by @azade-k.
Hey @joshy, thanks for your work! I gave your branch a quick spin with this example. ~It worked fine.~ The uploaded document is a small snapshot of a larger file containing the full document. Parsing the small file works fine, but on the larger document the hyperlink detection fails.
Any ideas on how we can help?
Hey @HaddadJoe, thanks for checking it out. The main problem is finding a regex pattern for hyperlinks that works across all the different examples. This is the regex pattern in the hyperlinks branch:
HYPERLINKS = re.compile(
    r"(\{\\field\{\n?\\\*\\fldinst\{.*HYPERLINK\s(\".*\")\}{2}\s?\{.*\s+(.*)\}{2})",
    re.IGNORECASE,
)
If you have time and energy, you can test the pattern against your RTF files in an online regex tester (e.g. https://regex101.com, with the flavor set to Python) and find out whether it works. That could be one way to help.
Other ideas to tackle hyperlinks are also welcome.
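For anyone who prefers testing locally instead of in an online tester, here is a minimal sketch of how the pattern above could be tried out. The helper name `find_hyperlinks` and the sample field are made up for illustration; they are not part of striprtf, and the sample is hand-written rather than taken from a real document.

```python
import re

# Pattern copied from the hyperlinks branch as quoted in this thread.
HYPERLINKS = re.compile(
    r"(\{\\field\{\n?\\\*\\fldinst\{.*HYPERLINK\s(\".*\")\}{2}\s?\{.*\s+(.*)\}{2})",
    re.IGNORECASE,
)


def find_hyperlinks(rtf):
    """Hypothetical helper: return (destination, text) pairs matched by HYPERLINKS."""
    # Group 2 is the quoted link destination, group 3 the visible link text.
    return [(m.group(2), m.group(3)) for m in HYPERLINKS.finditer(rtf)]


# Hand-written sample field (an assumption, not from a real file).
sample = r'{\field{\*\fldinst{HYPERLINK "http://example.com"}}{\fldrslt Service:}}'
print(find_hyperlinks(sample))
```

To check real documents, read each file's contents and pass them to `find_hyperlinks`. An empty list for a file that is known to contain a link points to a variant the pattern does not cover yet.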
I think I ran into the same issue. @joshy, would it be possible to just capture the text in the {\fldrslt Service:}} portion of the link field? In my case I am not interested in the hyperlink itself, but the link text is part of the text body.
@carlafdzzz @HaddadJoe @Arzemn If I have time I will look into just keeping the link description. Maybe that is easier.
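The idea of keeping only the link description could be sketched roughly as below. This is not how striprtf does or will do it. `FLDRSLT` and `link_texts` are hypothetical names, and the sketch assumes a simple `{\fldrslt ...}` group with no nested groups inside, which real files often do have (for example, formatting for colour or underline).

```python
import re

# Hypothetical pattern, not part of striprtf: matches a simple, non-nested
# {\fldrslt ...} group and captures whatever text it contains.
FLDRSLT = re.compile(r"\{\\fldrslt\s*([^{}]*)\}")


def link_texts(rtf):
    """Hypothetical helper: return the visible text of each simple fldrslt group."""
    return [m.group(1).strip() for m in FLDRSLT.finditer(rtf)]


# Hand-written sample field (an assumption, not from a real file).
sample = r'{\field{\*\fldinst{HYPERLINK "http://example.com"}}{\fldrslt Service:}}'
print(link_texts(sample))
```

Because the character class excludes braces, a result group that wraps its text in another group is skipped or only partly captured. Handling that case would need brace counting rather than a single regex.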