vamseeachanta / assetutilities

Utilities for performing day-to-day tasks. Helps automate day-to-day business tasks.
MIT License

word | Search for LNGC #7

Open vamseeachanta opened 4 months ago

vamseeachanta commented 4 months ago

See the file in the following location: assetutilities\src\assetutilities\tests\test_data\word_utilities

Search for a text string, e.g. "LNGC".

The output should list 3 occurrences:
1: page 1, Section?
2: page 2, Section?
3: page 3, Section?

vamseeachanta commented 4 months ago

For guidance: https://stackoverflow.com/questions/56374018/how-to-search-and-replace-a-word-text-in-word-document-using-python-docx

JayachandraJangiti commented 3 months ago

Program for searching for a word in a Word document.

It prints the number of paragraphs that contain the given word and, for each of those paragraphs, the text in which the word was found.

Code:

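from docx import Document  # provided by the python-docx package


def search_word_in_docx(docx_file, search_word):
    """Print the paragraphs of a .docx file in which search_word is found."""
    doc = Document(docx_file)
    occurrences = []
    # Paragraph numbers are 1-based and count every paragraph in the document body.
    for i, paragraph in enumerate(doc.paragraphs, start=1):
        # Each run is checked on its own, so a word split across two runs is not found.
        for run in paragraph.runs:
            if search_word in run.text:
                occurrences.append((i, run.text))
                break  # Stop searching in this paragraph after finding the word

    print(f"Total number of occurrences: {len(occurrences)}")
    print()
    for i, occurrence in enumerate(occurrences, start=1):
        print(f"{i}) Word '{search_word}' found in paragraph {occurrence[0]}: '{occurrence[1]}'")
        print("\n\n")


# Sample input (raw string, so the backslashes in the Windows path are not treated as escapes)
docx_file = r"D:\Internship\V-Project\Search\myfile.docx"
search_word = "threat"

search_word_in_docx(docx_file, search_word)

Its output:

Total number of occurrences: 7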

1) Word 'threat' found in paragraph 6: 'Phishing attacks continue to pose a significant threat to the Internet economy, resulting in billions of dollars in losses each year. These attacks, most commonly carried out through email, '

2) Word 'threat' found in paragraph 44: 'This study tells the significance of addressing phishing email detection using Natural Language Processing (NLP) techniques, considering the substantial financial losses and threats posed by phishing attacks. While previous review studies have been conducted, none have comprehensively explored NLP techniques for phishing detection, except for one that focused on classification and training. To fill this gap, this systematic review analyzes 100 research articles published between 2006 and 2022.'

3) Word 'threat' found in paragraph 58: 'Phishing activity poses a significant threat to computer networks and financial systems, allowing hackers to compromise data and exploit resources for cybercrime. With cybercrimes projected to cost the world an estimated $6 trillion by 2021, phishing continues to be a growing challenge. The exponential growth of phishing incidents over the past decade supports this trend.'

4) Word 'threat' found in paragraph 88: 'Phishing activity is a major concern for hackers, as it allows them to compromise computer networks and gain access to valuable data and processing resources. The projected cost of cybercrimes, including phishing, is estimated to reach $6 trillion by 2021, indicating the significant financial impact of these activities. Phishing numbers have witnessed exponential growth over the last decade, suggesting an increasing challenge in combating this threat. Recent reports emphasize the evolving complexity of phishing, especially related to phishing URLs, making it difficult to build a resilient cyberspace. Compounding the issue is the shortage of cyber security expertise to handle the expected rise in incidents. Previous research has proposed various methods, such as neural networks, data mining, heuristics, and machine learning, to detect phishing websites. However, phishers have adopted more sophisticated techniques like VoIP phishing and spear phishing, which traditional detection methods struggle to accurately identify. Thus, there is a pressing need for modern tools and techniques to serve as countermeasures against these advanced phishing attacks. It is essential to develop state-of-the-art anti-phishing tools that can predict and prevent phishing attacks before they occur. Such tools would enable users to comprehend the risks associated with their personal and financial data.'

5) Word 'threat' found in paragraph 194: 'Phishing poses a significant security threat with detrimental effects on both individuals and targeted brands. Despite its existence for a considerable period, phishing attacks remain highly active and successful. Attackers continuously evolve their tactics to enhance the convincing and effectiveness of their attacks. In this context, the detection of phishing attacks becomes of paramount importance. The literature offers a wide range of solutions, particularly in the domain of phishing website detection.'

6) Word 'threat' found in paragraph 212: 'Phishing is a serious security threat that has been around for a long time and is still very active.'

7) Word 'threat' found in paragraph 220: 'Phishing is a serious security threat that organizations face. In order to better understand the effectiveness of phishing prevention measures, we conducted a large-scale and long-term phishing experiment. The experiment ran for 15 months, and involved sending simulated phishing emails to more than 14,000 employees of a large company.'

Above are the paragraphs of the Word document in which an occurrence of the given word was found.

If we want only the paragraph number, change

print(f"{i}) Word '{search_word}' found in paragraph {occurrence[0]}: '{occurrence[1]}'")

to

print(f"{i}) Word '{search_word}' found in paragraph {occurrence[0]}")
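
A possible variant, sketched here as a suggestion rather than as part of the program above: search paragraph.text (which joins the text of all runs in the paragraph) instead of individual runs, so a word that Word has split across several runs is still found, and return the results instead of printing them so the caller decides what to show. The names find_occurrences and search_docx are made up for this sketch.

```python
def find_occurrences(paragraph_texts, search_word):
    """Return (paragraph_number, count, text) for every paragraph containing search_word.

    paragraph_texts is any sequence of strings, one per paragraph.
    Paragraph numbers are 1-based. The match is a plain, case-sensitive substring test.
    """
    results = []
    for number, text in enumerate(paragraph_texts, start=1):
        count = text.count(search_word)  # non-overlapping matches in this paragraph
        if count:
            results.append((number, count, text))
    return results


def search_docx(docx_file, search_word):
    """Hypothetical wrapper: load a .docx with python-docx and search its paragraphs."""
    # Imported here so find_occurrences can be used and tested without python-docx.
    from docx import Document

    doc = Document(docx_file)
    return find_occurrences([p.text for p in doc.paragraphs], search_word)
```

Printing only the paragraph numbers is then a matter of looping over the returned tuples and using the first element, and summing the second elements gives the total number of occurrences rather than the number of matching paragraphs.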

This code does not find the page number on which each occurrence appears, so it needs further improvement.
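
One possible direction for the page and section information requested in the issue, offered as an untested sketch under several assumptions. A .docx file does not store fixed page numbers. Word may leave w:lastRenderedPageBreak markers from the last time it laid the document out, and counting them gives only an estimate, which will be wrong if the file was never saved by Word or has been edited since. The sketch also assumes that sections are marked with the built-in "Heading ..." paragraph styles. The names locate_matches and read_docx_rows are hypothetical, and read_docx_rows relies on the private _element attribute of python-docx paragraphs.

```python
def locate_matches(rows, search_word):
    """Estimate page and section for each paragraph containing search_word.

    rows is a sequence of (style_name, text, page_breaks) tuples, one per paragraph,
    where page_breaks is the number of page-break markers found in that paragraph.
    The reported page is the page on which the paragraph ends.
    """
    page = 1
    section = None  # text of the most recent heading paragraph, if any
    matches = []
    for number, (style_name, text, page_breaks) in enumerate(rows, start=1):
        page += page_breaks
        if style_name.startswith("Heading"):
            section = text
        if search_word in text:
            matches.append({"paragraph": number, "page": page, "section": section})
    return matches


def read_docx_rows(docx_file):
    """Hypothetical loader: build the rows for locate_matches from a .docx file."""
    # Imported here so locate_matches can be used and tested without python-docx.
    from docx import Document

    rows = []
    for paragraph in Document(docx_file).paragraphs:
        # Assumption: rendered page breaks appear as <w:lastRenderedPageBreak/> in the XML.
        xml = paragraph._element.xml
        rows.append((paragraph.style.name, paragraph.text, xml.count("<w:lastRenderedPageBreak")))
    return rows
```

Calling locate_matches(read_docx_rows(path), "LNGC") on the test document would then give one entry per matching paragraph, which could be printed in the "page, section" form asked for at the top of this issue. Whether the estimated pages match what Word displays would need to be checked against the actual file.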