uqqu / frequency_analysis

package for symbol/word and their bigrams frequency analysis

Parsing xml with greek letters #2

Open Turch99 opened 1 year ago

Turch99 commented 1 year ago

Hello! Thanks again for your script, it works great! I have a question that is more about Python in general, so I will be grateful for any answer if you find the time. I want to parse an XML file in Greek. Do I understand correctly that all I have to change is the word_pattern argument and the allowed_symbols argument? Should the first be a regex, and the second the decimal values of the characters? I experimented with different regexes, such as /^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$/ and [\u0370-\u03ff\u1f00-\u1fff]. The script runs without errors, but it doesn't output any data. Am I making a mistake in the regex, or is there something in the script that I am not setting up?

My script looks like this:

import io
from datetime import datetime
from os import listdir
from bs4 import BeautifulSoup

import frequency_analysis

start = datetime.now()
file_list = listdir('444/')
# This is one of the regex options; I tried different ones.
word_pattern = '^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$'
# I also tried different options here.
allowed_symbols = [*range(913, 1000)]
with frequency_analysis.Analysis(
    word_pattern=word_pattern, allowed_symbols=allowed_symbols
) as analyze:
    for n, file in enumerate(file_list):
        with io.open('444/' + file, mode='r', encoding='utf-8') as f:
            data = f.read()
        bs_data = BeautifulSoup(data, 'xml')

        for sentence in bs_data.find_all('s'):
            analyze.count_all(sentence.text.split(), pos=True)
        print(n, file)
print('fin at:', datetime.now().strftime('%H:%M:%S'))
print('total time:', datetime.now() - start)

uqqu commented 1 year ago

Hello. Thank you for the feedback.

Yes, word_pattern and allowed_symbols are the only arguments that need to be changed to process text outside basic Latin and basic Cyrillic.
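To illustrate the allowed_symbols side, here is a minimal sketch that checks which Greek letters fall inside the range used in the script above. It assumes, as that script does, that allowed_symbols is a list of integer code points. The `covered` helper and the added Greek Extended range are hypothetical, for illustration only, and not part of frequency_analysis.

```python
import unicodedata

# Range taken from the script in this thread: U+0391 .. U+03E7.
allowed_symbols = [*range(913, 1000)]

# Hypothetical extension: the Greek Extended block (polytonic letters),
# U+1F00 .. U+1FFF. Some code points in this block are unassigned.
extended_symbols = [*allowed_symbols, *range(0x1F00, 0x2000)]


def covered(ch, symbols):
    """Return True if the character's code point is in the list."""
    return ord(ch) in symbols


# alpha, alpha with tonos, alpha with psili (polytonic)
for ch in ['\u03b1', '\u03ac', '\u1f00']:
    print(
        hex(ord(ch)),
        unicodedata.name(ch),
        covered(ch, allowed_symbols),
        covered(ch, extended_symbols),
    )
```

Under these assumptions, monotonic letters such as U+03AC are inside range(913, 1000), while polytonic letters such as U+1F00 are only covered once the second range is added.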

The ^ and $ anchors are unnecessary in this regex pattern and may affect the processing somewhat, but not critically. I'm a little confused, because everything else looks like it should work.
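The effect of the anchors can be seen with Python's re module alone. The sketch below uses the second character class quoted in this thread with a + quantifier added. How frequency_analysis applies word_pattern internally is not shown here, so the `extract` helper is a hypothetical stand-in, not the library's behaviour.

```python
import re

# Character class from the thread (Greek and Coptic + Greek Extended),
# with a '+' quantifier added for this illustration.
GREEK = r'[\u0370-\u03ff\u1f00-\u1fff]+'

anchored = re.compile('^' + GREEK + '$')
unanchored = re.compile(GREEK)


def extract(pattern, token):
    """Hypothetical helper: return the first match in token, or None."""
    match = pattern.search(token)
    return match.group(0) if match else None


tokens = [
    '\u03bb\u03cc\u03b3\u03bf\u03c2',      # a bare Greek word
    '\u03bb\u03cc\u03b3\u03bf\u03c2,',     # the same word with a trailing comma
    '(\u1f00\u03c1\u03c7\u03ae)',          # a polytonic word in parentheses
]

for token in tokens:
    print(repr(token), extract(anchored, token), extract(unanchored, token))
```

With the anchors, a token matches only if every character belongs to the class, so a word with attached punctuation is rejected outright. Without them, the Greek run inside the token is still found.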

You don't have any output? Blank values in the symbols table and an empty words table in result.db?

What do your sentences look like? Give some examples.

Turch99 commented 1 year ago

I apologize for the late reply; I caught a bit of a cold. I didn't describe things correctly in my previous post. In the result.xlsx file, the script outputs data in the symbols column, and this data corresponds to the character range in the allowed_symbols variable. All other columns of result.xlsx are empty. I tried different XML files, including a Latin file from https://digiliblt.uniupo.it/, that is, a file that should work with the standard regex you give in the example. But result.xlsx is still completely empty except for the symbols column. The result.db file (I opened it with DB Browser) contains only the structure, with no data.

However, when I used German XML files from https://www.euromatrixplus.net/multi-un/ and left the code from your example unchanged, I got a result.db with both structure and data. The result.xlsx file did not contain any charts, but the Symbols, Symbol bigrams, Words, and Word bigrams columns were filled.

I was also getting this error:

Traceback (most recent call last):
   File "C:\xxx \analysis_result.py", line 5, in <module>
     res.treat(limits=(1000,) * 4, chart_limits=(20,) * 4, min_quantities=(10,) * 5)
   File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 205, in treat
     self.sheet_top_symbols(limits[0], chart_limits[0], min_quantities[0])
   File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 262, in sheet_top_symbols
     self.__fill_top_data(
   File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 76, in __fill_top_data
     values[1][s][3] = (
                       ^
ZeroDivisionError: float division by zero

There are no problems with English-language XML files. It's possible that I just don't understand Python well enough, so I'm sorry.

uqqu commented 1 year ago

Thanks again. The error in your traceback can occur if the cased characters from allowed_symbols don't appear in the text in either case (upper or lower). I fixed it; you can update your local package now.
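As a generic illustration of this failure mode, a ratio computed from the counts of a cased character divides by zero when both counts are zero. The function below is a hypothetical sketch of such a guard. It is not the code from results.py, and the actual fix in the package may look different.

```python
def upper_share(upper_count, lower_count):
    """Share of uppercase occurrences among all occurrences of a letter.

    Hypothetical helper: returns 0.0 when the letter never occurs,
    instead of raising ZeroDivisionError.
    """
    total = upper_count + lower_count
    if total == 0:
        return 0.0
    return upper_count / total


print(upper_share(3, 9))   # letter present in both cases
print(upper_share(0, 0))   # letter absent from the text entirely
```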

But the problem from the first part of your message is not related to this error. I tried processing the .txt files from https://digiliblt.uniupo.it, and it worked for me (the sentences are a bit broken because of readlines(), but the data is complete). The complete lack of data in your case leaves me stumped. Could there be a problem with how the sentences are parsed?

Turch99 commented 1 year ago

Thanks for your reply. Apparently, the problem was with the XML file. I tried processing .xml files from various large thematic libraries such as Perseus and digilibLT and constantly got an empty result file. When I got tired of that, I inserted a piece of Greek text into the markup of an XML file from euromatrixplus.net/multi-un/, the site you use in the example. And it worked! I'll try other sites and maybe figure out what the problem is with the XML I used, and then I will write about it here. But the main thing is that I was able to run an analysis and get results! Thanks again for the helpful answers and tips!
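One possible explanation, not confirmed in this thread, is that the script above only reads s elements, so a file that marks up its text with other elements yields no sentences at all. A quick way to check is to list which tags a file actually contains. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text):
    """Count element names in an XML string, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # ElementTree writes namespaced tags as '{uri}name'; keep the name.
        counts[element.tag.rsplit('}', 1)[-1]] += 1
    return counts


# Made-up, TEI-like sample: the text sits in <p> elements, with no <s>.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><body><p>first paragraph</p><p>second paragraph</p></body></text>'
    '</TEI>'
)

counts = tag_counts(sample)
print(counts.most_common())
print('s elements:', counts['s'])
```

If the count for s is zero, the loop over find_all('s') never runs, and the tag name passed to it would need to match the markup of that particular file.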

uqqu commented 1 year ago

Good, glad the problem was localized. Then I will leave this issue open. If you want, you can add your analysis to the examples afterwards.