Open Turch99 opened 1 year ago
Hello. Thank you for the feedback.
Yes, word_pattern and allowed_symbols are the only ones that need to be changed to process text outside basic Latin / basic Cyrillic.
^ and $ are unnecessary in this regex pattern and may somewhat affect the processing, but not critically.
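To illustrate the point about the anchors, here is a minimal sketch using only Python's standard re module. It assumes the pattern is applied to a whole sentence with re.findall, which may not be exactly how the package uses word_pattern; the sample sentence and all variable names are made up for illustration. It also shows that the JavaScript-style /.../ delimiters from the question are treated as literal slashes in Python.

```python
import re
import unicodedata

# Hypothetical sample sentence (not taken from the thread's data).
# NFC normalization keeps accented letters as single code points, so they
# fall inside the ranges below instead of splitting into letter + accent.
sentence = unicodedata.normalize("NFC", "Καλημέρα κόσμε, hello world")

# Greek and Greek Extended blocks plus basic Latin letters.
unanchored = re.compile(r"[A-Za-z\u0370-\u03ff\u1f00-\u1fff]+")
# Same class, anchored to the start and end of the whole string.
anchored = re.compile(r"^[A-Za-z\u0370-\u03ff\u1f00-\u1fff]+$")
# JavaScript-style delimiters: in Python the slashes are literal characters.
with_slashes = re.compile(r"/^[A-Za-z\u0370-\u03ff\u1f00-\u1fff]+$/")

words_unanchored = unanchored.findall(sentence)
words_anchored = anchored.findall(sentence)
words_with_slashes = with_slashes.findall(sentence)

print(words_unanchored)    # four words are found
print(words_anchored)      # empty: the sentence contains spaces and a comma
print(words_with_slashes)  # empty: a literal "/" can never precede "^"
```

If the pattern is matched against individual tokens rather than whole sentences, the anchors are harmless, which would be consistent with "not critically" above.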
I'm a little confused, because everything else looks like it should work.
You don't have any output? Blank values in the symbols table and an empty words table in result.db?
What does your sentence look like? Please give some examples.
I apologize for the slow reply; I caught a bit of a cold) I described the problem incorrectly in my previous post. In the result.xlsx file, the script outputs data only in the symbols column. This data corresponds to the character range in the allowed_symbols variable. All other columns of result.xlsx are empty.
I tried different XML files, including a Latin file from https://digiliblt.uniupo.it/, that is, a file that should work with the standard regex from your example. But result.xlsx is still completely empty except for the symbols column. The result.db file (I opened it with DB Browser) contains only the table structure, with no data.
However, when I tried German XML files from https://www.euromatrixplus.net/multi-un/ and left the code from your example unchanged, I got a result.db with both structure and data. The result.xlsx file did not contain any charts, but its Symbols, Symbol bigrams, Words, and Word bigrams columns were filled.
I was also getting this error:
Traceback (most recent call last):
  File "C:\xxx \analysis_result.py", line 5, in <module>
    res.treat(limits=(1000,) * 4, chart_limits=(20,) * 4, min_quantities=(10,) * 5)
  File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 205, in treat
    self.sheet_top_symbols(limits[0], chart_limits[0], min_quantities[0])
  File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 262, in sheet_top_symbols
    self.__fill_top_data(
  File "C:\xxx \AppData\Local\Programs\Python\Python311\Lib\site-packages\frequency_analysis\results.py", line 76, in __fill_top_data
    values[1][s][3] = (
                      ^
ZeroDivisionError: float division by zero
There are no problems with English-language XML files. It's possible that I just don't understand Python well enough. So I'm sorry)
Thanks again. The error from your traceback can occur if none of the cased characters from allowed_symbols occur in the text, in either upper or lower case. I fixed it; you can update your local package now.
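The failure mode described above can be sketched as follows. This is not the package's actual code: the function name and its arguments are hypothetical, and it only assumes that some ratio is computed over the combined upper-case and lower-case counts of a character. When the character occurs in neither case, the denominator is zero, so the sketch returns 0.0 instead of dividing.

```python
def upper_case_share(upper_count: int, lower_count: int) -> float:
    """Return the fraction of a character's occurrences that are upper case.

    Hypothetical helper: returns 0.0 when the character never occurs in
    either case, instead of raising ZeroDivisionError.
    """
    total = upper_count + lower_count
    if total == 0:
        # The character is listed as allowed but absent from the text.
        return 0.0
    return upper_count / total


print(upper_case_share(1, 3))
print(upper_case_share(0, 0))
```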
But the problem from the first part of your message is not related to this error. I tried processing the .txt files from https://digiliblt.uniupo.it, and it worked for me (the sentences are a bit broken because of readlines(), but the data is complete). The complete lack of data in your case leaves me stumped. Could there be a problem with the parsing of sentences?
Thanks for your reply.
Apparently, the problem was with the xml file.
I tried processing .xml files from various large thematic libraries like Perseus or digilibLT and constantly received an empty result file. When I got tired of it, I tried inserting a piece of Greek text into the markup of an XML file from euromatrixplus.net/multi-un/, the site you give in the example.
And it worked!
I'll try other sites and maybe try to figure out what the problem is with the xml that I used. And then I will write about it here.
But the main thing is that I was able to conduct an analysis and get the results!
Thanks again for the helpful answers and tips!
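For anyone comparing an XML file that works with one that produces empty results, a minimal sketch using only the standard library is shown below. It counts the element tags in each document so that differences in markup become visible. Both sample documents are made up for illustration and are not real excerpts from MultiUN or digilibLT; the cause of the empty results was not established in this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text: str) -> Counter:
    """Count how often each element tag occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


# Hypothetical sample with sentence-level <s> elements and no namespace.
plain_sample = (
    "<DOC><text><body><p>"
    "<s>Erster Satz.</s><s>Zweiter Satz.</s>"
    "</p></body></text></DOC>"
)
# Hypothetical TEI-style sample that declares a default namespace.
tei_sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Arma virumque cano.</p>"
    "</body></text></TEI>"
)

plain_tags = tag_counts(plain_sample)
tei_tags = tag_counts(tei_sample)

print(plain_tags)
print(tei_tags)
```

With ElementTree, elements under a default namespace get tags of the form {namespace}name, so code that looks up a bare tag name finds nothing in such a file. Whether that applies to the files discussed here is only a guess.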
Good, glad the problem was localized. Then I will leave this issue open. If you want, you can add your analysis to the examples afterwards.
Hello! Thanks again for your script! It works great! However, I have a question. It is more of a question about Python in general, so I will be grateful for any answer if you find time for it)
I want to parse an XML file in Greek. Do I understand correctly that all I have to change is the word_pattern argument and the allowed_symbols argument? In the first one, should I specify the regex, and in the second, the decimal values of the characters?
I experimented with different regexes like /^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$/ or [\u0370-\u03ff\u1f00-\u1fff]. The script runs without errors, but it doesn't output any data. My question is: am I making a mistake in the regex, or am I failing to set something up in the script?
My script looks like this: