D-K-E opened this issue 7 years ago
Hi,
Thanks for accepting our invitation. I hope we can work together when needed and help each other. :)
From what I understand, your approach is to read and parse the data in order to store it as JSON or XML, as you said. Our main aim, on the other hand, is to look at the structure and find the flaws in it, suggesting corrections and warnings whenever there is a violation.
We have just begun our implementation of the ATF parser. We are using the PLY (Python Lex-Yacc) tool, as it tokenizes and reads the file for us.
I would suggest using the flex tool in Python for your tokenization (step 6); it will do that job for you. :) (A rough sketch of such a lexer is further down in this message.) Also, some texts do not have the object type specified, as shown in this example: &P227691 = CBS 12666
@reverse @column 1
A 56 ||B 229
A 57 ||B 231
A 58 ||B 237
A 59 ||B 254
A 89 ||B 398
A 90 ||B 360
A 91 ||B 369
A 93 ||B 373
A 92
Dividing such texts on "@" would give an error according to your current algorithm.
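For reference, here is a minimal PLY lexer sketch of the kind of tokenization we mean; the token names and regular expressions are illustrative assumptions, not our actual grammar:

import ply.lex as lex

# Hypothetical token set: structure markers ("@reverse"), line numbers,
# and plain alphabetic signs; real C-ATF needs a richer set than this.
tokens = ("STRUCTURE", "LINENO", "SIGN")

t_STRUCTURE = r"@\w+"
t_LINENO = r"\d+"
t_SIGN = r"[A-Za-z][\w-]*"
t_ignore = " \t"

def t_error(t):
    # Skip any character the sketch does not recognise yet.
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("@reverse @column 1")
for tok in lexer:
    print(tok.type, tok.value)  # STRUCTURE @reverse, STRUCTURE @column, LINENO 1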
Your suggestions on our current implementation are welcome as well. :)
Also, can you explain your enumeration part in detail again? I do not understand it right now. Maybe once I do, it's possible we could optimise it. :)
Hi, thank you for pointing out the "@"; I'll see what I can do about it. Does CDLI provide anywhere the structure of the texts it contains? I started writing the parser thinking that the examples provided in the C-ATF section are general enough to work with the material. Just for the sake of clarification, JSON and XML output are really secondary goals: my first goal is to parse the text into a tree-like structure, then provide an API for interacting with the text tree. If the user then really wants to exploit the document in another environment, she/he can export it to JSON or XML.
As for the enumerations, once the text tree is established, so for something like
text = {
    "text_parts": [
        {"text_part_1": ...},
        {
            "text_part_2_property": "Something",
            "text_part_2_lines": [
                {"text_part_2_line_1": ...},
                {
                    "text_part_2_line_2_property": "Something",
                    "text_part_2_line_2_words": [
                        {"word_1": ...},
                        {
                            "word_2_property": "Something",
                            "word_2_signs": [
                                {"word_2_sign_1": ...},
                                {"word_2_sign_2": ...},
                            ],  # word_2_signs
                        },
                    ],  # words
                },
            ],  # lines
        },
    ],  # parts
}
At the enumeration stage, the algorithm would start indexing from the branches: signs first, then words, then lines, then parts. These index values would be used for determining relative sign/word positions and relative sign/word counts. They can be used later on for accessing the parts of the document through the API, or for navigating in the document with the API. The exact implementation of this stage depends on the design of the API, and I have not really decided on that yet. As I have said, my current goal is to arrive at the tree-like structure above.
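To make that concrete, here is a minimal sketch of such a bottom-up indexing pass over the tree; the key names ("text_parts", "lines", "words", "signs", "index") are illustrative assumptions, since the actual schema is not fixed yet:

def enumerate_tree(text):
    # Assign relative positions at every level: part in text, line in
    # part, word in line, sign in word.
    for part_index, part in enumerate(text.get("text_parts", [])):
        part["index"] = part_index
        for line_index, line in enumerate(part.get("lines", [])):
            line["index"] = line_index
            for word_index, word in enumerate(line.get("words", [])):
                word["index"] = word_index
                word["sign_count"] = len(word.get("signs", []))
                for sign_index, sign in enumerate(word.get("signs", [])):
                    sign["index"] = sign_index

The relative counts (for example word["sign_count"]) then come for free from the lengths of the lists.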
Just a quick question about the flex tool you've mentioned: what is the tokenization level, is it word level or sign level? I reread your response; it's word level, I am sorry.
Another thing, though: is the example you've provided in C-ATF? The documentation says there are only two protocols provided:
atf: lang XXX
atf: use math
while yours uses atf: use lexical. If the issue concerns the documentation, is there a more up-to-date version of it somewhere?
Here is an update on the current status of the parser, with some sample output.
It is far from complete, but I thought the lineGetter and alHandler classes might help you with parsing the entities starting with the underscore. It took me a while to figure out a relatively error-free way to get those occurrences.
To put the alHandler class in terms of your perspective: if the results of the grouping methods indicated in the class do not contain a None value, then there is a good chance that there is a closing underscore for every opening underscore. Whether the underscore is closed after one or more lines does not matter.
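For illustration, here is a minimal stand-in for that check (this is not the actual alHandler, just a simplified sketch that pairs underscores across the whole text):

def group_underscores(text):
    # Collect the position of every "_" in the text, regardless of
    # which line it appears on.
    positions = [i for i, ch in enumerate(text) if ch == "_"]
    # An odd count means some underscore was opened but never closed.
    if len(positions) % 2 != 0:
        return None
    # Pair each opening underscore with the next closing one; whether
    # the pair spans one line or several does not matter.
    return list(zip(positions[0::2], positions[1::2]))

A None result then signals exactly the kind of violation described above.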
Hi, thank you for the invitation! I am sending you a gist of what I have so far, here.
In terms of the architecture of the parser, at the low level there are testers, and there will be getters that use the tests to select which information to extract. I have written tests for almost all the main components of C-ATF, except the place value notation for numbers, of which I have only a very vague idea. The final output will be a Python dict; depending on the user input, a JSON file or an XML file can be generated from there. An example of the general schema of the Python dict is at the end of the gist. Some features included in the schema require several passes, like absolute and relative positions.
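As a minimal sketch of that tester/getter split (the names here are hypothetical, not the ones used in the gist):

import re

def is_text_line(line):
    # Tester: a normal text line starts with a line number such as "1.".
    return re.match(r"\s*\d+", line) is not None

def get_text_lines(lines):
    # Getter: uses the tester to select the lines to extract from.
    return [line for line in lines if is_text_line(line)]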
The basic process of the algorithm in my mind works like the following (this will probably change along the way).

Extraction of the information about the text:
1) Extract the C-ATF part from an archival view.
2) Split the text into parts on "@".
2a) Get the text id and the text language from the first part.
2b) Get the object type from the second part.
From the third part on, we should have only text content together with the part names.

Extraction of the information from the text (I am skipping the checks for whether a line is a text line, a comment with "#", or a line structure with "$"; instead, let's assume that we have normal lines starting with a number):
3) Map the text content to line numbers for each part.
4) Check if the text content has logograms.
5) Check if the logograms have empty space in them.
5a) Here I have not decided yet how I am going to handle logograms that have spaces in them or that belong to a certain word.
6) Split the text content into words.
7) Split the word content into signs.
8) Enumerations: at this last stage, I will start enumerating, first the signs to add their relative positions in the word, then the words for their relative positions in the line, and so on.

I am not sure how many passes it would require to fill the schema provided in the gist, but in theory something like this should work; a rough sketch of the splitting steps follows below.
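By way of illustration, a minimal sketch of steps 2, 6 and 7, assuming "-" as the sign separator inside a word (the function names are hypothetical, not from the gist):

def split_into_parts(text):
    # Step 2: split the text into parts on "@"; each part keeps its
    # part name ("reverse", "column 1", ...) as its first word.
    return [part.strip() for part in text.split("@") if part.strip()]

def split_into_words(line):
    # Step 6: split a text line into words on whitespace.
    return line.split()

def split_into_signs(word):
    # Step 7: split a word into signs on the "-" separator.
    return word.split("-")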
If you have any suggestions or any ideas that might be more efficient, or if anything seems odd in the code for some reason, feel free to correct me.