universal-ctags / ctags

A maintained ctags implementation
https://ctags.io
GNU General Public License v2.0

Defining parser for ENDF file format #2622

Closed jlconlin closed 4 years ago

jlconlin commented 4 years ago

I have a file format that doesn't have a parser (yet). I'd like to write one, if I can, so that I can use existing text-editor tools to move through the file naturally. I'm willing to do the work, but I'm not sure where to start. There are no keywords, since this is a data format rather than a computer language. I've written a simple syntax and folding definition for the Vim editor; I'm not sure whether that helps.

The different sections of the file are determined by the content of the last ten columns of each line. (I didn't create the format. Sorry.) Here is a sample:

                                                                  MMMMFFTTT
                               33        856        176          17434 1451
                               34          2        155          17434 1451
                               34         51        115          17434 1451
 0.000000+0 0.000000+0          0          0          0          07434 1  0
 0.000000+0 0.000000+0          0          0          0          07434 0  0
 7.418300+4 1.813790+2          0          0          1          07434 2151
 7.418300+4 1.000000+0          0          0          2          07434 2151
 1.000000-5 5.000000+3          1          7          0          17434 2151
 0.000000+0 0.000000+0          0          3          5          07434 2151
 0.000000+0 0.000000+0          2          0         24          47434 2151
 7.418300+4 1.813790+2          0          0          0          07434 3 28
-7.222000+6-7.222000+6          0          0          1         397434 3 28
         39          2                                            7434 3 28
 7.261820+6 0.000000+0 9.300000+6 0.000000+0 9.600000+6 2.18585-137434 3 28
 1.000000+7 5.01372-13 1.050000+7 1.32071-11 1.100000+7 8.70475-107434 3 28
 0.000000+0 0.000000+0          0          0          0          07434 3  0
 7.418300+4 1.813790+2          0          0          0          07434 3 37
-2.093600+7-2.093600+7          0          0          1         207434 3 37
 2.105140+7 0.000000+0 2.200000+7 7.150990-5 2.400000+7 2.707920-27434 3 37
 1.300000+8 5.411910-2 1.500000+8 3.895580-2                      7434 3 37
 0.000000+0 0.000000+0          0          0          0          07434 3  0
 7.418300+4 1.813790+2          0          0          0          07434 3 41
-1.328500+7-1.328500+7          0          0          1         267434 3 41
         26          2                                            7434 3 41
 1.335820+7 0.000000+0 1.550000+7 0.000000+0 1.600000+7 2.56183-147434 3 41
 1.700000+7 9.60380-12 1.800000+7 3.02742-10 1.900000+7 1.474340-77434 3 41
 1.300000+8 1.582280-2 1.500000+8 1.154350-2                      7434 3 41

I've labeled the columns MMMM, FF, and TT. Whenever one of these values changes, I need a "tag" (using the term loosely) to mark the change. Note that the fields are (kind of) nested: there are many TTs in each FF, and many FFs in each MMMM.
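The change detection described above can be sketched in a few lines of Python. This is only an illustration, not the ctags parser: the names `split_control` and `find_changes` are hypothetical, and the column positions (MAT in columns 67-70, MF in 71-72, MT in 73-75, after six 11-character data fields) are an assumption taken from the ENDF record layout that happens to match the sample lines.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Control(NamedTuple):
    mat: str  # columns 67-70 (assumed layout)
    mf: str   # columns 71-72
    mt: str   # columns 73-75


def split_control(line: str) -> Control:
    """Slice the raw MAT/MF/MT text out of one fixed-width record."""
    padded = line.rstrip("\n").ljust(75)  # pad short lines so slicing is safe
    return Control(padded[66:70], padded[70:72], padded[72:75])


def find_changes(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, kind, raw value) whenever a level changes.

    A change at an outer level (say MAT) also restarts every inner
    level, so MF and MT are reported again on the same line.
    """
    previous = None
    for lineno, line in enumerate(lines, start=1):
        current = split_control(line)
        for index, kind in enumerate(Control._fields):
            if previous is None or current[: index + 1] != previous[: index + 1]:
                yield lineno, kind, current[index]
        previous = current


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as endf_file:
        for lineno, kind, value in find_changes(endf_file):
            print(lineno, kind, repr(value))
```

Reporting only the changes keeps the list short enough to navigate; emitting one entry per line would be a one-line modification if that turns out to be more useful.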

I've attached an example file that contains a full example.

n-000_n_001.endf.txt

masatake commented 4 years ago

The format definition: https://www.nndc.bnl.gov/csewg/docs/endf-manual.pdf

jlconlin commented 4 years ago

The format definition: https://www.nndc.bnl.gov/csewg/docs/endf-manual.pdf

Yes, that is the format definition. The parser does not have to generate tags for all of it right now; we can add to it as needed. At first, it only needs to detect where MMMM, FF, and TT change.

masatake commented 4 years ago

Do you know any popular programming language, such as C? Could you tell me one that you know? I would like to use it as an example to explain the tags output.

masatake commented 4 years ago

You may know TeX. I will use it.

jlconlin commented 4 years ago

I'm mostly familiar with C++, Python, and LaTeX. I can do C, but I don't like to.

masatake commented 4 years ago

input.tex:

\section{A}
...
\subsection{B}
...
\subsubsection{C}

For the above input, ctags can generate the following tags file:

$ u-ctags -o - --fields=+K-l /tmp/input.tex 
A   input.tex   /^\\section{A}$/;"  section
B   input.tex   /^\\subsection{B}$/;"   subsection  section:A
C   input.tex   /^\\subsubsection{C}$/;"    subsubsection   subsection:A""B

For input.endf:

                               33        856        176          17434 1451
                               34          2        155          17434 1451
                               34         51        115          17434 1451
 0.000000+0 0.000000+0          0          0          0          07434 1  0

What kind of tags output do you want? My guess:

17434   input.endf  /^                               33        856        176          17434 1451$/;"   mmmm
12  input.endf  /^                               33        856        176          17434 1451$/;"   ff  mmmm:17343
51  input.endf  /^                               33        856        176          17434 1451$/;"   tt  ff:1734312
07434   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          07434 1  0$/;"   mmmm
1\  input.endf  /^ 0.000000+0 0.000000+0          0          0          0          07434 1  0$/;"   ff  mmmm:07434
\ 0 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          07434 1  0$/;"   tt ff:074341\ 

masatake commented 4 years ago

12  input.endf  /^                               33        856        176          17434 1451$/;"   ff  mmmm:17343
51  input.endf  /^                               33        856        176          17434 1451$/;"   tt  ff:1734312

This includes a typo. What I meant to write is:

14  input.endf  /^                               33        856        176          17434 1451$/;"   ff  mmmm:17343
51  input.endf  /^                               33        856        176          17434 1451$/;"   tt  ff:1734314

masatake commented 4 years ago

As far as I can tell from reading "Table 1: Key parameters defining the hierarchy of entries in an ENDF file", mmmm, ff, and tt may not be good names for the kinds.

mat (material) may be better than mmmm. mf (material file) may be better than ff. mt (material subdivision) may be better than tt.

masatake commented 4 years ago

I wonder how "1 " and " 0" should be tagged. Can we tag them as "1" and "0"? The question is whether the leading and trailing white space characters should be kept or not.
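One possible answer to the whitespace question, sketched in Python: strip the padding for the tag name, so that " 1" and "1 " both become "1". The helper names `tag_name` and `scope_path`, and the "." separator used to join the levels, are hypothetical choices for illustration and not something ctags prescribes.

```python
def tag_name(raw: str) -> str:
    """Drop the fixed-width padding so ' 1', '1 ' and '  0' become '1', '1', '0'."""
    return raw.strip()


def scope_path(*raw_fields: str, separator: str = ".") -> str:
    """Join normalized levels into one readable scope string.

    The '.' separator is an arbitrary choice for this sketch.
    """
    return separator.join(tag_name(field) for field in raw_fields)


if __name__ == "__main__":
    print(scope_path("  25", " 1", "451"))
```

Stripped names are easier to type in an editor's tag search than names that carry padding, at the cost of no longer matching the raw column text exactly.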

masatake commented 4 years ago

I found more typos. 17434 should be 7434. 07434 should be 7434.

masatake commented 4 years ago

Based on that guess, I wrote a parser. \x20 at the beginning of a line stands for a white space character.

input: n-000_n_001.endf.txt

[yamato@control]/tmp% u-ctags --fields=+K-l  --sort=no -o - input.endf 
\x20 1  input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mat
0   input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mf  mat:  1 
\x200   input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mt  mf:  1 0 
\x2025  input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mat
14  input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mf  mat: 25 
51  input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mt  mf: 25 14
1   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 1  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 1  0$/;"   mt  mf: 25 1 
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
21  input.endf  /^ 1.000000+0 1.000000+0          0          0          1          0  25 2151$/;"   mf  mat: 25 
51  input.endf  /^ 1.000000+0 1.000000+0          0          0          1          0  25 2151$/;"   mt  mf: 25 21
2   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 2  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 2  0$/;"   mt  mf: 25 2 
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
3   input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  1$/;"   mf  mat: 25 
\x201   input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  1$/;"   mt  mf: 25 3 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 3  0$/;"   mt  mf: 25 3 
\x202   input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  2$/;"   mt  mf: 25 3 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 3  0$/;"   mt  mf: 25 3 
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
4   input.endf  /^ 1.000000+0 1.000000+0          0          1          0          0  25 4  2$/;"   mf  mat: 25 
\x202   input.endf  /^ 1.000000+0 1.000000+0          0          1          0          0  25 4  2$/;"   mt  mf: 25 4 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 4  0$/;"   mt  mf: 25 4 
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
\x20 0  input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mat
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mf  mat:  0 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mt  mf:  0 0 
\x20-1  input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mat
0   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mf  mat: -1 
\x200   input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mt  mf: -1 0 

jlconlin commented 4 years ago

As far as I can tell from reading "Table 1: Key parameters defining the hierarchy of entries in an ENDF file", mmmm, ff, and tt may not be good names for the kinds.

mat (material) may be better than mmmm. mf (material file) may be better than ff. mt (material subdivision) may be better than tt.

Yes, you are right, mat, mf, and mt are the correct names for the hierarchy. I was trying not to get too much into the details. I'm impressed you were able to dig through that large document and find the important stuff. Thanks!

So when mat, mf, or mt turns to 0, that just means it is the last line of the material/file/section. I don't know whether that calls for a new tag. I'm still new to all of this.

Also, sometimes there are additional numbers beyond mt that are optional. These can be up to 5 digits in length.
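The two points above (zero values close a level, and an optional number of up to 5 digits may follow mt) could be handled as in the following Python sketch. The names `Record`, `parse_record`, and `closes` are hypothetical, the column positions are the same assumed layout as in the sample (mat in columns 67-70, mf in 71-72, mt in 73-75, optional number in 76-80), and treating a negative mat as the end of the whole file is an assumption based on the last record of the example output.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    mat: int
    mf: int
    mt: int
    ns: Optional[int]  # optional trailing number; None when absent or blank


def parse_record(line: str) -> Record:
    """Parse the control fields of one record.

    Raises ValueError if the mat/mf/mt columns are missing or not numeric.
    """
    padded = line.rstrip("\n").ljust(80)
    ns_text = padded[75:80].strip()
    return Record(
        mat=int(padded[66:70]),  # int() ignores the padding spaces
        mf=int(padded[70:72]),
        mt=int(padded[72:75]),
        ns=int(ns_text) if ns_text else None,
    )


def closes(record: Record) -> Optional[str]:
    """Name the level that this record terminates, or None for a data record."""
    if record.mat < 0:
        return "tape"      # assumption: mat -1 ends the whole file
    if record.mat == 0:
        return "material"
    if record.mf == 0:
        return "file"
    if record.mt == 0:
        return "section"
    return None
```

A tag generator could skip every record for which `closes` returns a value, so that only the first line of each material, file, and section produces a tag.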

jlconlin commented 4 years ago

Based on that guess, I wrote a parser. \x20 at the beginning of a line stands for a white space character.

input: n-000_n_001.endf.txt

[yamato@control]/tmp% u-ctags --fields=+K-l  --sort=no -o - input.endf 
\x20 1    input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mat
0     input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mf  mat:  1 
\x200 input.endf  /^ $Rev::          $  $Date::            $                             1 0  0$/;"   mt  mf:  1 0 
\x2025    input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mat
14    input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mf  mat: 25 
51    input.endf  /^ 1.000000+0 1.000000+0          0          0          0          2  25 1451$/;"   mt  mf: 25 14
1     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 1  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 1  0$/;"   mt  mf: 25 1 
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
21    input.endf  /^ 1.000000+0 1.000000+0          0          0          1          0  25 2151$/;"   mf  mat: 25 
51    input.endf  /^ 1.000000+0 1.000000+0          0          0          1          0  25 2151$/;"   mt  mf: 25 21
2     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 2  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 2  0$/;"   mt  mf: 25 2 
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
3     input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  1$/;"   mf  mat: 25 
\x201 input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  1$/;"   mt  mf: 25 3 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 3  0$/;"   mt  mf: 25 3 
\x202 input.endf  /^ 1.000000+0 1.000000+0          0          0          0          0  25 3  2$/;"   mt  mf: 25 3 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 3  0$/;"   mt  mf: 25 3 
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
4     input.endf  /^ 1.000000+0 1.000000+0          0          1          0          0  25 4  2$/;"   mf  mat: 25 
\x202 input.endf  /^ 1.000000+0 1.000000+0          0          1          0          0  25 4  2$/;"   mt  mf: 25 4 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 4  0$/;"   mt  mf: 25 4 
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mf  mat: 25 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  25 0  0$/;"   mt  mf: 25 0 
\x20 0    input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mat
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mf  mat:  0 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0   0 0  0$/;"   mt  mf:  0 0 
\x20-1    input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mat
0     input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mf  mat: -1 
\x200 input.endf  /^ 0.000000+0 0.000000+0          0          0          0          0  -1 0  0$/;"   mt  mf: -1 0 

So, I'm too much of a novice to fully comprehend what those tags mean. There should be three pieces of "metadata" for every line: mat, mf, and mt. I don't know if a tag needs to be generated for every line, or only when the section changes.

masatake commented 4 years ago

About the output, see tags(5) man page (https://docs.ctags.io/en/latest/man/tags.5.html).
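To make the layout of a tags line concrete, here is a small Python sketch that splits one line into its parts. It follows the general shape given in tags(5): tag name, file name, and address separated by tabs, then `;"` and tab-separated extension fields, where a field without a colon is the kind. The names `Tag` and `parse_tag_line` are hypothetical, and the sketch does not decode escapes such as `\x20`.

```python
from typing import Dict, NamedTuple, Optional


class Tag(NamedTuple):
    name: str
    path: str
    address: str            # the search pattern (or line number)
    kind: Optional[str]
    fields: Dict[str, str]  # remaining "key:value" extension fields, e.g. the scope


def parse_tag_line(line: str) -> Tag:
    """Split one tab-separated tags line into its components."""
    name, path, rest = line.rstrip("\n").split("\t", 2)
    address, _, extension = rest.partition(';"')
    kind = None
    fields: Dict[str, str] = {}
    for item in extension.split("\t"):
        if not item:
            continue
        key, colon, value = item.partition(":")
        if not colon:
            kind = item       # a bare field is the kind
        elif key == "kind":
            kind = value
        else:
            fields[key] = value
    return Tag(name, path, address, kind, fields)
```

Read this way, each line of the output above says: a tag of the given name and kind starts at the line matching the pattern, and its parent is named by the scope field (for example `mat:` or `mf:`).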

I don't know if a tag needs to be generated for every line, or only when the section changes.

I don't know that either. I can help you write a parser. However, I cannot help you decide what you want, because I don't know the ENDF format well, neither its syntax nor its purpose. It depends very much on how you use the tags output. I guess you want to navigate the files in Vim. That requires knowledge of Vim, and I don't know Vim well.

Here is the parser I wrote. There are several ways to write a parser in ctags. This one falls into the category of "line oriented parser written in C".

https://github.com/masatake/ctags/commit/e8e0015393ae7a3b447ee886bd0884f45d11ced2?branch=e8e0015393ae7a3b447ee886bd0884f45d11ced2&diff=unified

You can edit the parser as you want.

jlconlin commented 4 years ago

Here is the parser I wrote. There are several ways to write a parser in ctags. This one falls into the category of "line oriented parser written in C".

masatake@e8e0015?branch=e8e0015393ae7a3b447ee886bd0884f45d11ced2&diff=unified

You can edit the parser as you want.

That looks great! If I understand the C well enough, you are looking at the end of each line and updating the values of mat, mf, and mt. The only thing I would change is that mt is three digits long; you have it as only two.

I'll clone it to my space, make the change, and see if I can't get it to work.

masatake commented 4 years ago

That looks great! If I understand the C well enough, you are looking at the end of each line and updating the values of mat, mf, and mt.

Yes.

The only thing I would change is that mt is three digits in length; you only have it as two.

Oh, sorry.

I'll clone it to my space, make the change, and see if I can't get it to work.

OK. Feel free to reopen this if you need to.