hcarter333 / ham-radio-freedom


Create python question parser (again) #2

Closed hcarter333 closed 1 year ago

hcarter333 commented 1 year ago

Build a parser that does the following:

Isolates all questions out of the pool, using sed to start at lines that match ^T\d[A-Za-z]\d\d (for the General class pool use 'G' instead of the 'T' shown above) and continuing until a line matching ^~~ is found.

Pulls out the correct answer, found with ([A-D]).

Collapses each question and its answers into a single line matching the following example format:

1|T|1|A|01|D|[97.3(a)(4)] For whom is the Amateur Radio Service ....

2|T|1|A|02|C|[97.1] What agency regulates an.....

Here the first field is a unique sequence number and the second is T for the Technician exam. The next four are deconstructed from the first line of the question and give group|subgroup|question number|correct_answer. The last five fields are the question and its four possible answers. All questions go on the same line for the moment.

Changes all double quotes to backslash-escaped double quotes (\"). Use sed for this as well.

Construct a test case that loads the resulting line into JavaScript in an HTML page and passes if there are no errors.
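Since the issue title asks for Python, here is a minimal sketch of the last two steps (quote escaping and the HTML test page). The function names, the `pool` variable, and the sample question text are all made up for illustration; the page would still need to be opened in a browser and checked for console errors by hand.

```python
def escape_double_quotes(line: str) -> str:
    # Put a backslash in front of every double quote, like a sed s/// substitution would.
    return line.replace('"', '\\"')


def build_test_page(line: str) -> str:
    # Wrap one collapsed pool line in a JavaScript string literal inside a bare HTML page.
    # "pool" is a hypothetical variable name, not something from the repo.
    escaped = escape_double_quotes(line)
    return (
        "<html><body><script>\n"
        f'var pool = "{escaped}";\n'
        "</script></body></html>\n"
    )


# Made-up sample line in the 1|T|1|A|01|D|... format; not real pool text.
sample = '1|T|1|A|01|D|What is meant by "amateur"?'
print(build_test_page(sample))
```

This only handles double quotes. Any backslashes already present in the pool text would need their own escaping before the line is safe inside a JavaScript string.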

hcarter333 commented 1 year ago

From https://unix.stackexchange.com/questions/236751/how-to-grep-lines-between-start-and-end-pattern use

sed -n '/aaa/,/cdn/p' file

to grab lines between two patterns.

So, for ^T\d[A-Za-z]\d\d until a line matching ^~~ (sed has no \d, so [0-9] is used instead):

sed -n '/^T[0-9][A-Za-z][0-9][0-9]/,/^\~\~/p' tech_raw.txt

This has been tested, and leaves us in a good place to add groups for the moment.
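For comparison, the same start-to-end range selection can be sketched in Python. This is only an illustration of what the sed range address does; `lines_between` is a hypothetical helper name, and the pool lines below are placeholders rather than real exam text.

```python
import re

# Same patterns as the sed command, with [0-9] standing in for \d.
START = re.compile(r"^T[0-9][A-Za-z][0-9][0-9]")
END = re.compile(r"^~~")


def lines_between(lines, start=START, end=END):
    """Yield each line from a start match through the next end match, inclusive."""
    inside = False
    for line in lines:
        if inside:
            yield line
            if end.search(line):
                inside = False
        elif start.search(line):
            inside = True
            yield line


# Placeholder pool text, not real exam content.
raw = [
    "SUBELEMENT heading that should be skipped",
    "T1A01 (D) [cite]",
    "Question text?",
    "A. first",
    "B. second",
    "C. third",
    "D. fourth",
    "~~",
    "",
    "T1A02 (C) [cite]",
    "Another question?",
    "A. first",
    "B. second",
    "C. third",
    "D. fourth",
    "~~",
]

for line in lines_between(raw):
    print(line)
```

As with sed, the end pattern is only checked on lines after the one that opened the range, so the heading and the blank line between questions are dropped while each ~~ terminator is kept.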

hcarter333 commented 1 year ago

Using groups to get the first several fields:

sed -n '/^\(T\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' tech_raw.txt | sed 's/^\(T\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5|/g'
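A rough Python equivalent of the second sed (the substitution with groups), in case it helps to see the capture groups spelled out. The header line used here is reconstructed from the example format earlier in this issue, so its exact layout in the raw pool file is an assumption; `rewrite_header` is a made-up name.

```python
import re

# Groups: 1 = T, 2 = digit, 3 = letter, 4 = two-digit question number, 5 = answer letter.
HEADER = re.compile(r"^(T)([0-9])([A-Z])([0-9][0-9]).*\(([A-D])\)")


def rewrite_header(line: str) -> str:
    # Replace the matched "T1A01 (D)" part with "&T|1|A|01|D|".
    # The leading & marks the start of a record; lines that do not match pass through unchanged.
    return HEADER.sub(r"&\1|\2|\3|\4|\5|", line)


# Assumed raw layout, rebuilt from the 1|T|1|A|01|D|[97.3(a)(4)] example.
print(rewrite_header("T1A01 (D) [97.3(a)(4)]"))
print(rewrite_header("A. An answer line is left alone"))
```

One thing to watch: `.*` is greedy, so if a header line ever contained a second parenthesized capital A through D after the answer key, that later one would be captured instead.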

Then I used nedit along with Excel to accomplish the rest of the formatting: first isolating each question and its associated answers on one line in nedit, then using Excel to add the unique identifiers at the start of each question. After that, I did a small amount of debugging to get rid of semicolons and single ticks that were embedded in the questions. Finally, I replaced the figure references. [image] Still need to resize the figures, but this is low priority.

hcarter333 commented 10 months ago

This looks like the full answer. It uses tr (translate) to swap every \n (linefeed) character for a | separator. Turns out sed isn't suited for this because it works on one line of input at a time. The tr suggestion comes from the previous link. Then, we use sed one last time to patch up the resulting '|&' combinations at the ends of questions.

sed -n '/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' gen_pool.txt | sed 's/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5/g' | sed 's/\~\~//g' | tr -s '\n' | tr '\n' '|' | sed 's/|&/\&/g'
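The tail of that pipeline (everything after the header rewrite) can be sketched in Python roughly as follows. This is an approximation of the tr/sed steps, not a byte-for-byte port; `collapse` is a hypothetical name and the record text is placeholder data.

```python
def collapse(lines):
    # Blank out the "~~" terminators, then drop empty lines
    # (roughly the sed 's/~~//g' and tr -s '\n' steps).
    kept = [ln.replace("~~", "") for ln in lines]
    kept = [ln for ln in kept if ln != ""]
    # Turn every line ending into "|" (the tr '\n' '|' step),
    # so the last line also ends in "|".
    joined = "".join(ln + "|" for ln in kept)
    # Remove the "|" that now sits in front of each "&" record marker.
    return joined.replace("|&", "&")


# Placeholder lines as they might look after the header rewrite; not real pool text.
lines = [
    "&G|1|A|01|C [cite]",
    "Question one?",
    "A. a1",
    "B. b1",
    "C. c1",
    "D. d1",
    "~~",
    "",
    "&G|1|A|02|B [cite]",
    "Question two?",
    "A. a2",
    "B. b2",
    "C. c2",
    "D. d2",
    "~~",
]

print(collapse(lines))
```

Joining in memory sidesteps the line-at-a-time limitation that made sed awkward here, at the cost of holding the whole pool in one string.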

hcarter333 commented 10 months ago

Daize came up with a different answer that I like better:

sed -n '/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' gen_pool.txt | sed 's/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5|/g' | sed 's/\~\~//g' | sed 's/\(^[A-D]\..*\)/|\1/g' | tr -d '\n'

hcarter333 commented 10 months ago

OK, the previous command didn't put in the sequence numbers. Running this on the output of that command (gen_line.txt) wraps things up:

tr '&' '\n' < gen_line.txt | awk '{ printf "&%03d|%s", NR, $0 }' > gen_numbered_line.txt
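The numbering step is small enough to sketch in Python too. This assumes the collapsed text uses & as the record marker, as in the commands above; `number_records` is a made-up name and the sample string is placeholder data.

```python
def number_records(collapsed: str) -> str:
    # Drop any leading "&" so the first record is not an empty string,
    # then split on the "&" record markers (the tr '&' '\n' step).
    records = collapsed.lstrip("&").split("&")
    # Re-emit each record as "&NNN|record" with a zero-padded sequence number,
    # matching awk's printf "&%03d|%s" with NR.
    return "".join(f"&{i:03d}|{rec}" for i, rec in enumerate(records, start=1))


# Placeholder input; not real pool text.
sample = "&G|1|A|01|C|q1&G|1|A|02|B|q2"
print(number_records(sample))
```

Like the tr version, this splits on every &, so a literal ampersand inside a question or answer would start a bogus record and would need to be handled first.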

hcarter333 commented 10 months ago

Be sure to remove the leading & before the first question before executing the above; otherwise tr emits an empty first line and awk numbers it 001.