Closed hcarter333 closed 1 year ago
From https://unix.stackexchange.com/questions/236751/how-to-grep-lines-between-start-and-end-pattern use
sed -n '/aaa/,/cdn/p' file
to grab lines between two patterns.
So, for lines starting at ^T\d[A-Za-z]\d\d until a line matching ^~~:
sed -n '/^T[0-9][A-Za-z][0-9][0-9]/,/^\~\~/p' tech_raw.txt
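To see what the range address does, here is a minimal sketch run against a made-up sample rather than the real tech_raw.txt. The sample text and the extract_questions helper name are assumptions for illustration only.

```shell
# Hypothetical sample input; the real pool file (tech_raw.txt) is not reproduced here.
sample='Preamble text
T1A01 (D) [97.3(a)(4)]
For whom is the service intended?
A. First answer
~~
Text between questions
T1A02 (C) [97.1]
~~'

# Print every line from a question header through the next line starting with ~~.
# -n suppresses default output, so only lines inside the range are printed.
extract_questions() {
  sed -n '/^T[0-9][A-Za-z][0-9][0-9]/,/^~~/p'
}

printf '%s\n' "$sample" | extract_questions
```

Lines outside a header-to-~~ range ("Preamble text", "Text between questions") are dropped; everything inside each range, including the ~~ terminator, is kept.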
Has been tested, and leaves us in a good place for groups for the moment.
Using groups to get the first several fields:
sed -n '/^\(T\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' tech_raw.txt | sed 's/^\(T\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5|/g'
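A quick sketch of what the substitution half does to one header line. The header text is a hypothetical example in the pool's format, and split_header is just an illustrative name.

```shell
# Split a question header into &-prefixed, |-separated fields.
# Groups: \1 = T, \2 = first digit, \3 = letter, \4 = two-digit number,
# \5 = the correct answer letter found inside parentheses.
split_header() {
  sed 's/^\(T\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5|/g'
}

# Hypothetical header line in the same shape as the pool file.
printf 'T1A01 (D) [97.3(a)(4)]\n' | split_header
# prints: &T|1|A|01|D| [97.3(a)(4)]
```

One thing to watch: the .* before the parenthesized letter is greedy, so if a header ever had a later uppercase (A) through (D) in its citation, the match would extend to that one instead.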
Then used nedit along with Excel to accomplish the rest of the formatting: first isolated each question and its associated answers on a single line in nedit, then used Excel to add the unique identifiers at the start of each question. After that, did a small amount of debugging to get rid of semicolons and single ticks that were embedded in questions. Finally, replaced the figure references. Still need to resize the figures, but this is low priority.
This looks like the full answer. It uses tr (translate) to swap all \n (linefeed) characters for | separators. Turns out sed isn't suited for this because it works on a line of input at a time. The tr suggestion comes from the previous link. Then, we use sed one last time to patch up the resulting '|&' combinations at the ends of questions.
sed -n '/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' gen_pool.txt | sed 's/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5/g' | sed 's/\~\~//g' | tr -s '\n' | tr '\n' '|' | sed 's/|&/\&/g'
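The newline handling in that pipeline can be looked at in isolation. This is only a sketch on made-up input, and join_lines is a name invented here.

```shell
# Squeeze runs of blank lines down to single newlines, then turn every
# remaining newline into a | separator. tr works on the whole stream,
# so it can join lines, which a per-line sed substitution cannot do.
join_lines() {
  tr -s '\n' | tr '\n' '|'
}

# Hypothetical question text with a blank line in the middle.
printf 'question text\n\nA. one\nB. two\n' | join_lines
# prints: question text|A. one|B. two|
```

The trailing | comes from the final newline of the input, which is why the full pipeline needs the last sed to patch up the '|&' combinations.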
Daize came up with a different answer that I like better:
sed -n '/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*\(([A-D])\)/,/^\~\~/p' gen_pool.txt | sed 's/^\(G\)\([0-9]\)\([A-Z]\)\([0-9][0-9]\).*(\([A-D]\))/\&\1|\2|\3|\4|\5|/g' | sed 's/\~\~//g' | sed 's/\(^[A-D]\..*\)/|\1/g' | tr -d '\n'
OK, the previous comment didn't put in line numbers. Running this on the output of that command (gen_line.txt) wraps things up:
tr '&' '\n' < gen_line.txt | awk '{ printf "&%03d|%s", NR, $0 }' > gen_numbered_line.txt
Be sure to remove the leading & before the first question before executing the above.
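The manual step could probably be folded into the pipeline. A sketch, assuming the input is one long line of &-separated questions like gen_line.txt; number_questions is a hypothetical helper name.

```shell
# Drop a leading & on the first line, split on &, then number each question.
number_questions() {
  sed '1s/^&//' | tr '&' '\n' | awk '{ printf "&%03d|%s", NR, $0 }'
}

# Hypothetical two-question input in the same shape as gen_line.txt.
printf '&T|1|A|01|D|first question&T|1|A|02|C|second question' | number_questions
# prints: &001|T|1|A|01|D|first question&002|T|1|A|02|C|second question
```

Without the sed step, the leading & becomes an empty first record, and every question number ends up off by one.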
Build a parser that does the following:
Isolates all questions out of the pool, using sed to start at lines that match ^T\d[A-Za-z]\d\d (for the general class pool, use 'G' instead of the 'T' shown above) and continue until a line matching ^~~ is found.
Pulls out the correct answer, found with ([A-D]).
Collapses question and answers into a single line matching the following example format:
1|T|1|A|01|D|[97.3(a)(4)] For whom is the Amateur Radio Service ....
2|T|1|A|02|C|[97.1] What agency regulates an.....
Where the first field is a unique number in sequence, the second is T for the technician exam, the following four are deconstructed from the first line of the question and indicate group|subgroup|question_number|correct_answer, and the last five fields are the question and all four possible answers. All questions go on the same line for the moment.
Change all double quotes to backslash_double_quote style. Use sed for this as well.
Construct a test case that loads the resulting line into a JavaScript script in an HTML page and passes if there are no errors.
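For the double-quote step in the list above, a minimal sed sketch. The function name escape_quotes and the sample question are made up for illustration.

```shell
# Replace every " with \" so the line can sit inside a double-quoted
# JavaScript string. In the sed replacement, \\ yields one literal backslash.
escape_quotes() {
  sed 's/"/\\"/g'
}

# Hypothetical question text containing double quotes.
printf 'What does "QRM" mean?\n' | escape_quotes
# prints: What does \"QRM\" mean?
```

This only handles double quotes; any backslashes already present in the question text would need their own escaping pass first.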