visit-dav / visit

VisIt - Visualization and Data Analysis for Mesh-based Scientific Data
https://visit.llnl.gov
BSD 3-Clause "New" or "Revised" License

Create cross-referenced glossary #18737

Open markcmiller86 opened 1 year ago

markcmiller86 commented 1 year ago

Describe what needs to be documented.

Sometimes, finding the right place in the documentation requires knowing a priori what you are looking for.

If we had a complete glossary of terms that was also heavily cross-referenced to the sections where those terms are used, it could help.

For example, a user looking for halo zone related material might search for halo and currently get nothing. But, if we had a glossary with an entry like...

ghost zones: In VisIt_, ghost zones refers to zones that exist within a given domain solely to support computations along the boundaries of the domain. Ghost zones are never actually rendered. Ghost zones are sometimes called halo zones, invisible zones, or overlap zones.

You can find material regarding ghost zones in these sections of the VisIt_ documentation: 1, 5.5, 11.2.1, ...

Then any user using different terminology could at least arrive here, confirm whether the term matches their definition, and then look at those places in the manual.
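
For what it's worth, Sphinx (which builds these docs) already has machinery for the linking half of this: a `glossary` directive plus a `:term:` role. A minimal sketch of what the entry above might look like (wording is just my strawman, and `VisIt_` assumes the docs' existing link target):

```rst
.. glossary::

   ghost zones
      In VisIt_, zones that exist within a given domain solely to support
      computations along the boundaries of the domain; they are never
      actually rendered. Sometimes called *halo zones*, *invisible zones*,
      or *overlap zones*.
```

Anywhere else in the manual, writing `` :term:`ghost zones` `` renders as a link back to this entry, and the built-in search should find halo via the entry body. The per-section cross-reference lists would still need to be curated by hand (or generated).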

markcmiller86 commented 10 months ago

So, I started down this path. But, maybe I've got the wrong thinking cap on.

My thinking was to get a list of all the words in VisIt docs that are unusual. So, I did some processing to emit all the unique words in VisIt docs....

```sh
cd docs
# Emit all unique words, one per line: drop reST anchor lines and image
# references, lowercase, split on separator characters, strip leftover
# punctuation and digits, then de-duplicate.
find . -name '*.rst' -exec cat {} + \
    | grep -v '\.\. _' | grep -v '\.png' \
    | tr '[:upper:]' '[:lower:]' \
    | tr ' ({[<./_:;%+=-' '[\n*]' \
    | tr -d '0-9,`"/:;(){}[]+=_#~&%?<>|*^@$!-' \
    | tr -d "'" \
    | sort -u > words.txt
```

The above creates a file with about 10.5K words, one word per line. However, about 10% of these are gobbledygook from reStructuredText formatting (e.g., anchor names and image names). No matter what I tried, I wasn't able to purge these from the output.
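
One possible way around the gobbledygook (a sketch, untested against our tree): let docutils parse the reST instead of fighting it with grep and tr, and harvest only the Text nodes, so anchors, image targets, and directive arguments never enter the word stream. Sphinx-specific directives would still get flagged as unknown and leak some error-message noise:

```python
#!/usr/bin/env python3
# Sketch: collect prose-only words from .rst files via docutils.
import pathlib
import re
from docutils import nodes
from docutils.core import publish_doctree

words = set()
for rst in pathlib.Path(".").rglob("*.rst"):
    doctree = publish_doctree(
        rst.read_text(encoding="utf-8"),
        settings_overrides={"report_level": 5},  # suppress parser chatter
    )
    # Text nodes carry rendered prose only -- no ".. _anchor:" targets,
    # no image file names, no directive options.
    for text in doctree.findall(nodes.Text):
        words.update(re.findall(r"[a-z]+(?:['-][a-z]+)*", text.astext().lower()))

pathlib.Path("words.txt").write_text("\n".join(sorted(words)) + "\n")
```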

I then found a list of common English language words and created a .txt file of those (not easy because of the HTML in which this list was embedded). There is a danger here in that a lot of the domain-specific terminology VisIt uses may indeed involve common English words (e.g., ghost).

I then removed the common English words from the unique words from VisIt's docs using...

```sh
# Keep only words not on the common-English list (-F: fixed strings,
# -x: whole-line match, -v: invert, -f: patterns from file).
grep -F -x -v -f common-english-words.txt words.txt > candidate-visit-terms.txt
```

common-english-words.txt.gz candidate-visit-terms.txt.gz

The next step is probably to go through the ~9.1K words in candidate-visit-terms.txt and remove the ones we don't think need to appear in a glossary. I started that and realized it would take a while, which is why I think I might have the wrong thinking cap on.
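
If it helps take some drudgery out of that pass, one idea (a sketch; assumes candidate-visit-terms.txt is one term per line): rank the candidates by how many distinct .rst files mention them, so one-off junk sinks to the bottom and the obvious glossary material floats to the top.

```python
#!/usr/bin/env python3
# Sketch: rank candidate terms by document frequency across docs/*.rst.
import collections
import pathlib
import re

candidates = set(pathlib.Path("candidate-visit-terms.txt").read_text().split())

file_count = collections.Counter()
for rst in pathlib.Path("docs").rglob("*.rst"):
    seen = set(re.findall(r"[a-z']+", rst.read_text(encoding="utf-8").lower()))
    file_count.update(seen & candidates)  # count each file once per term

# Most widely used candidates first; review top-down.
for term, n in file_count.most_common():
    print(f"{n:4d}  {term}")
```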

markcmiller86 commented 2 months ago

I made a pass over candidate-visit-terms.txt and pared it down from 9.1K terms to about 3.1K. I think this list represents enough terminology to inform a decent first pass at a glossary. Not all words in the file are themselves actual terms, but they point to terms that should be defined.

final-candidate-visit-terms.txt

JustinPrivitera commented 2 months ago

Thanks Mark! I wonder how we should attack this... is this code sprint material?

markcmiller86 commented 2 months ago

> is this code sprint material?

Yes, for sure.