Dataset: plain text version of the various data sources, generated directly from the source files

codificat commented 1 year ago

Describe the solution you'd like

A mechanism to obtain (a set of) plain text (ascii files) directly from the source files from the various documentation sources around ROSA wherever possible, to avoid having to rely on text extraction from rendered websites or PDF files.

Describe alternatives you've considered

We have a collection of various PDF files, from which plain text can be extracted. However, standard text extration from PDF has some limitations / problems:

hyperlinks are not visible
document structure is harder to retain
the PDF renderer introduces noise (page headers/footers, numbered references, etc)
the PDFs are harder to refresh / keep up to date with the current docs

Additional context

Sources include:

[ ] OpenShift docs repo: https://github.com/openshift/openshift-docs (asciidoc)
- this includes ROSA product documentation, and also all OpenShift product documentation
[ ] ROSA Workshop: https://github.com/openshift-cs/rosaworkshop (markdown)
[ ] MOBB: https://github.com/rh-mobb/documentation (markdown)

codificat commented 1 year ago

An update about the research so far.

I initially started by using Pandoc on the different doc sources. However, Pandoc's plain text output is not ideal for our purposes:

URLs from links get removed
Document structure (headings, lists, code) is not reflected in a "tokenizable" way
little control of the output, making above points worse

I believe that the end goal is to get a plain text output that:

includes URLs where relevant
can help identify the document structure
is not "polluted" by the rendering process (e.g. no TOC, page numbers, irrelevant headings, code annotations, etc.)

Searching for an approach that can satisfy these: I believe that going through an intermediate format might help.

WIP options being explored that include an intermediate format / process:

For the AsciiDoc sources: maybe man page?

asciidoctor -b manpage rosa-get-started-cli.adoc
man -l rosa-get-started-cli.1 > rosa-get-started-cli.txt

For the Markdown sources: maybe Texinfo?

for file in *.md; do
    base=$(basename $file .md)
    pandoc -t texinfo -o $base.texi --columns=666 $file
    makeinfo --plaintext --no-number-sections --no-split -o $base.txt $base.texi
done

I hit a problem with this while trying to process the rosaworshop md files:

16-deploy_ui.texi:7: raising the section level of @section which is too low
17-simple_deploy.texi:7: raising the section level of @section which is too low
1-account_setup.texi:7: raising the section level of @section which is too low
1-account_setup.texi:39: raising the section level of @subsection which is too low
2-deploy.texi:228: @node `Create account roles' previously defined
2-deploy.texi:51: here is the previous definition as @node

Personally I would try to use org-mode as that intermediate format, but then Emacs would have to be involved in the production of the final text, and I don't know how easy that is to incorporate into the AI/ML pipeline when time comes.

codificat commented 1 year ago

In the end we opted for Markdown as the plain text file format, as it meets the criteria highlighted in previous comments: it keeps URLs and document structure, and it is not polluted by rendering artifacts.

There is a copy of the (markdown) generated files from #14 for the ROSA documentation uploaded to the s3://ET-DS-FOUNDATION-MODELS bucket in PSI. Markdown versions of the ROSA workshop and MOBB website are directly available in data/external in this repo.

In general: the script in #14 should help us translate docs to markdown when needed, and the other sources we have so far are already in markdown format.

Is there anything more needed for this issue or should we close it?

I guess that a next step here is to have a pipeline that integrates data collection with training/fine-tuning, but I believe that would be a separate new issue instead.

Shreyanand commented 1 year ago

LGTM! Closed by #14

redhat-et / foundation-models-for-documentation