redhat-et / foundation-models-for-documentation

Improve ROSA customer experience (and customer retention) by leveraging foundation models to do “gpt-chat” style search of Red Hat customer documentation assets.

Dataset: plain text version of the various data sources, generated directly from the source files #6

Closed: codificat closed this issue 1 year ago

codificat commented 1 year ago

Describe the solution you'd like

A mechanism to obtain (a set of) plain text (ASCII) files directly from the source files of the various documentation sources around ROSA, wherever possible, to avoid having to rely on text extraction from rendered websites or PDF files.

Describe alternatives you've considered

We have a collection of various PDF files from which plain text can be extracted. However, standard text extraction from PDF has some limitations and problems: it tends to lose hyperlink targets and document structure (headings, section nesting), and the output is often polluted by rendering artifacts such as running page headers and footers. The sketch below illustrates this approach.
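As an illustration, here is a minimal sketch of plain-text extraction from a PDF using the pypdf library. The filename is hypothetical, and this is just an example of the technique, not a script from this repo:

```python
# Minimal sketch: plain-text extraction from a PDF with pypdf.
# "rosa-docs.pdf" is a hypothetical filename. Requires: pip install pypdf
from pypdf import PdfReader

reader = PdfReader("rosa-docs.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# The result typically loses hyperlink targets and heading structure, and
# often contains artifacts such as running headers/footers and hard line
# wraps inherited from the PDF layout.
print(text[:500])
```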

Additional context

Sources include:

- the official ROSA documentation
- the ROSA workshop
- the MOBB website

codificat commented 1 year ago

An update about the research so far.

I initially started by using Pandoc on the different doc sources. However, Pandoc's plain text output is not ideal for our purposes: among other things, it drops link targets (URLs) and flattens the document structure, so headings are no longer distinguishable from body text (see the sketch below).
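For reference, a minimal sketch of this kind of Pandoc conversion, assuming pypandoc is installed with the pandoc binary on PATH, and using a hypothetical rendered page (page.html) as input; this is not the exact invocation used:

```python
# Sketch: Pandoc's "plain" writer via pypandoc.
# Assumes: pip install pypandoc, and the pandoc binary on PATH.
# "page.html" is a hypothetical rendered doc page used as input.
import pypandoc

# The "plain" writer strips all markup: link targets (URLs) disappear and
# headings become indistinguishable from body text.
plain_text = pypandoc.convert_file("page.html", "plain")
print(plain_text[:500])
```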

I believe that the end goal is to get a plain text output that:

- keeps URLs (link targets)
- preserves the document structure
- is not polluted by rendering artifacts

Searching for an approach that can satisfy these criteria, I believe that going through an intermediate format might help.

Work-in-progress options being explored that involve an intermediate format / process:

codificat commented 1 year ago

In the end we opted for Markdown as the plain text file format, as it meets the criteria highlighted in previous comments: it keeps URLs and document structure, and it is not polluted by rendering artifacts.
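For illustration, a conversion to Markdown with the same tooling could look like the sketch below. To be clear, this is not the script from #14, just a minimal example of the idea, with the same assumptions as the earlier sketch (pypandoc installed, pandoc on PATH, a hypothetical page.html input):

```python
# Sketch: converting a doc page to GitHub-flavored Markdown via pypandoc.
# NOT the #14 script; "page.html" and "page.md" are hypothetical names.
import pypandoc

# The "gfm" writer keeps link URLs and heading levels, i.e. exactly the
# information that the plain-text output was losing.
pypandoc.convert_file("page.html", "gfm", outputfile="page.md")
```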

There is a copy of the generated (Markdown) files from #14 for the ROSA documentation, uploaded to the s3://ET-DS-FOUNDATION-MODELS bucket in PSI (see the sketch below for one way to fetch them). Markdown versions of the ROSA workshop and the MOBB website are directly available under data/external in this repo.
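For anyone who wants to fetch those files, a hedged sketch using boto3 follows. The endpoint URL is a placeholder (PSI exposes an S3-compatible API, but the real endpoint and credentials must be supplied through the usual AWS configuration/environment variables):

```python
# Hedged sketch: list the generated Markdown files in the PSI bucket.
# The endpoint below is a PLACEHOLDER, not the real PSI endpoint;
# credentials are read from the standard AWS environment/config.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.psi.example.com",  # placeholder endpoint
)

for obj in s3.list_objects_v2(Bucket="ET-DS-FOUNDATION-MODELS").get("Contents", []):
    print(obj["Key"])
```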

In general, the script in #14 should help us translate docs to Markdown when needed, and the other sources we have so far are already in Markdown format.

Is there anything more needed for this issue or should we close it?

I guess a next step here is to have a pipeline that integrates data collection with training/fine-tuning, but I believe that should be tracked as a separate issue.

Shreyanand commented 1 year ago

LGTM! Closed by #14