redhat-et / foundation-models-for-documentation

Improve ROSA customer experience (and customer retention) by leveraging foundation models to do “gpt-chat” style search of Red Hat customer documentation assets.

Add some sample plain text data #11

Closed codificat closed 1 year ago

codificat commented 1 year ago

Related to #6.

This adds a few sample plain text files for initial testing. These are the contents of the mobb.ninja and ROSA workshop sites, converted to plain text from their respective Markdown sources using pandoc:

# Convert each Markdown file to a plain text file with the same base name
for file in *.md; do
    pandoc -t plain -o "$(basename "$file" .md).txt" --columns=666 "$file"
done

There are known issues in these files, though. In particular, this conversion loses the URLs in links.

Planning to address this in the future by using an intermediate format as part of a more elaborate conversion process.
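One possible shape for that (a rough sketch, not a settled design): use pandoc's JSON AST as the intermediate format and rewrite Link nodes so their target URL survives in the plain text output. This assumes pypandoc is installed; rewrite_links and the page.md filename are written here purely for illustration.

import json

import pypandoc

def rewrite_links(node):
    """Recursively replace Link nodes with their text plus the target URL."""
    if isinstance(node, dict):
        if node.get("t") == "Link":
            _attr, inlines, (url, _title) = node["c"]
            # Keep the link text and append the URL as literal text.
            return [rewrite_links(i) for i in inlines] + [
                {"t": "Space"},
                {"t": "Str", "c": f"({url})"},
            ]
        return {k: rewrite_links(v) for k, v in node.items()}
    if isinstance(node, list):
        out = []
        for item in node:
            new = rewrite_links(item)
            # A rewritten Link expands into several inlines; splice them in.
            if isinstance(item, dict) and item.get("t") == "Link":
                out.extend(new)
            else:
                out.append(new)
        return out
    return node

# Markdown -> JSON AST -> (rewrite) -> plain text
ast = json.loads(pypandoc.convert_file("page.md", to="json"))
ast["blocks"] = rewrite_links(ast["blocks"])
plain = pypandoc.convert_text(json.dumps(ast), to="plain", format="json",
                              extra_args=["--columns=666"])

Keeping the rewrite on the AST rather than on the rendered text would also leave the pipeline format-agnostic: the same rewritten AST could be written out to any other format pandoc supports.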

suppathak commented 1 year ago

Looks great Pep!

  • Do we want to use GitHub for storing raw data? These files aren't large, but we could think of alternatives.

+1

codificat commented 1 year ago

Can we add more on how to reproduce this data collection?

For the sample data in this PR, it was basically the for loop mentioned in the PR description: I just cloned the repos linked from the description and ran pandoc on the Markdown files within their ROSA content sections.

This is meant to be just a few sample files for initial/local/quick tests. Proper data collection belongs in #6.

Could we use the pandoc API to do this in the Python environment?

I have been looking at pandoc's plain text writer, and I am not convinced it's the best solution.

It was used for the data here because it was the most straightforward way, but for #6 I think there will be an intermediate format involved, and the final conversion might not use pandoc.
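That said, the shell loop above maps almost directly onto the pypandoc wrapper if we ever want to run the conversion in-process. A minimal sketch, assuming pypandoc and a pandoc binary are installed:

from pathlib import Path

import pypandoc

# Same conversion as the shell loop: Markdown -> plain text, wide columns.
for md_file in Path(".").glob("*.md"):
    pypandoc.convert_file(
        str(md_file),
        to="plain",
        outputfile=str(md_file.with_suffix(".txt")),
        extra_args=["--columns=666"],
    )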

Do we want to use GitHub for storing raw data? These files aren't large, but we could think of alternatives. @MichaelClifford, what would your opinion be on where to store this data?

What I had in mind for the "real data" (#6) was not to store it at all: I believe the dataset is small enough and the processing is quick enough that it can be retrieved fresh every time it is needed.
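As a sketch of that idea (the repo URLs below are placeholders rather than the actual sources, and fetch_and_convert is a hypothetical helper; assumes git and pypandoc are available):

import subprocess
import tempfile
from pathlib import Path

import pypandoc

# Placeholder URLs: the real list would be the repos linked from the description.
SOURCE_REPOS = [
    "https://github.com/example/mobb-ninja-docs",
    "https://github.com/example/rosa-workshop",
]

def fetch_and_convert(dest: Path) -> None:
    """Clone the source repos and regenerate the plain text dataset."""
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        for url in SOURCE_REPOS:
            checkout = Path(tmp) / url.rsplit("/", 1)[-1]
            subprocess.run(
                ["git", "clone", "--depth=1", url, str(checkout)],
                check=True,
            )
            for md in checkout.rglob("*.md"):
                pypandoc.convert_file(
                    str(md),
                    to="plain",
                    outputfile=str(dest / (md.stem + ".txt")),
                    extra_args=["--columns=666"],
                )

fetch_and_convert(Path("data/raw"))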

Again, the goal behind this PR is not to provide that "real data", but a stop-gap: a lightweight mini-dataset that can help with testing and development without having to process the source data (e.g. external repos or PDFs).

We can discard this if you prefer.

MichaelClifford commented 1 year ago

I'm generally against storing data in GitHub. But as @codificat mentioned, this is a small experimental dataset, and having it in the repo probably makes it the simplest to access and maintain. So I support keeping it here.