unified-doc / ideas


unified-doc-cli for programmatic manipulation of any file on the web #1

Open chrisrzhou opened 4 years ago

chrisrzhou commented 4 years ago

This idea will most likely be implemented in unified-doc-cli

Goals

The internet is, at its core, a collection of interlinked files. unified-doc aims to bridge working with different file types through unified document APIs. A CLI implemented in unified-doc-cli would let us programmatically crawl/curl through web files and perform various kinds of useful processing on them; the sections below sketch a few examples.

Config file

Maybe a .unirc.js file? This config provides the input for unified-doc: you can attach or override the default parsers, plugins, and search algorithms.

// default config
module.exports = {};  // just that!

// custom config
module.exports = {
  parsers: {
    docx: myDocxParser,
  },
  compiler: myCompiler,
  sanitizeSchema: mySanitizeSchema,
  searchAlgorithm: mySearchAlgorithm,
};

CLI wrapper around API methods

The entry point for the CLI should be either a URL or a local file path.

From this entry point, we can determine the content and filename accordingly.

CLI wrapper should intuitively wrap familiar API methods.

# output files (source, txt, html)
unified-doc https://some-webpage.html --file  # doc.file()
unified-doc https://some-webpage.html --file txt  # doc.file('.txt')
unified-doc https://some-webpage.html --file .html  # doc.file('.html')

# search file
unified-doc https://some-webpage.html --search 'spongebob'  --options ...  # doc.search('spongebob', options)

# text content
unified-doc https://some-webpage.html --text-content  # doc.textContent()

# parse hast
unified-doc https://some-webpage.html --parse  # doc.parse()
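The flag-to-method mapping above could be sketched as a small dispatcher. Here `doc` stands in for a unified-doc instance, and the flag names (`file`, `search`, `textContent`, `parse`) follow the examples above; everything else is an assumption for illustration:

```javascript
// Hypothetical sketch: map parsed CLI flags (minimist-style argv) onto the
// familiar unified-doc API methods shown in the examples above.
function runCommand(doc, argv) {
  if ('file' in argv) {
    // --file [extension] → doc.file() / doc.file('.txt') / doc.file('.html')
    const ext = argv.file === true
      ? undefined // bare --file: source file
      : argv.file.startsWith('.') ? argv.file : `.${argv.file}`;
    return doc.file(ext);
  }
  if ('search' in argv) {
    // --search 'query' [--options ...] → doc.search(query, options)
    return doc.search(argv.search, argv.options);
  }
  if (argv.textContent) {
    return doc.textContent();
  }
  if (argv.parse) {
    return doc.parse();
  }
  throw new Error('No recognized command flag');
}

module.exports = { runCommand };
```

Normalizing `txt` to `.txt` lets users omit the leading dot while still matching the `doc.file('.txt')` signature.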

Ideally, CLI commands should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the ideas:

unified-doc https://some-webpage.html --text-content > myfile.txt

# re-pipe search results back into the same file as annotations, then save the final HTML file (pseudocode)
unified-doc https://some-webpage.html --search 'spongebob' | unified-doc https://some-webpage.html --annotate - --file .html  # HTML file saved with annotations
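For pipeability, one common convention is to treat a `-` argument as "read content from stdin", so search results or content can flow between commands with ordinary shell pipes. A minimal sketch, assuming that convention and a hypothetical `resolveInput` helper (the `stdin.html` fallback filename is also an assumption):

```javascript
// Hypothetical sketch: decide whether CLI input comes from stdin ("-")
// or from a URL/path argument, following common CLI conventions.
function resolveInput(arg, stdinText) {
  if (arg === '-') {
    if (stdinText == null) {
      throw new Error('expected piped input on stdin');
    }
    // Content arrives via the pipe; a fallback filename tells unified-doc
    // which parser to apply (assumed default here).
    return { content: stdinText, filename: 'stdin.html' };
  }
  // Otherwise the argument is a URL/path; content and filename derive from it.
  return { target: arg };
}

module.exports = { resolveInput };
```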

Bulk processing

The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk process them, keeping track of errors and allowing a way to access processed files.
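The "keep track of errors" part of bulk processing could look something like the following sketch, where `processOne` stands in for whatever per-file work the CLI performs (fetch, parse, save); the result shape is an assumption for illustration:

```javascript
// Hypothetical sketch: process many targets concurrently, collecting
// successes and failures instead of failing fast on the first error.
async function processAll(targets, processOne) {
  const results = { ok: [], errors: [] };
  await Promise.all(targets.map(async target => {
    try {
      results.ok.push({ target, value: await processOne(target) });
    } catch (error) {
      results.errors.push({ target, error: String(error) });
    }
  }));
  return results;
}

module.exports = { processAll };
```

A real crawler would also want concurrency limits and retries, but the error-tracking shape is the important part: every target ends up in either `ok` or `errors`.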

chrisrzhou commented 4 years ago

This part of the project excites me the most, given its immediate value once implemented.

Unfortunately I have no experience writing CLI libraries. I'll be tackling this in the future as I ramp up my own knowledge, but any help or advice from the community is greatly appreciated here.