unified-doc / ideas


unified-doc-cli for programmatic manipulation of any file on the web #1

Open chrisrzhou opened 4 years ago

chrisrzhou commented 4 years ago

This idea will most likely be implemented in unified-doc-cli

Goals

The internet is, at its core, a collection of interlinked files. unified-doc aims to bridge working with different file types through unified document APIs. A CLI implemented in unified-doc-cli would let us programmatically crawl/curl through web files and perform various kinds of useful processing on them; the sections below sketch a few examples.

Config file

Maybe a .unirc.js file? This config provides the input for unified-doc: you can attach or override the default parsers, plugins, and search algorithms.

// default config
module.exports = {};  // just that!

// custom config
module.exports = {
  parsers: {
    docx: myDocxParser,
  },
  compiler: myCompiler,
  sanitizeSchema: mySanitizeSchema,
  searchAlgorithm: mySearchAlgorithm,
};

CLI wrapper around API methods

The entry point for the CLI should be either a URL or a local file path.

From this entry point, we can determine the content and filename accordingly.

CLI wrapper should intuitively wrap familiar API methods.

# output files (source, txt, html)
unified-doc https://some-webpage.html --file  # doc.file()
unified-doc https://some-webpage.html --file txt  # doc.file('.txt')
unified-doc https://some-webpage.html --file .html  # doc.file('.html')

# search file
unified-doc https://some-webpage.html --search 'spongebob'  --options ...  # doc.search('spongebob', options)

# text content
unified-doc https://some-webpage.html --text-content  # doc.textContent()

# parse hast
unified-doc https://some-webpage.html --parse  # doc.parse()
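The flag-to-method mapping above could be sketched as a small dispatcher. Here `doc` stands in for a unified-doc instance, and the flag names (`file`, `search`, `textContent`, `parse`) follow the examples above; everything else is an assumption for illustration:

```javascript
// Hypothetical sketch: map parsed CLI flags (minimist-style argv) onto the
// familiar unified-doc API methods shown in the examples above.
function runCommand(doc, argv) {
  if ('file' in argv) {
    // --file [extension] → doc.file() / doc.file('.txt') / doc.file('.html')
    const ext = argv.file === true
      ? undefined // bare --file: source file
      : argv.file.startsWith('.') ? argv.file : `.${argv.file}`;
    return doc.file(ext);
  }
  if ('search' in argv) {
    // --search 'query' [--options ...] → doc.search(query, options)
    return doc.search(argv.search, argv.options);
  }
  if (argv.textContent) {
    return doc.textContent();
  }
  if (argv.parse) {
    return doc.parse();
  }
  throw new Error('No recognized command flag');
}

module.exports = { runCommand };
```

Normalizing `txt` to `.txt` lets users omit the leading dot while still matching the `doc.file('.txt')` signature.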

Ideally, CLI commands should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the ideas:

unified-doc https://some-webpage.html --text-content > myfile.txt

# re-pipe search results back into the same file as annotations, then save the final HTML file (pseudocode)
unified-doc https://some-webpage.html --search 'spongebob' | unified-doc https://some-webpage.html --annotate - --file .html  # HTML file saved with annotations
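For pipeability, one common convention is to treat a `-` argument as "read content from stdin", so search results or content can flow between commands with ordinary shell pipes. A minimal sketch, assuming that convention and a hypothetical `resolveInput` helper (the `stdin.html` fallback filename is also an assumption):

```javascript
// Hypothetical sketch: decide whether CLI input comes from stdin ("-")
// or from a URL/path argument, following common CLI conventions.
function resolveInput(arg, stdinText) {
  if (arg === '-') {
    if (stdinText == null) {
      throw new Error('expected piped input on stdin');
    }
    // Content arrives via the pipe; a fallback filename tells unified-doc
    // which parser to apply (assumed default here).
    return { content: stdinText, filename: 'stdin.html' };
  }
  // Otherwise the argument is a URL/path; content and filename derive from it.
  return { target: arg };
}

module.exports = { resolveInput };
```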

Bulk processing

The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk process them, keeping track of errors and allowing a way to access processed files.
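The "keep track of errors" part of bulk processing could look something like the following sketch, where `processOne` stands in for whatever per-file work the CLI performs (fetch, parse, save); the result shape is an assumption for illustration:

```javascript
// Hypothetical sketch: process many targets concurrently, collecting
// successes and failures instead of failing fast on the first error.
async function processAll(targets, processOne) {
  const results = { ok: [], errors: [] };
  await Promise.all(targets.map(async target => {
    try {
      results.ok.push({ target, value: await processOne(target) });
    } catch (error) {
      results.errors.push({ target, error: String(error) });
    }
  }));
  return results;
}

module.exports = { processAll };
```

A real crawler would also want concurrency limits and retries, but the error-tracking shape is the important part: every target ends up in either `ok` or `errors`.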

chrisrzhou commented 4 years ago

This part of the project excites me the most, given its immediate value once implemented.

Unfortunately I have no experience writing CLI libraries. I'll be tackling this in the future as I ramp up my own knowledge, but any help or advice from the community is greatly appreciated here.