chrisrzhou opened 4 years ago
This part of the project excites me the most, given its immediate value once implemented.
Unfortunately, I have no experience writing CLI libraries. I will be tackling this in the future while ramping up my own knowledge, but any help/advice from the community is greatly appreciated here.
This idea will most likely be implemented in `unified-doc-cli`.
Goals
The internet is a connection of files.
`unified-doc` aims to bridge working with different files with unified document APIs. With a CLI implemented in `unified-doc-cli`, we will be able to programmatically crawl/curl through web files and perform various useful processing on them, e.g.:
- extract `textContent` (useful for NLP pipelines);
- parse files into `hast` and continue content processing with `hast` utilities in the `unified` ecosystem;
- support common file types (`.html`, `.txt`, and eventually `.pdf` and `.docx`
etc.).

Config file
Maybe a `.unirc.js` file? This config basically provides the input for `unified-doc`. You can attach/override default parsers/plugins/search algorithms.

CLI wrapper around API methods
The entry point for the CLI should be either:
From this entry point, we can determine the `content` and `filename` accordingly.

The CLI wrapper should intuitively wrap familiar API methods.
Ideally, the CLI APIs should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the ideas:
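For instance (pure pseudocode: the `unified-doc` subcommands and flags below do not exist yet and are only meant to illustrate piping):

```
# fetch a remote HTML file, reduce it to text content, and count words
curl https://example.com/article.html | unified-doc --text-content | wc -w

# read a local file, search its content, and save the results
unified-doc article.html --search "hello" > results.json
```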
Bulk processing
The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk-process them, keeping track of errors and providing a way to access the processed files.
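A rough sketch of what that bookkeeping could look like (the `processFile` argument is a stand-in for the actual `unified-doc` processing step, and expanding the glob pattern into the `urls` list would happen upstream):

```javascript
// Hypothetical bulk processor: run an async processing step over many URLs,
// collecting successful results and errors separately so neither is lost.
async function bulkProcess(urls, processFile) {
  const results = [];
  const errors = [];
  for (const url of urls) {
    try {
      results.push({ url, output: await processFile(url) });
    } catch (error) {
      errors.push({ url, message: error.message });
    }
  }
  return { results, errors };
}

module.exports = bulkProcess;
```

Persisting `results` to disk and reporting `errors` would sit on top of this loop.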