Shublang is a pluggable DSL that uses pipes to perform a linear series of transformations to extract data. It provides several Python built-ins, list and string methods, and JMESPath and XPath functions, and it allows the creation of custom functions.
Clear semantics make for code that is easily understood by non-technical business users and analysts. The grammar is also terse enough to be generated automatically through user-guided interactions.
Pure Python
Can be directly translated or compiled to JS, which lets developers debug pipes in the browser.
Encompasses Python built-ins, list and string methods, and cherry-picked functions from Cypher, JMESPath, XPath, etc.
Proof of concept built using pipe
Switch to using Lark, shlex, PLY, ANTLR, textX, pyparsing, or another parsing library.
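One common way to get `|` chaining in pure Python is to overload the reflected-or operator, `__ror__`. The sketch below is a minimal illustration of that idea and is not Shublang's actual implementation; the `Pipe` class and the three steps are hypothetical, with names borrowed from the function list in this README.

```python
import re


class Pipe:
    """Wrap a function so it can appear on the right-hand side of `|`."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *args, **kwargs):
        # Calling a step with arguments returns a new, configured step.
        return Pipe(self.func, *args, **kwargs)

    def __ror__(self, data):
        # `data | step` evaluates step.func(data, *args, **kwargs).
        return self.func(data, *self.args, **self.kwargs)


def _sub(items, pattern, repl):
    return [re.sub(pattern, repl, s) for s in items]


# Hypothetical steps; each takes the piped value as its first argument.
strip = Pipe(lambda items: [s.strip() for s in items])
sub = Pipe(_sub)
first = Pipe(lambda items: items[0])

result = ["  Water (aqua).  "] | strip | sub(r"\.$", "") | first
print(result)  # Water (aqua)
```

Because every step receives the previous result as its first argument, adding a custom function in this sketch only requires wrapping it in `Pipe`.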
Here is some typical Python code to transform data extracted from a page:
import re
values = map(clean_text, values)
values = filter(None, values)
text = ' '.join(values)
text = text[:-1] if text.endswith('.') else text
text = re.sub(r'(active|other|inactive)\s+ingredients?\s*:\s*', '', text, flags=re.IGNORECASE)
text = re.sub(r'\)\.\s*', '), ', text)
text = re.sub(r'\s+\(', ' (', text)
Here is the code in Shublang:
data
| sanitize
| sub(r'(Active|Other|Inactive)\s+Ingredients?\s*:\s*', '')
| sub(r'\)\.\s*', '), ')
| sub(r'\s+\(', ' (')
| sub(r'\.$', '')
| first
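For comparison, the plain-Python steps can be wrapped in a function and run on sample data. This is a sketch under stated assumptions: `clean_text` is not defined in this README, so a hypothetical whitespace-normalizing version is used here, and the sample values are made up.

```python
import re


def clean_text(value):
    # Hypothetical helper: collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", value).strip()


def normalize_ingredients(values):
    values = map(clean_text, values)
    values = filter(None, values)  # drop empty strings
    text = " ".join(values)
    text = re.sub(r"\.$", "", text)  # drop a single trailing period
    text = re.sub(
        r"(active|other|inactive)\s+ingredients?\s*:\s*",
        "",
        text,
        flags=re.IGNORECASE,
    )
    text = re.sub(r"\)\.\s*", "), ", text)  # ")." between entries becomes "), "
    text = re.sub(r"\s+\(", " (", text)  # single space before "("
    return text


sample = ["Active Ingredients:  Zinc Oxide  (20%).", "", "Inactive ingredient: Water."]
print(normalize_ingredients(sample))  # Zinc Oxide (20%), Water
```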
Pipes are useful for rewriting a fairly short linear sequence of operations.
Pipes should ideally be no longer than 10 steps. When a pipe exceeds this, create custom functions for the intermediate processing steps.
A pipe should transform a single primary object and return a single output. If multiple inputs or objects are being combined, do not use a pipe.
Pipes are linear, so expressing complex directed-graph relationships with them results in convoluted code.
As a general rule of thumb, think of Shublang pipes as the code that would be written inside Scrapy item loaders or pipelines.
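The advice about folding intermediate steps into a custom function can be illustrated in plain Python. The `compose` helper and the price-cleaning steps below are hypothetical; how such a function would be registered with Shublang is not covered in this README, so only the grouping idea is shown.

```python
import re
from functools import reduce


def compose(*steps):
    """Return one function that applies `steps` from left to right."""

    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)

    return run


def strip_currency(text):
    # Keep only digits, dots, and commas.
    return re.sub(r"[^\d.,]", "", text)


def drop_thousands(text):
    # Remove thousands separators such as the comma in "1,299.50".
    return text.replace(",", "")


# One named step stands in for three consecutive steps in a long pipe.
parse_price = compose(strip_currency, drop_thousands, float)

print(parse_price("USD 1,299.50"))  # 1299.5
```

Naming the grouped step keeps the pipe itself short and keeps each intermediate function testable on its own.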
Shublang aims to support the following language features and functions:
Logical and Execution Flow Constructs |
---|
skip_while |
take_while |
where |
select |
skip |
Predicate Functions |
---|
all |
any |
exists |
none |
Scalar Functions |
---|
length |
bool |
float |
int |
timestamp |
Aggregating Functions |
---|
avg |
max |
min |
sum |
aggregate |
groupby |
List Functions |
---|
range |
count |
reverse |
map |
filter |
sort |
slice |
chain |
chain_with |
tail |
first |
Mathematical Functions |
---|
abs |
ceil |
floor |
round |
String Functions |
---|
format |
join |
split |
find |
capitalize |
strip |
sub |
replace |
startswith |
endswith |
encode |
decode |
isdigit |
isdecimal |
rstrip |
lstrip |
re_search |
Temporal Functions |
---|
date_format |
HTML and JSON Functions |
---|
jmespath |
json_loads |
sanitize |
xpath_getall |
xpath_get |
css_getall |
css_get |
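The names `skip_while` and `take_while` suggest the behaviour of `itertools.dropwhile` and `itertools.takewhile` from the Python standard library. That mapping is an assumption, since the tables above describe planned features rather than documented semantics; the sketch below only shows what the two operations would do under that reading.

```python
from itertools import dropwhile, takewhile

rows = ["# header", "# units: mg", "zinc 20", "iron 5", "# footer"]


def is_comment(line):
    return line.startswith("#")


# skip_while reading: drop leading items while the predicate holds.
body = list(dropwhile(is_comment, rows))

# take_while reading: keep leading items until the predicate first fails.
data = list(takewhile(lambda line: not is_comment(line), body))

print(body)  # ['zinc 20', 'iron 5', '# footer']
print(data)  # ['zinc 20', 'iron 5']
```

Unlike a plain filter, both operations stop looking at the predicate after the first item that changes its result, which is why the trailing comment survives `dropwhile`.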
Shublang provides a command-line utility to verify an expression:
$ shublang 'add' '[1,2]'
3
$ shublang "xpath_get('//script[contains(., \"pidData\")]/text()') \
|re_search('\"pidData\":({.*}),\"msgs\":')|first|json_loads|jmespath('pid')|first" \
--url https://www.crocs.com/p/classic-clog/10001.html
10001
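To make the steps of the second command easier to follow, here is a rough standard-library approximation of what each stage does. It is a sketch, not Shublang's code: the `script_text` value is a made-up stand-in for the script contents that `xpath_get` would return, and a plain dictionary lookup replaces the `jmespath('pid')` step.

```python
import json
import re

# Hypothetical script text, standing in for the output of xpath_get.
script_text = (
    'window.page = {"pidData":{"pid":"10001","name":"Classic Clog"},"msgs":[]};'
)

# re_search(...) | first: capture the JSON object that follows "pidData".
match = re.search(r'"pidData":(\{.*\}),"msgs":', script_text)
raw_json = match.group(1)

# json_loads: parse the captured text into a Python dict.
pid_data = json.loads(raw_json)

# jmespath('pid'): for a top-level key this amounts to a dictionary lookup.
pid = pid_data["pid"]
print(pid)  # 10001
```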