add regression tests

scrapinghub / shublang

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data

BSD 3-Clause "New" or "Revised" License

This test does not need to be merged into master; it is here to expose some pain points in shublang's usage.

In particular we can see that:

The current sanitize functionality keeps empty strings in the iterable it returns. It should be updated to prune them; otherwise our test example evaluates to ['', '', '', 'price: $123,823.00', '']
We need to apply first twice: the 1st one transforms [('123823.00',)] into ('123823.00',) and the 2nd one transforms (123823.00,) into 123823.00.
The float functionality has to sit between the two first calls, since it only works on iterables.
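The three points above can be reproduced in plain Python. This is only a sketch: prune_empty, re_search_standin, first and to_float are hypothetical stand-ins written for illustration, not shublang's actual functions, and the input is assumed to have its thousands separator already removed so that the pattern yields '123823.00'.

```python
import re

def prune_empty(items):
    """Drop empty or whitespace-only strings from an iterable of strings."""
    return [s for s in items if s.strip()]

def re_search_standin(pattern, items):
    """Hypothetical stand-in: capture groups of the first item that matches."""
    for text in items:
        m = re.search(pattern, text)
        if m:
            return [m.groups()]
    return None

def first(iterable):
    """Return the first element of an iterable."""
    return next(iter(iterable))

def to_float(iterable):
    """Convert every element of an iterable to float (iterables only)."""
    return tuple(float(x) for x in iterable)

# Assumed sanitize output, with the comma already stripped from the price.
sanitized = ['', '', '', 'price: $123823.00', '']

pruned = prune_empty(sanitized)                          # empty strings removed
matches = re_search_standin(r"(\d+\.\d{2})", pruned)     # list holding one tuple
step1 = first(matches)                                   # unwrap the list
step2 = to_float(step1)                                  # float works on the tuple
price = first(step2)                                     # unwrap the tuple
```

Under these assumptions the chain needs first, then the float conversion, then first again, which is the verbosity described above.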

As we can see, we need to jump through a lot of hoops just to properly extract this type of data.

Ideally, we should have a way to extract the data very concisely, for example: re_search("(\d+\.\d{2})") | first_match
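One way to picture what such a shortcut could do is sketched below in plain Python. The name first_match_sketch and its behavior are assumptions made for illustration; first_match is only proposed in this discussion and does not describe an existing shublang function.

```python
import re

def first_match_sketch(pattern, items):
    """Hypothetical helper: return the first capture group of the first
    item that matches, or None when nothing matches."""
    for text in items:
        m = re.search(pattern, text)
        if m:
            # Fall back to the whole match if the pattern has no groups.
            return m.group(1) if m.groups() else m.group(0)
    return None

# Input assumed to have the thousands separator already removed.
value = first_match_sketch(r"(\d+\.\d{2})", ['', 'price: $123823.00', ''])
```

Returning the bare string rather than a list of tuples is what would remove the need for the two first calls; converting it to a number would then be a single scalar step.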

I agree that the pipes needed to extract the data in the example above should be simpler and more concise. Given the current grammar, using first twice makes sense from a logical point of view, but from the user's perspective it looks odd, or at least verbose.

Another thing I found while exploring this example is that first fails if re_search returns None. Likewise, applying float to an empty value raises an exception, because it expects an iterable. I think we can avoid breaking in these cases by evaluating the expressions inside a try/except (returning None if an exception is thrown), in the same way that the universal parser extractor does.
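A minimal sketch of that idea follows, assuming each pipe step is an ordinary Python callable. safe_evaluate, first and to_float are hypothetical names used only for illustration; how shublang actually evaluates expressions is not shown in this discussion.

```python
def safe_evaluate(func, *args):
    """Call func(*args) and return None instead of raising on failure."""
    try:
        return func(*args)
    except Exception:
        return None

def first(iterable):
    """Return the first element of an iterable."""
    return next(iter(iterable))

def to_float(iterable):
    """Convert every element of an iterable to float (iterables only)."""
    return tuple(float(x) for x in iterable)

# first(None) would raise TypeError; the wrapper turns that into None.
no_match = safe_evaluate(first, None)

# to_float(None) would also raise, since None is not iterable.
no_value = safe_evaluate(to_float, None)

# A normal input passes through unchanged.
ok = safe_evaluate(first, [('123823.00',)])
```

One trade-off of catching every exception is that genuine bugs in an expression are silenced too, so logging the exception before returning None may be worth considering.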

add regression tests #52