ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0
78 stars 20 forks source link

Initial validation #5

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

We should validate the input notebook:

idomic commented 2 years ago

On the second bullet. @edublancas who defines what's a few? Also what's invalid in it that we need an error? Imo the user and maybe it's a valid notebook for them? I assume you meant H2s since we don't support H1s currently. Unless it's a 300 lines notebook and there's no or maybe 1-2 headers and then it make sense to output something to the user, but then it's pretty similar to https://github.com/ploomber/soorgeon/issues/47

edublancas commented 2 years ago

I'd say:

I updated the original description

94rain commented 2 years ago

I am working on implementing the following logic. Just to confirm if my understanding is correct.

I will create _get_h1_header method, similar to _get_h2_header. But I think I will need to change the regex in _get_h2_header to ^\s*##\s+(.+) (add ^). Otherwise \s*##\s+(.+) will match ###.

And the regex in the new method _get_h1_header will be ^\s*#\s+(.+).

I will start to write tests for test_get_h2_header and test_get_h1_header.

idomic commented 2 years ago

Sounds about right, Only one H2 header will be dealt in a different PR. The function sounds good, just validate the regex actually matches to H1s (I tried here on python and seems it doesn't https://regex101.com/).

94rain commented 2 years ago

What's the input that fails the match? I tried the following cases and they all passed

@pytest.mark.parametrize('md, expected', [
    ['# Header', 'header'],
    ['# H1\n## H2', 'h1'],
    [' \t #   H1', 'h1'],
    ['  ##   H2', None],
    ['something', None],
    ['# Something\nignore me', 'something'],
])
def test_get_h1_header(md, expected):
    assert split._get_h1_header(md) == expected
idomic commented 2 years ago

^\s*##\s+(.+)

Is it working with this pattern? Also I tried it online and not locally with a py console, that might be failing due to the parser.

94rain commented 2 years ago

Yes, it is.

What's the input? I tried ## 1 as the input online and it worked.

idomic commented 2 years ago

Small comment @94rain did we cover the case where no markdowns are found at all? Didn't see any in the test cases.

94rain commented 2 years ago

Does test_find_breaks_error_if_no_h1_and_h2_headers count? Expected notebook to have at least one markdown H2 heading. (No H1 and H2 headers)

This does not check whether or not there are other markdown texts.

idomic commented 2 years ago

Not really since there's still markdown in the string you're checking, try having a sample with a json or plain-text

94rain commented 2 years ago

Not really since there's still markdown in the string you're checking, try having a sample with a json or plain-text

OK. Just additional tests? Or do we expect the error message to be different to Expected notebook to have at least one markdown H2 heading.?

idomic commented 2 years ago

Yes exactly!