trevorcampbell / website_diff


Bug in crawler? #20

Closed trevorcampbell closed 3 weeks ago

trevorcampbell commented 3 weeks ago

See the GH actions run here: https://github.com/joelostblom/viz-oer/actions/runs/10729466334/job/29756120813#step:9:179

I've asked @joelostblom to provide the old/new websites that caused that error, and will report back here once I figure out what the underlying issue is.

trevorcampbell commented 3 weeks ago

One can reproduce the bug by checking out

https://github.com/joelostblom/viz-oer/tree/bedef6d1707f274f88fa996338d64e4f3eb390cf

and running

website_diff --old . --new pull24/textbook/_book --diff diff24

The bug is due to an <a> tag with the form

<a aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" class="flex-grow-1" data-bs-target=".quarto-sidebar-collapse-item" data-bs-toggle="collapse" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }" role="navigation">
</a>

The tag has no href attribute, so the href lookup on line 46 of crawler.py returns None, and the code then fails when it tries to extract the anchor from that None value.

The solution is to skip any <a> tag whose href lookup returns None.
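
A minimal sketch of that guard, for illustration only. It uses the standard library's html.parser and urllib.parse rather than whatever crawler.py actually uses, and the names LinkCollector and collect_links are hypothetical, not taken from website_diff:

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (url, anchor) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # dict.get returns None when the tag has no href attribute
        href = dict(attrs).get("href")
        if href is None:
            # e.g. the Quarto sidebar toggle: nothing to crawl, so skip it
            return
        # Split "page.html#section" into the URL and its anchor
        url, anchor = urldefrag(href)
        self.links.append((url, anchor))


def collect_links(html):
    """Return (url, anchor) pairs for every <a> tag that has an href."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

With this check in place, an <a> tag like the sidebar toggle above is ignored instead of raising an error, while ordinary links are still split into a URL and an anchor.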