michardy / sources-bot

Bot that identifies other sources of news for a story.

The BBC's royal wedding coverage breaks everything #21

Closed by michardy 6 years ago

michardy commented 6 years ago

Do I fix it or assume it will go away?

Traceback (most recent call last):
  File "main.py", line 524, in <module>
    k.update()
  File "main.py", line 255, in update
    self.__isolate_content(links)
  File "main.py", line 240, in __isolate_content
    desc = annotate(desc)
  File "main.py", line 79, in annotate
    type=enums.Document.Type.PLAIN_TEXT
TypeError: <div class="gs-o-media-island"><div class="gs-o-responsive-image gs-o-responsive-image--16by9 gs-o-r has type Tag, but expected one of: bytes, unicode

Traceback (most recent call last):
  File "main.py", line 502, in <module>
    k.update()
  File "main.py", line 226, in update
    self.__isolate_content(links)
  File "main.py", line 210, in __isolate_content
    desc = nltk.word_tokenize(desc)
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 97, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1235, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1283, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1274, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1274, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1314, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "/home/michael/Code/sourcesbot/venv/lib/python3.6/site-packages/nltk/tokenize/punkt.py", line 1287, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
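
Both tracebacks end in a TypeError because `desc` is still a parsed element (a `Tag`) rather than a string when it reaches `annotate` and `nltk.word_tokenize`. Below is a minimal sketch of a guard that could run before either call. It assumes the parser's elements expose a `get_text()` method, as BeautifulSoup `Tag` objects do. `as_text` is a hypothetical helper, not something in main.py.

```python
def as_text(node):
    """Return normalized plain text for a str or parsed element, else None.

    Hypothetical helper (not part of main.py). It uses duck typing only:
    anything with a get_text() method is flattened to its text content.
    """
    if node is None:
        return None
    if isinstance(node, bytes):
        node = node.decode("utf-8", errors="replace")
    elif not isinstance(node, str):
        get_text = getattr(node, "get_text", None)
        if get_text is None:
            # Unknown type: report "no text" instead of raising later.
            return None
        node = get_text()
    # Collapse runs of whitespace; an element holding only an image yields "".
    text = " ".join(node.split())
    return text or None
```

A caller could then skip tokenizing and annotation whenever `as_text(desc)` returns `None`, instead of crashing the whole update.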

michardy commented 6 years ago

Normally, when parsing the BBC home page, we rely on a few assumptions about the markup. Namely:

Normal:

<!-- outer boundary of headline (never parsed) -->
<div class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m nw-p-default gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex" data-entityid="container-top-stories#4">
    <!-- Image data (never parsed) -->
    <div class="gs-c-promo-image gs-u-display-none gs-u-display-inline-block@xs gel-1/2@xs gel-1/1@m">
        <div class="gs-o-media-island">
            <div class="gs-o-responsive-image gs-o-responsive-image--16by9">
                <img src="https://ichef.bbci.co.uk/news/240/cpsprodpb/C06F/production/_101636294_whatsappimage2018-05-19at11.55.14am-1.jpg" class="lazyloaded" alt="Sabika Sheikh" data-src="https://ichef.bbci.co.uk/news/{width}/cpsprodpb/C06F/production/_101636294_whatsappimage2018-05-19at11.55.14am-1.jpg">
            </div>
        </div>
    </div>
    <!-- link, summary, and category links (we don't parse this level) -->
    <div class="gs-c-promo-body gel-1/2@xs gel-1/1@m gs-u-mt@m">
        <!-- Link and summary (we don't look for this but get it with .parent) -->
        <div>
            <!-- target we look for -->
            <a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/world-us-canada-44179973">
                <h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Pakistani student among Texas victims</h3>
            </a>
            <!-- next element, which we assume is text if it exists -->
            <p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">Sabika Sheikh and teacher Cynthia Tisdale are among the first named school shooting victims.</p>
        </div>
        <!-- ignore -->
        <ul class="gs-o-list-inline gs-o-list-inline--divided gel-brevier gs-u-mt-">
            <li class="nw-c-promo-meta"><span class="gs-c-timestamp gs-o-bullet gs-o-bullet- nw-c-timestamp">
                <span class="gs-o-bullet__icon gel-icon">
                    <svg viewBox="0 0 32 32" focusable="false"><polygon points="17,15.4 17,6 15,6 15,16.6 23.8,21.7 24.8,19.9"></polygon><path d="M16,4c6.6,0,12,5.4,12,12c0,6.6-5.4,12-12,12S4,22.6,4,16C4,9.4,9.4,4,16,4 M16,0C7.2,0,0,7.2,0,16c0,8.8,7.2,16,16,16 s16-7.2,16-16C32,7.2,24.8,0,16,0L16,0z"></path></svg>
                </span>
                <time class="gs-o-bullet__text date qa-status-date" datetime="2018-05-19T10:59:21.000Z" data-seconds="1526727561" data-datetime="12h">
                    <span aria-hidden="true" class="qa-status-date-output">12h</span>
                    <span class="gs-u-vh">12 hours ago</span>
                </time>
                <!-- Stray (Don't ask too many questions) -->
                </span>
            </li>
            <li class="nw-c-promo-meta">
                <a href="/news/world/us_and_canada" class="gs-c-section-link gs-c-section-link--truncate nw-c-section-link nw-o-link nw-o-link--no-visited-state" aria-label="From US &amp; Canada">
                    <span aria-hidden="true">US &amp; Canada</span>
                </a>
            </li>
        </ul>
    </div>
</div>

Breaks:

<!-- outer boundary of headline (never parsed, but reached with .parent) -->
<div class="gs-c-promo nw-c-promo nw-c-promo--maxim gel-layout gel-layout--no-flex gs-o-faux-block-link gs-u-pb gs-u-pb+@m" data-entityid="container-top-stories#1">
    <!-- target we look for -->
    <a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-layout__item gel-2/5@xxl gs-u-float-left-@xxl gs-u-mb gs-u-mb+@m gs-u-mb@xxl gel-canon-bold nw-o-link-split__anchor" href="/news/uk-44175216">
        <h3 class="gs-c-promo-heading__title gel-canon-bold nw-o-link-split__text">Prince Harry and Meghan married at Windsor</h3>
    </a>
    <!-- We assume this is the summary, since it is the second child of the parent -->
    <!-- Everything blows up -->
    <div class="gs-c-promo-image gel-1/1 gel-3/4@l gel-3/5@xxl gs-u-float-right@l">
        <div class="gs-o-media-island">
            <div class="gs-o-responsive-image gs-o-responsive-image--16by9 gs-o-responsive-image--lead">
                <img src="https://ichef.bbci.co.uk/news/320/cpsprodpb/1184A/production/_101645717_hi046925116.jpg" sizes="(min-width: 900px) 743px, (min-width: 900px) calc(66vw - 64px), calc(vw100 - 32px)" srcset="https://ichef.bbci.co.uk/news/240/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 240w, https://ichef.bbci.co.uk/news/380/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 380w, https://ichef.bbci.co.uk/news/420/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 420w, https://ichef.bbci.co.uk/news/490/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 490w, https://ichef.bbci.co.uk/news/573/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 573w, https://ichef.bbci.co.uk/news/743/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 743w, https://ichef.bbci.co.uk/news/820/cpsprodpb/1184A/production/_101645717_hi046925116.jpg 820w" alt="Prince Harry and Meghan leave for their reception" class="qa-srcset-image">
            </div>
        </div>
    </div>
    <!-- Actual parent of summary -->
    <div class="gs-c-promo-body gel-1/3@m gel-1/4@l gel-2/5@xxl">
        <!-- Actual summary -->
        <p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary gs-u-mt+@m gs-u-mt0@l">Hundreds of guests watched the couple exchange vows in a ceremony featuring a gospel choir and an American preacher.</p>
        <ul class="gs-o-list-inline gs-o-list-inline--divided gel-brevier gs-u-mt-">
            <li class="nw-c-promo-meta"><span class="gs-c-timestamp gs-o-bullet gs-o-bullet- nw-c-timestamp">
                <span class="gs-o-bullet__icon gel-icon">
                    <svg viewBox="0 0 32 32" focusable="false"><polygon points="17,15.4 17,6 15,6 15,16.6 23.8,21.7 24.8,19.9"></polygon><path d="M16,4c6.6,0,12,5.4,12,12c0,6.6-5.4,12-12,12S4,22.6,4,16C4,9.4,9.4,4,16,4 M16,0C7.2,0,0,7.2,0,16c0,8.8,7.2,16,16,16 s16-7.2,16-16C32,7.2,24.8,0,16,0L16,0z"></path></svg>
                </span>
                <time class="gs-o-bullet__text date qa-status-date" datetime="2018-05-19T19:39:31.000Z" data-seconds="1526758771" data-datetime="3h">
                    <span aria-hidden="true" class="qa-status-date-output">3h</span>
                    <span class="gs-u-vh">3 hours ago</span>
                </time>
                <!-- Stray (Don't ask too many questions) -->
                </span>
            </li>
            <li class="nw-c-promo-meta">
                <a href="/news/uk" class="gs-c-section-link gs-c-section-link--truncate nw-c-section-link nw-o-link nw-o-link--no-visited-state" aria-label="From UK">
                    <span aria-hidden="true">UK</span>
                </a>
            </li>
        </ul>
    </div>
    <!-- ~150 loc of drooling over wedding gowns -->
    <div class="nw-c-live-event-wrapper gel-layout__item gel-2/3@m gel-1/4@l gs-u-float-right@xxl gel-1/5@xxl">
        <div class="nw-c-live-event gs-o-faux-block-link gs-u-mt+ gs-t-news">
            <div class="gs-c-promo lx-c-dynamic-promo lx-c-dynamic-promo--secondary gs-o-faux-block-link gs-u-align-left gs-u-ml0 gs-t-news lx-c-dynamic-promo--has-commentary nw-p-default gs-u-mb gs-u-mb+@m gs-u-pt-alt gs-u-pb- gs-u-ph-alt gs-c-promo--flex" data-mode="secondary">
                <div class="gs-c-promo-body lx-c-dynamic-promo__body gs-u-p0">
                    <div class="lx-c-timeline gel-pica-bold">
                        <div class="gs-u-pb+ lx-c-timeline__item lx-c-timeline__item--first">
                            <div>
                                <a class="gel-pica-bold nw-o-link-split__anchor lx-c-dynamic-promo__link gs-u-display-block qa-promo-title" href="/news/live/uk-44167290">
                                    <span class="gs-c-live-pulse gs-o-bullet gs-o-bullet- gs-c-live-pulse--news lx-c-dynamic-promo__pulse gs-u-mr gel-1/1">
                                        <span class="gs-o-bullet__icon gs-c-live-pulse__icon gel-icon">
                                            <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><path d="M16 4c6.6 0 12 5.4 12 12s-5.4 12-12 12S4 22.6 4 16 9.4 4 16 4zm0-4C7.2 0 0 7.2 0 16s7.2 16 16 16 16-7.2 16-16S24.8 0 16 0z"></path></svg>
                                            <span class="gs-c-live-pulse__icon-center">
                                                <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle cx="16" cy="16" r="8.5"></circle></svg>
                                            </span>
                                        </span>
                                        <span class="gs-o-bullet__text qa-live-pulse-text">Live</span>
                                    </span>
                                    <h3 class="gel-pica-bold nw-o-link-split__text lx-c-dynamic-promo__title">Couple cap happy day with private party</h3>
                                    <span class="gs-u-vh">Last updated 19 minutes ago</span>
                                </a>
                            </div>
                        </div>
                        <h4 class="gs-u-vh qa-timeline-hidden-heading">Most recent posts</h4>
                        <ol class="lx-c-timeline__list">
                            <li id="lx-c-timeline__item--0" class="lx-c-timeline__item qa-timeline-item gs-o-media gs-u-pb gs-u-pb-alt@l">
                                <div class="gs-u-mr- gs-o-media__img lx-c-timeline__keypoint">
                                    <div class="lx-c-timeline__keypoint-icon">
                                        <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle stroke="none" cx="16" cy="16" r="11"></circle></svg>
                                    </div>
                                </div>
                                <div class="gs-o-media__body gel-long-primer lx-c-timeline__body">
                                    <span class="gs-u-vh qa-promo-item-heading">19 minutes ago 'Love recognises no barriers'</span>
                                    <time aria-hidden="true" class="lx-c-timeline__heading-timestamp gs-u-mr">19m</time>
                                    <span aria-hidden="true" class="lx-c-timeline__heading-text qa-item-heading-text">'Love recognises no barriers'</span>
                                </div>
                            </li>
                            <li id="lx-c-timeline__item--1" class="lx-c-timeline__item qa-timeline-item gs-o-media gs-u-pb gs-u-pb-alt@l">
                                <div class="gs-u-mr- gs-o-media__img lx-c-timeline__keypoint">
                                    <div class="lx-c-timeline__keypoint-icon">
                                        <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle stroke="none" cx="16" cy="16" r="11"></circle></svg>
                                    </div>
                                </div>
                                <div class="gs-o-media__body gel-long-primer lx-c-timeline__body">
                                    <span class="gs-u-vh qa-promo-item-heading">27 minutes ago Royal fireworks</span>
                                    <time aria-hidden="true" class="lx-c-timeline__heading-timestamp gs-u-mr">27m</time>
                                    <span aria-hidden="true" class="lx-c-timeline__heading-text qa-item-heading-text">Royal fireworks</span>
                                </div>
                            </li>
                        </ol>
                    </div>
                </div>
                <a href="/news/live/uk-44167290" tabindex="-1" aria-hidden="true" class="qa-overlay gs-o-faux-block-link__overlay lx-c-dynamic-promo__link">Live Couple cap happy day with private party Last updated 19 minutes ago</a>
            </div>
        </div>
    </div>
    <div class="nw-c-index-alsos--maximum gel-layout__item gel-1/4@l gs-u-pt gs-u-pt-alt@xs gel-1/1@m gel-1/5@xxl">
        <div>
            <h4 class="gs-u-vh">Related content</h4>
            <ul class="gel-layout gel-layout--no-flex">
                <li class="nw-c-related-story nw-c-related-story--1 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/uk-44181399" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="gs-c-media-indicator gel-brevier-bold gs-c-media-indicator--inline">
                                    <span class="gs-c-media-indicator__icon gel-icon" data-icon="gel-icon-video">
                                        <span class="qa-offscreen gs-u-vh">Video</span>
                                        <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><polygon points="3,32 29,16 3,0"></polygon></svg>
                                    </span>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">Harry and Meghan: The kiss</span>
                            </span>
                        </a>
                    </span>
                </li>
                <li class="nw-c-related-story nw-c-related-story--2 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/entertainment-arts-44180613" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="nw-c-circle">
                                    <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle cx="12" cy="21" r="7"></circle></svg>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">In pictures: The celebrity guests</span>
                            </span>
                        </a>
                    </span>
                </li>
                <li class="nw-c-related-story nw-c-related-story--3 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/uk-44184151" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="gs-c-media-indicator gel-brevier-bold gs-c-media-indicator--inline">
                                    <span class="gs-c-media-indicator__icon gel-icon" data-icon="gel-icon-video">
                                        <span class="qa-offscreen gs-u-vh">Video</span>
                                        <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><polygon points="3,32 29,16 3,0"></polygon></svg>
                                    </span>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">Carriage procession a 'fairytale'</span>
                            </span>
                        </a>
                    </span>
                </li>
                <li class="nw-c-related-story nw-c-related-story--4 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/uk-44184331" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="nw-c-circle">
                                    <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle cx="12" cy="21" r="7"></circle></svg>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">#Blackroyalwedding hailed </span>
                            </span>
                        </a>
                    </span>
                </li>
                <li class="nw-c-related-story nw-c-related-story--5 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/uk-44182166" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="nw-c-circle">
                                    <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle cx="12" cy="21" r="7"></circle></svg>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">Five moments from the wedding </span>
                            </span>
                        </a>
                    </span>
                </li>
                <li class="nw-c-related-story nw-c-related-story--6 gel-1/2@s gel-1/1@l gel-1/3@m gs-u-float-left@s gs-u-float-none@l">
                    <span class="nw-o-bullet+ gel-brevier-bold">
                        <a href="/news/uk-44184034" class="gel-layout__item nw-o-link-split__anchor gs-u-pt- gs-u-pb- gs-u-display-block">
                            <span class="nw-o-bullet__icon">
                                <span class="nw-c-circle">
                                    <svg aria-hidden="true" viewBox="0 0 32 32" focusable="false"><circle cx="12" cy="21" r="7"></circle></svg>
                                </span>
                            </span>
                            <span class="nw-o-bullet__text">
                                <span class="nw-o-link-split__text gs-u-align-bottom">The bridesmaids and pageboys</span>
                            </span>
                        </a>
                    </span>
                </li>
            </ul>
        </div>
    </div>
</div>
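
The breakage above comes from picking the summary by position (the element after the headline link) rather than by what it is. Below is a minimal sketch of a class-based lookup instead. It uses only the standard library's `html.parser`, which is an assumption, since the bot's actual parsing code is not shown in this issue. `PromoParser` and `extract_promos` are hypothetical names; the class names `gs-c-promo-heading` and `gs-c-promo-summary` are taken from the markup above.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect depth tracking.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class PromoParser(HTMLParser):
    """Collect headline/summary pairs, matching both by class name."""

    def __init__(self):
        super().__init__()
        self.promos = []      # one dict per headline link found
        self._current = None  # promo currently being filled in
        self._field = None    # "headline" or "summary" while capturing text
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: only track nesting.
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "gs-c-promo-heading" in classes:
            self._current = {"headline": "", "summary": None}
            self.promos.append(self._current)
            self._field, self._depth = "headline", 1
        elif (tag == "p" and "gs-c-promo-summary" in classes
              and self._current is not None
              and self._current["summary"] is None):
            # First summary paragraph after the most recent headline.
            self._current["summary"] = ""
            self._field, self._depth = "summary", 1

    def handle_endtag(self, tag):
        if self._field is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data


def extract_promos(html):
    """Return a list of {"headline": str, "summary": str or None} dicts."""
    parser = PromoParser()
    parser.feed(html)
    parser.close()
    return [
        {key: (" ".join(value.split()) if value is not None else None)
         for key, value in promo.items()}
        for promo in parser.promos
    ]
```

With either layout above, this should pair each headline with the text of its `gs-c-promo-summary` paragraph, wherever that paragraph sits in the promo. A promo with no summary paragraph gets `summary` set to `None`, so the caller can skip annotation rather than hand an image container to the tokenizer. One limitation of the sketch: a summary is attached to the most recent headline link, so it relies on the headline preceding its summary in document order.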