rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Does not work if the HTML has a self-closing iframe tag #60

Open hyeoni369 opened 2 years ago

hyeoni369 commented 2 years ago

Hi Selectolax!

I've been using selectolax and find it very useful.

I've found a problem.

When the HTML contains a self-closing iframe tag (<iframe ... />), selectolax cannot find any elements that come after it.

Here is some example code. Please check it :)

from selectolax.parser import HTMLParser

html_with_self_closing = '''
<html
  lang="en-us"
  data-reactroot=""
  class="js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows"
>
  <body
    class="kotaku blog-group-kotaku blog-recirc-group-fmgNonSatire permalink en-US f_ad_after_first_in_featured_permalinks_on f_ad_lightning_tag_on f_ad_refresh_enabled_on f_ad_script_in_head_on f_ad_timeout_amazon_on f_ad_timeout_failsafe_on f_ad_timeout_prebid_on f_ad_top_banner_to_featured_permalinks_on f_ads_viewability_desktop_on f_ads_viewability_mobile_on f_ads_viewability_pixels_offset_on f_ads_viewport_offset_on f_alerts_sidebar_on f_allow_blips_on f_amazon_aps_tag_on f_amazon_wait_for_bids_on f_amp_disable_after_2w_on f_amp_publisher_logo_on f_amp_slideshow_on f_amp_sticky_ad_on f_amp_video_extra_events_on f_analyticstracking_on f_author_page_canonical_on f_blip_show_first_sentence_on f_breadcrumbs_use_schemaorg_on f_channel_section_on f_chartbeat_video_on f_client_sidebar_blocks_on f_cls_on f_cls_mobile_inpost_on f_collapsed_top_ad_on f_comment_nofollow_on f_commenter_no_crawlable_on f_connatix_on f_connatix_mobile_on f_crop_modal_align_on f_curated_homepage_on f_ddrum_on f_disable_comment_meta_links_on f_disable_lazy_mfn_on f_disable_lazy_ymal_on f_disable_link_rendering_in_comments_on f_disguise_video_links_as_videos_on f_dynamic_ads_in_ads_bundle_on f_eager_load_ga_on f_editor_unload_3rdparty_on f_enable_bouncex_on f_enable_html_sitemap_on f_expanded_image_srcset_on f_fb_pixel_disable_on f_featured_ads_four_on f_filter_kinja_meta_on f_fivecardcarousel_on f_force_image_rights_on f_frontendtiming_on f_frontpage_recentvideo_on f_frontpage_sticky_leaderboard_on f_global_video_page_on f_goauthorurl_on f_header_anchor_tags_on f_header_simple_render_on f_hide_ellipsis_on f_hide_sticky_social_on f_homepage_layout_admin_only_on f_homepage_sticky_ads_4s_on f_homepage_video_playlist_on f_hp_smaller_images_on f_infinite_promotion_on f_infinite_scroll_on f_ix_identity_tag_on f_kargo_amp_on f_lazyload_iframes_on f_lazyload_twitter_iframe_on f_lazyload_youtube_iframe_on f_legacy_embiggen_on f_magma_permalink_video_truncation_on f_magnite_segments_on 
f_medianet_headerbidding_amp_on f_medianet_preload_on f_merge_price_vendor_on f_meta_first_on f_missing_image_alts_on f_mobile_comments_scroll_fix_on f_movable_ads_tool_shift_fix_on f_newsletter_inline_form_enabled_on f_newsletter_modal_subdomain_on f_newsletter_popup_exit_intent_on f_newsletter_popup_exit_intent_mobile_on f_no_follow_comment_links_on f_permalink_video_playlist_on f_prebid_on f_prebid_analytics_on f_prebid_autoconfig_on f_prebid_ias_enable_on f_prebid_resetdigital_on f_prebid_trustx_on f_prebid_video_on f_primary_header_flat_on f_primary_header_h1_on f_pure_save_button_on f_rail_video_playlist_on f_rail_video_stickiness_1500_on f_refresh_25_seconds_on f_refresh_ads_in_view_on f_related_stories_inset_on f_remove_h_tags_from_sidebar_on f_remove_sticky_h1_on f_restore_images_on f_section_nav_ga_events_on f_seo_content_first_on f_seo_iframe_noindex_on f_seo_noimageindex_on f_seo_remove_headline_link_on f_short_whitelisted_check_on f_show_splashy_top_on f_sidebar_ad_whitespace_on f_sidebar_remove_native_promo_on f_slideshow_on f_smartcrop_on f_sourcepoint_ccpa_on f_sourcepoint_header_on f_sourcepoint_keyval_on f_speedcurve_lux_on f_sticky_mobile_320_on f_sticky_mobile_320_bulbs_on f_sticky_right1_ad_on f_sticky_video_first_slot_mobile_on f_su_manage_blog_dropdown_on f_taboola_feed_homepage_on f_taboola_lazy_load_on f_tag_noindex_nofollow_on f_taxonomy_on f_trackonomics_amp_on f_truncate_permalink_content_on f_us_only_superhero_on f_use_ad_manager_on f_veritas_compression_on f_veritas_tracker_on f_video_autoplay_analytics_on f_video_hydration_lazyload_on f_video_lazy_load_delay_on f_video_permalink_play_next_on f_video_thumbnail_fix_on f_videos_filter_with_posts_on f_webm_optimize_on f_welcome_ad_analytics_on blog-group-kotaku"
  >
    <noscript>
      <iframe
        crossOrigin="true"
        src="https://www.googletagmanager.com/ns.html?id=GTM-TH42LHK"
        height="0"
        width="0"
        style="display: none; visibility: hidden"
      />
    </noscript>
    <div id="trackers" data-gtm-vis-polling-id-49090422_48="50">
      <script
        type="text/javascript"
        src="https://static.scroll.com/js/scroll.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="//static.chartbeat.com/js/chartbeat.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="https://kinja-com.videoplayerhub.com/gallery.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="https://sb.scorecardresearch.com/beacon.js"
        async="async"
      ></script>
    </div>
  </body>
</html>
'''

html_without_self_closing = '''
<html
  lang="en-us"
  data-reactroot=""
  class="js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows js no-touch localstorage xhr2 no-ipad no-iphone no-ipod no-appleios no-android no-ioswebview no-facebookapp windows"
>
  <body
    class="kotaku blog-group-kotaku blog-recirc-group-fmgNonSatire permalink en-US f_ad_after_first_in_featured_permalinks_on f_ad_lightning_tag_on f_ad_refresh_enabled_on f_ad_script_in_head_on f_ad_timeout_amazon_on f_ad_timeout_failsafe_on f_ad_timeout_prebid_on f_ad_top_banner_to_featured_permalinks_on f_ads_viewability_desktop_on f_ads_viewability_mobile_on f_ads_viewability_pixels_offset_on f_ads_viewport_offset_on f_alerts_sidebar_on f_allow_blips_on f_amazon_aps_tag_on f_amazon_wait_for_bids_on f_amp_disable_after_2w_on f_amp_publisher_logo_on f_amp_slideshow_on f_amp_sticky_ad_on f_amp_video_extra_events_on f_analyticstracking_on f_author_page_canonical_on f_blip_show_first_sentence_on f_breadcrumbs_use_schemaorg_on f_channel_section_on f_chartbeat_video_on f_client_sidebar_blocks_on f_cls_on f_cls_mobile_inpost_on f_collapsed_top_ad_on f_comment_nofollow_on f_commenter_no_crawlable_on f_connatix_on f_connatix_mobile_on f_crop_modal_align_on f_curated_homepage_on f_ddrum_on f_disable_comment_meta_links_on f_disable_lazy_mfn_on f_disable_lazy_ymal_on f_disable_link_rendering_in_comments_on f_disguise_video_links_as_videos_on f_dynamic_ads_in_ads_bundle_on f_eager_load_ga_on f_editor_unload_3rdparty_on f_enable_bouncex_on f_enable_html_sitemap_on f_expanded_image_srcset_on f_fb_pixel_disable_on f_featured_ads_four_on f_filter_kinja_meta_on f_fivecardcarousel_on f_force_image_rights_on f_frontendtiming_on f_frontpage_recentvideo_on f_frontpage_sticky_leaderboard_on f_global_video_page_on f_goauthorurl_on f_header_anchor_tags_on f_header_simple_render_on f_hide_ellipsis_on f_hide_sticky_social_on f_homepage_layout_admin_only_on f_homepage_sticky_ads_4s_on f_homepage_video_playlist_on f_hp_smaller_images_on f_infinite_promotion_on f_infinite_scroll_on f_ix_identity_tag_on f_kargo_amp_on f_lazyload_iframes_on f_lazyload_twitter_iframe_on f_lazyload_youtube_iframe_on f_legacy_embiggen_on f_magma_permalink_video_truncation_on f_magnite_segments_on 
f_medianet_headerbidding_amp_on f_medianet_preload_on f_merge_price_vendor_on f_meta_first_on f_missing_image_alts_on f_mobile_comments_scroll_fix_on f_movable_ads_tool_shift_fix_on f_newsletter_inline_form_enabled_on f_newsletter_modal_subdomain_on f_newsletter_popup_exit_intent_on f_newsletter_popup_exit_intent_mobile_on f_no_follow_comment_links_on f_permalink_video_playlist_on f_prebid_on f_prebid_analytics_on f_prebid_autoconfig_on f_prebid_ias_enable_on f_prebid_resetdigital_on f_prebid_trustx_on f_prebid_video_on f_primary_header_flat_on f_primary_header_h1_on f_pure_save_button_on f_rail_video_playlist_on f_rail_video_stickiness_1500_on f_refresh_25_seconds_on f_refresh_ads_in_view_on f_related_stories_inset_on f_remove_h_tags_from_sidebar_on f_remove_sticky_h1_on f_restore_images_on f_section_nav_ga_events_on f_seo_content_first_on f_seo_iframe_noindex_on f_seo_noimageindex_on f_seo_remove_headline_link_on f_short_whitelisted_check_on f_show_splashy_top_on f_sidebar_ad_whitespace_on f_sidebar_remove_native_promo_on f_slideshow_on f_smartcrop_on f_sourcepoint_ccpa_on f_sourcepoint_header_on f_sourcepoint_keyval_on f_speedcurve_lux_on f_sticky_mobile_320_on f_sticky_mobile_320_bulbs_on f_sticky_right1_ad_on f_sticky_video_first_slot_mobile_on f_su_manage_blog_dropdown_on f_taboola_feed_homepage_on f_taboola_lazy_load_on f_tag_noindex_nofollow_on f_taxonomy_on f_trackonomics_amp_on f_truncate_permalink_content_on f_us_only_superhero_on f_use_ad_manager_on f_veritas_compression_on f_veritas_tracker_on f_video_autoplay_analytics_on f_video_hydration_lazyload_on f_video_lazy_load_delay_on f_video_permalink_play_next_on f_video_thumbnail_fix_on f_videos_filter_with_posts_on f_webm_optimize_on f_welcome_ad_analytics_on blog-group-kotaku"
  >
    <noscript>
      <iframe
        crossOrigin="true"
        src="https://www.googletagmanager.com/ns.html?id=GTM-TH42LHK"
        height="0"
        width="0"
        style="display: none; visibility: hidden"
      ></iframe>
    </noscript>
    <div id="trackers" data-gtm-vis-polling-id-49090422_48="50">
      <script
        type="text/javascript"
        src="https://static.scroll.com/js/scroll.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="//static.chartbeat.com/js/chartbeat.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="https://kinja-com.videoplayerhub.com/gallery.js"
        async="async"
      ></script>
      <script
        type="text/javascript"
        src="https://sb.scorecardresearch.com/beacon.js"
        async="async"
      ></script>
    </div>
  </body>
</html>
'''

tree_1 = HTMLParser(html_with_self_closing)
print('html_with_self_closing', tree_1.css('body > div'))
print('html_with_self_closing', tree_1.css('#trackers'))

tree_2 = HTMLParser(html_without_self_closing)
print('html_without_self_closing', tree_2.css('body > div'))
print('html_without_self_closing', tree_2.css('#trackers'))

If you run this, the parser finds the div (#trackers) only when there is no self-closing iframe tag.

Thanks

rushter commented 2 years ago

This is a known issue on the Modest side: https://github.com/lexborisov/Modest/issues/86
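
A possible workaround, sketched here as an assumption rather than an official fix: rewrite self-closing iframe tags into an explicit open/close pair before handing the markup to the parser. In HTML, iframe is not a void element, so the trailing slash does not close it and a parser may treat everything after it as the iframe's text content. The helper name `expand_self_closing_iframes` is hypothetical, and the regular expression assumes no attribute value contains a literal `>` and that attribute values are quoted.

```python
import re

# Matches <iframe ... /> where the attributes may span several lines.
# Assumption: attribute values are quoted and contain no '>' character.
_SELF_CLOSING_IFRAME = re.compile(r"<iframe\b([^>]*?)\s*/>", re.IGNORECASE)


def expand_self_closing_iframes(html: str) -> str:
    """Rewrite <iframe ... /> as <iframe ...></iframe> (hypothetical helper)."""
    return _SELF_CLOSING_IFRAME.sub(r"<iframe\1></iframe>", html)


sample = (
    '<noscript><iframe src="https://example.com/ns.html"\n'
    '  height="0"\n'
    '/></noscript><div id="trackers"></div>'
)
fixed = expand_self_closing_iframes(sample)
print(fixed)
```

With the example from this issue, the result of `expand_self_closing_iframes(html_with_self_closing)` could then be passed to `HTMLParser` in place of the raw string. After the rewrite the markup should differ from `html_without_self_closing` only in whitespace, so the `#trackers` lookup would be expected to behave like the second case.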