rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Bug Found within Lexbor #88

Closed AmericanY closed 1 year ago

AmericanY commented 1 year ago
from bs4 import BeautifulSoup
from selectolax.lexbor import LexborHTMLParser

x = '''
<table class="ct-data_table tr-data_table" style="margin:auto;width:80%;" border="1">
   <tbody>
      <tr style="text-align:left;">
         <th class="ct-header3 tr-pale_banner_color">
            <span style="display:inline;" class="term" data-term="condition/disease" title="Click to define" tabindex="0">Condition or disease <i class="fa fa-info-circle term" aria-hidden="true" data-term="condition/disease" style="border-bottom-style:none;" title="Click to define" tabindex="0"></i></span>       
         </th>
         <th class="ct-header3 tr-pale_banner_color">
            <span style="display:inline;" class="term" data-term="intervention/treatment" title="Click to define" tabindex="0">Intervention/treatment <i class="fa fa-info-circle term" aria-hidden="true" data-term="intervention/treatment" style="border-bottom-style:none;" title="Click to define" tabindex="0"></i></span>       
         </th>
         <th class="ct-header3 tr-pale_banner_color">
            <span style="display:inline;" class="term" data-term="phase" title="Click to define" tabindex="0">Phase <i class="fa fa-info-circle term" aria-hidden="true" data-term="phase" style="border-bottom-style:none;" title="Click to define" tabindex="0"></i></span>       
         </th>
      </tr>
      <tr style="text-align:left;vertical-align:top;">
         <td class="ct-body3">
            <span style="display:block;margin-bottom:1ex;">Diabetes Mellitus</span>
         </td>
         <td class="ct-body3">
            <span style="display:block;margin-bottom:1ex;">Dietary Supplement: Nutren Diabetes</span>
            <span style="display:block;margin-bottom:1ex;">Dietary Supplement: Fresubin Diabetes</span>
         </td>
         <td class="ct-body3" style="white-space:nowrap;">
            <span style="display:block;margin-bottom:1ex;">Not Applicable</span>
         </td>
      </tr>
   </tbody>
</table>
'''

# Same selector for both libraries: the second cell of the data row.
selector = '.ct-data_table.tr-data_table[border="1"] td.ct-body3:nth-child(2)'

# BeautifulSoup with the lxml parser
soup1 = BeautifulSoup(x, 'lxml')
print(soup1.select_one(selector).get_text(strip=True, separator=', '))

# selectolax with the Lexbor backend
soup2 = LexborHTMLParser(x)
print(soup2.css_first(selector).text(strip=True, separator=', '))

Output:

Bs4:

Dietary Supplement: Nutren Diabetes, Dietary Supplement: Fresubin Diabetes

Lexbor:

, Dietary Supplement: Nutren Diabetes, , Dietary Supplement: Fresubin Diabetes,

rushter commented 1 year ago

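I don't view it as a bug; it's an implementation detail. You have newlines between your tags, and those newlines are treated as text nodes. Since you use strip=True, each of them is stripped down to an empty string, and those empty strings are still joined with the separator, which is where the extra commas come from.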
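
A rough model of that behaviour in plain Python may make the extra commas easier to see. It does not call selectolax at all: the `text_nodes` list is a hand-written assumption of what the text nodes inside the second cell look like, and the strip-then-join step is only an approximation of what the output above suggests happens.

```python
# Hypothetical list of the text nodes inside the second <td>, in document order.
# The whitespace-only entries stand for the newlines and indentation between tags.
text_nodes = [
    "\n            ",
    "Dietary Supplement: Nutren Diabetes",
    "\n            ",
    "Dietary Supplement: Fresubin Diabetes",
    "\n         ",
]

# Strip every node, but keep the ones that became empty.
stripped = [node.strip() for node in text_nodes]
joined = ", ".join(stripped)
print(repr(joined))

# Dropping the empty strings before joining gives the shorter result.
filtered = ", ".join(node for node in stripped if node)
print(repr(filtered))
```

The first `print` reproduces the shape of the Lexbor output (a leading comma, a doubled comma, and a trailing one); the second matches the BeautifulSoup output.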

Just imagine that there is also text:

text
<span style="display:block;margin-bottom:1ex;">Dietary Supplement: Nutren Diabetes</span>
TEXT  <span style="display:block;margin-bottom:1ex;">Dietary Supplement: Fresubin Diabetes</span>
text

We don't want to lose it, and I don't handle newlines as a special case. If you need to extract text from the spans, it's better to iterate over each span and extract its text.

AmericanY commented 1 year ago

@rushter Got it. Thank you.