michal-h21 / LuaXML

Fork of LuaXML (originally Paul Chakravarti)

Function to get all text nodes as a table #6

ssimo3lsuhsc closed this issue 1 month ago

ssimo3lsuhsc commented 1 year ago

Can I please request a function to succinctly get all text node descendants of a given element as a table and not as a string?

I have a bunch of nodes like the following, which are td elements with runs of text separated by br tags. I want to isolate each text run and perform some function on it alone.

<td aria-label="Pick a Session: Products : " tabindex="0">
08/01/2023 8:00 AM-9:00 AM Breakfast, Room 335 (Amount: 0.00 USD)<br>08/01/2023 9:00 AM-10:00 AM Welcome, Room 231 (Amount: 0.00 USD)
<br>
08/01/2023 2:00 PM-4:00 PM Breakout Session 2 (Amount: 0.00 USD, Pick a Session: EHS 101, Room 231)
<br>
08/02/2023 8:00 AM-10:00 AM Breakout Session 3 (Amount: 0.00 USD, Pick a Session: Literacy in the Classroom, Early Learning Center, 1st Floor)
<br>
08/02/2023 10:30 AM-12:30 PM Breakout Session 4 (Amount: 0.00 USD, Pick a Session: Learning Genie, Room 133)
<br>
08/02/2023 2:00 PM-4:00 PM Awards Reception, Room 335 (Amount: 0.00 USD)
<br>
Total: 0.00 USD
</td>

I'm coming to Lua from Python, and BeautifulSoup has had this feature for a long time now.

michal-h21 commented 1 year ago

Do you want only the immediate child text nodes, or also the text from child elements? For example, the text nodes that are direct children of a given element can be retrieved like this:

kpse.set_program_name "luatex"
local x = [[
<td aria-label="Pick a Session: Products : " tabindex="0">
08/01/2023 8:00 AM-9:00 AM Breakfast, Room 335 (Amount: 0.00 USD)<br>08/01/2023 9:00 AM-10:00 AM Welcome, Room 231 (Amount: 0.00 USD)
<br>
08/01/2023 2:00 PM-4:00 PM Breakout Session 2 (Amount: 0.00 USD, Pick a Session: EHS 101, Room 231)
<br>
08/02/2023 8:00 AM-10:00 AM Breakout Session 3 (Amount: 0.00 USD, Pick a Session: Literacy in the Classroom, Early Learning Center, 1st Floor)
<br>
08/02/2023 10:30 AM-12:30 PM Breakout Session 4 (Amount: 0.00 USD, Pick a Session: Learning Genie, Room 133)
<br>
08/02/2023 2:00 PM-4:00 PM Awards Reception, Room 335 (Amount: 0.00 USD)
<br>
Total: 0.00 USD
</td>
]]

local domobject = require "luaxml-domobject"

-- collect the text nodes that are direct children of el
local function element_texts(el)
  local t = {}
  for _, child in ipairs(el:get_children()) do
    if child:is_text() then
      t[#t+1] = child:get_text()
    end
  end
  return t
end

local dom = domobject.parse(x)
for _, td in ipairs(dom:query_selector("td")) do
  local texts = element_texts(td)
  print(table.concat(texts, ";"))
end

ssimo3lsuhsc commented 1 year ago

As you can see in my example, all the children are either text nodes or void (br) elements, so collecting all descendant text nodes or only the direct children would produce the same result: all my text nodes ARE direct children. The BeautifulSoup feature, exposed as the property Tag.stripped_strings, DOES get all descendant text nodes. From that perspective, it would probably make more sense to implement this as a recursive walker, in case some other user needs to retrieve text from grandchildren and great-grandchildren.
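
For illustration, a recursive walker along these lines might look like the sketch below. It only uses the methods from the example above (get_children, is_text, get_text, query_selector); the helper name descendant_texts, the whitespace trimming, and the sample markup are my own assumptions, not part of the library.

```lua
kpse.set_program_name "luatex"
local domobject = require "luaxml-domobject"

-- Hypothetical helper: collect every descendant text node of `el`
-- into the table `acc`, trimmed, skipping whitespace-only nodes.
local function descendant_texts(el, acc)
  acc = acc or {}
  -- non-element nodes may have no child list, so fall back to {}
  for _, child in ipairs(el:get_children() or {}) do
    if child:is_text() then
      local s = child:get_text():match("^%s*(.-)%s*$")
      if s ~= "" then
        acc[#acc + 1] = s
      end
    else
      -- descend into child elements (grandchildren and deeper)
      descendant_texts(child, acc)
    end
  end
  return acc
end

-- made-up sample with nested elements
local sample = "<td>first<br /><span>nested <b>deep</b></span> last</td>"
local dom = domobject.parse(sample)

local texts = {}
for _, td in ipairs(dom:query_selector("td")) do
  descendant_texts(td, texts)
end
print(table.concat(texts, ";"))
```

Passing the accumulator table down the recursion keeps the strings in document order and avoids concatenating intermediate tables.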

michal-h21 commented 1 year ago

So does the example do what you are looking for?

ssimo3lsuhsc commented 1 year ago

Yes.

michal-h21 commented 11 months ago

Sorry for not getting back to you sooner; I forgot about this. I've added two new functions to the DOM object, DOM_Object:strings() and DOM_Object:stripped_strings(). They can be used like this:

kpse.set_program_name "luatex"
local x = [[
<td aria-label="Pick a Session: Products : " tabindex="0">
08/01/2023 8:00 AM-9:00 AM Breakfast, Room 335 (Amount: 0.00 USD)<br>08/01/2023 9:00 AM-10:00 AM Welcome, Room 231 (Amount: 0.00 USD)
<br>
08/01/2023 2:00 PM-4:00 PM Breakout Session 2 (Amount: 0.00 USD, Pick a Session: EHS 101, Room 231)
<br>
08/02/2023 8:00 AM-10:00 AM Breakout Session 3 (Amount: 0.00 USD, Pick a Session: Literacy in the Classroom, Early Learning Center, 1st Floor)
<br>
08/02/2023 10:30 AM-12:30 PM Breakout Session 4 (Amount: 0.00 USD, Pick a Session: Learning Genie, Room 133)
<br>
08/02/2023 2:00 PM-4:00 PM Awards Reception, Room 335 (Amount: 0.00 USD)
<br>
Total: 0.00 USD
</td>
]]

local domobject = require "luaxml-domobject"
local dom = domobject.parse(x)

for k,v in ipairs(dom:stripped_strings()) do
  print(v)
end

This is the result:

$ texlua sample.lua 
08/01/2023 8:00 AM-9:00 AM Breakfast, Room 335 (Amount: 0.00 USD)
08/01/2023 9:00 AM-10:00 AM Welcome, Room 231 (Amount: 0.00 USD)
08/01/2023 2:00 PM-4:00 PM Breakout Session 2 (Amount: 0.00 USD, Pick a Session: EHS 101, Room 231)
08/02/2023 8:00 AM-10:00 AM Breakout Session 3 (Amount: 0.00 USD, Pick a Session: Literacy in the Classroom, Early Learning Center, 1st Floor)
08/02/2023 10:30 AM-12:30 PM Breakout Session 4 (Amount: 0.00 USD, Pick a Session: Learning Genie, Room 133)
08/02/2023 2:00 PM-4:00 PM Awards Reception, Room 335 (Amount: 0.00 USD)
Total: 0.00 USD
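
Since the original request was to run some function on each text run separately, here is a minimal sketch of what that per-run step could look like in plain Lua. The helper parse_run, its field names, and the pattern are assumptions made up for this example; the runs table stands in for the table returned by stripped_strings().

```lua
-- Hypothetical helper: split one session line into date, times and
-- description. Returns nil for lines that do not look like a session.
local function parse_run(line)
  local date, start_time, end_time, description =
    line:match("^(%d+/%d+/%d+) (%d+:%d+ [AP]M)%-(%d+:%d+ [AP]M) (.+)$")
  if not date then
    return nil
  end
  return {
    date = date,
    start_time = start_time,
    end_time = end_time,
    description = description,
  }
end

-- stand-in for the result of dom:stripped_strings()
local runs = {
  "08/01/2023 8:00 AM-9:00 AM Breakfast, Room 335 (Amount: 0.00 USD)",
  "08/02/2023 10:30 AM-12:30 PM Breakout Session 4 (Amount: 0.00 USD, Pick a Session: Learning Genie, Room 133)",
  "Total: 0.00 USD",
}

for _, line in ipairs(runs) do
  local session = parse_run(line)
  if session then
    print(session.date, session.start_time, session.end_time, session.description)
  else
    -- lines such as the total do not match the session pattern
    print("unparsed:", line)
  end
end
```

Because each run arrives as its own table entry, the parsing function never has to deal with br elements or surrounding whitespace.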