philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Add async version of traverse_and_update #325

Closed: thiagomajesk closed this issue 3 years ago

thiagomajesk commented 3 years ago

Hi! I'm using Floki to traverse and update a document for which I'm generating URL previews. Since I unfurl those URLs while traversing the document, the operation can get quite expensive when done synchronously. I'd therefore like to propose a traverse_and_update_async function that would allow the matched nodes to be processed in parallel. Something like this:

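# Proposed API (does not exist in Floki); html_tree is the parsed document
Floki.traverse_and_update_async(html_tree, fn
  {"a", [{"href", href}], _children} ->
    # Replace each link with a preview built in a separate process
    Task.async(fn -> {"div", [], unfurl(href)} end)

  other ->
    # Every other node is returned unchanged, still wrapped in a Task
    Task.async(fn -> other end)
end)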

The traverse_and_update_async function would expect the callback to return a Task for each node, and at the end we could call Task.await_many(tasks) to collect the results.

PS: I think that, depending on how we want to treat nested nodes, we would have to preemptively evaluate/await some of the tasks because the modified value doesn't exist yet.
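
The fan-out and collect step described in this comment can be sketched on a plain list of URLs, without touching Floki. This is only an illustration: `FanOutSketch` and its `unfurl/1` are hypothetical stand-ins (the real `unfurl/1` from the issue is not shown anywhere in the thread), and the timeout value is an assumption. `Task.async/1` and `Task.await_many/2` are standard library functions (Elixir 1.11 or later for `await_many`).

```elixir
defmodule FanOutSketch do
  # Hypothetical stand-in for the issue's `unfurl/1`; a real one would
  # fetch the URL over HTTP and build a preview.
  def unfurl(href), do: "Preview of " <> href

  def unfurl_all(hrefs, timeout \\ 5_000) do
    # Starts one process per href, with no upper bound on concurrency.
    tasks = Enum.map(hrefs, fn href -> Task.async(fn -> unfurl(href) end) end)

    # Blocks until every task has replied; results keep the order of `tasks`.
    Task.await_many(tasks, timeout)
  end
end

FanOutSketch.unfurl_all(["https://a.example", "https://b.example"])
```

Note that this form spawns as many processes as there are links in the document.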

philss commented 3 years ago

Hi @thiagomajesk, thanks for opening the issue! :purple_heart:

I'm inclined not to add this feature because you wouldn't have control over how many processes it creates, and also because I want to keep Floki free of processes in order to keep it simple.

Considering that we would have to traverse the tree again to await the modifications, what about this approach for your case:

WDYT?
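
The snippet posted with this comment is not preserved in this copy of the thread. As a rough sketch only, and not necessarily the approach that was suggested, one way to get parallel unfurling while keeping Floki itself synchronous is to do two passes: collect the links, unfurl them with a bounded `Task.async_stream/3`, then run the ordinary `Floki.traverse_and_update/2` with simple map lookups. `PreviewSketch`, its `unfurl/1` stub, the `max_concurrency` default, the timeout, and the Floki version requirement are all assumptions made for illustration.

```elixir
# Assumes Elixir 1.12+ (for Mix.install); the version requirement is a guess.
Mix.install([{:floki, "~> 0.30"}])

defmodule PreviewSketch do
  # Hypothetical stand-in for the issue's `unfurl/1`; it returns the
  # children of the replacement <div>.
  def unfurl(href), do: ["Preview of " <> href]

  def replace_links(html_tree, max_concurrency \\ 4) do
    # Pass 1: collect every distinct href found on <a> elements.
    hrefs = html_tree |> Floki.attribute("a", "href") |> Enum.uniq()

    # Unfurl concurrently, at most `max_concurrency` at a time.
    # Results come back in input order, so they can be zipped with the hrefs.
    previews =
      hrefs
      |> Task.async_stream(&unfurl/1, max_concurrency: max_concurrency, timeout: 10_000)
      |> Enum.zip(hrefs)
      |> Map.new(fn {{:ok, preview}, href} -> {href, preview} end)

    # Pass 2: plain synchronous traversal that only does map lookups.
    Floki.traverse_and_update(html_tree, fn
      {"a", attrs, _children} = tag ->
        case List.keyfind(attrs, "href", 0) do
          {"href", href} -> {"div", [], Map.fetch!(previews, href)}
          nil -> tag
        end

      other ->
        other
    end)
  end
end

"<p><a href=\"https://example.com\">x</a></p>"
|> Floki.parse_document!()
|> PreviewSketch.replace_links()
|> Floki.raw_html()
|> IO.puts()
```

With this shape the caller decides how many processes run at once, and no process handling needs to live inside Floki.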

thiagomajesk commented 3 years ago

Hi @philss!

> I'm inclined not to add this feature because you wouldn't have control over how many processes it creates, and also because I want to keep Floki free of processes in order to keep it simple.

Hmm, I see... I'll close the issue then.

BTW, thanks for trying to help. I'll test your suggestion. Cheers!