pipes-digital / pipes

Repository for Pipes
https://pipes.digital
GNU Affero General Public License v3.0

Error when trying to extract tags with invalid IDs #70

Closed (anewuser closed this issue 4 years ago)

anewuser commented 4 years ago

This website incorrectly uses IDs instead of classes to style similar elements. Interestingly, Pipes lists all the item titles (#contents_title) correctly, but it gives an error when it tries to extract #contents_text.

https://www.pipes.digital/editor/7N3d1EOy

onli commented 4 years ago

The mismatch arises because two different pieces of software are involved. The extractor UI is pure JavaScript, while the extract block uses Ruby's Nokogiri.

Not sure what to do here. Maybe extract //html or //body and replace id=" with class="? But I just tried that, and for this site it does not work, maybe because the HTML is too broken. What does work is to extract //div, replace id with class, and then extract .contents_title. See https://www.pipes.digital/editor/3Npgdr9B. I made this public, so you could fork it and then feed it into your existing pipe via a pipe block.
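The replace step in that workaround can be sketched in plain Ruby. This is only an illustration under assumptions: `ids_to_classes` is a hypothetical helper, not something Pipes provides, and the sample markup is made up to mimic a page that reuses the same ID.

```ruby
# Hypothetical helper (not part of Pipes): turn every id="..." attribute
# into class="..." so that repeated IDs become ordinary repeated classes.
def ids_to_classes(html)
  # Only match "id=" when it follows whitespace, so attributes such as
  # data-id="..." are left alone.
  html.gsub(/(?<=\s)id=(["'])/, 'class=\1')
end

# Made-up sample markup with a duplicated ID, as on the site in question.
sample = <<~HTML
  <div id="contents_title">First</div>
  <div id="contents_title">Second</div>
HTML

fixed = ids_to_classes(sample)
puts fixed
```

The rewritten string could then be parsed with Nokogiri and queried with `.css('.contents_title')`. Note the limits of this text-level approach: an element that already has a class attribute would end up with two, and any literal id=" that appears in page text after whitespace would be rewritten as well.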

Fixing this automatically with an HTML fixer script would be nice, but I'm not aware of one for Ruby that we could use.

anewuser commented 4 years ago

I meant that the titles are extracted correctly and the feed is working as it is. The problem happens with #contents_text, and I think the reason is that each tag with #contents_text also has a child with a duplicated ID. That would explain why using #inner_box as a selector doesn't work either.
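One rough way to check a page for this kind of problem is to count how often each ID occurs. The sketch below is an assumption-laden illustration: `duplicated_ids` is a hypothetical helper, it scans the raw text with a regular expression rather than parsing the HTML, and the sample markup is invented, not copied from the site.

```ruby
# Hypothetical helper: return a hash of ID values that occur more than once,
# mapped to the number of occurrences. Works on the raw HTML string.
def duplicated_ids(html)
  counts = Hash.new(0)
  # Capture the value of each id attribute (preceded by whitespace).
  html.scan(/(?<=\s)id=["']([^"']+)["']/).flatten.each do |value|
    counts[value] += 1
  end
  counts.select { |_, n| n > 1 }
end

# Invented sample: nested elements whose IDs are both repeated.
sample = <<~HTML
  <div id="contents_text"><div id="inner_box">a</div></div>
  <div id="contents_text"><div id="inner_box">b</div></div>
  <div id="footer">c</div>
HTML

p duplicated_ids(sample)
```

In this sample both contents_text and inner_box are reported as duplicated, while footer is not.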

Anyway, since the whole code can't be fixed at once and other people are unlikely to find this problem, I'm going to close this. I've contacted the website author to see if he can fix it on his side.