anewuser closed 4 years ago
The mismatch arises because two different pieces of software are involved: the extractor UI is pure JavaScript, while the extract block uses Ruby's Nokogiri.
Not sure what to do here. Maybe extract //html or //body and replace id=" with class="? But I just tried that, and for this site it does not work; maybe the HTML is too broken. What does work is to extract //div, replace the id with class, and then extract .contents_title. See https://www.pipes.digital/editor/3Npgdr9B. I made this public, so you could fork it and then feed it via a pipe block into your existing pipe.
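For anyone who wants to see the idea outside of Pipes, here is a rough sketch of the id-to-class rewrite in plain Ruby. The sample markup and the helper names ids_to_classes and texts_with_class are made up for illustration. The regex lookup only stands in for a real class selector and is not what Pipes or Nokogiri does internally.

```ruby
# Hypothetical sample markup: the same id values repeated across items,
# similar in spirit to the site discussed in this thread.
SAMPLE_HTML = <<~HTML
  <div id="inner_box">
    <div id="contents_title">First title</div>
    <div id="contents_text">First text</div>
  </div>
  <div id="inner_box">
    <div id="contents_title">Second title</div>
    <div id="contents_text">Second text</div>
  </div>
HTML

# Turn every id attribute into a class attribute, so repeated values
# become legal and can be matched with a class selector afterwards.
# The lookbehind avoids touching attributes such as data-id="...".
# Caveat: an element that already has a class ends up with two class attributes.
def ids_to_classes(html)
  html.gsub(/(?<=\s)id="/, 'class="')
end

# Naive regex stand-in for a class selector such as .contents_title.
# It only handles elements with a single class and plain text content.
def texts_with_class(html, class_name)
  html.scan(/class="#{Regexp.escape(class_name)}"[^>]*>([^<]*)</).flatten
end

fixed = ids_to_classes(SAMPLE_HTML)
titles = texts_with_class(fixed, 'contents_title')
puts titles
```

In a real pipe the second step would be the extract block with .contents_title (or .contents_text) rather than a regex.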
Fixing this automatically with an HTML fixer script would be nice, but I'm not aware of one for Ruby that we could use.
I meant that the titles are extracted correctly and the feed is working as it is. The problem happens with #contents_text, and I think the reason is that each tag with #contents_text also has a child with a duplicated ID. That explains why using #inner_box as a selector doesn't work either.
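To check a page for this kind of problem, something like the following plain Ruby sketch could list the id values that occur more than once. The markup is invented (I am only assuming a nested layout like the one described), and duplicated_ids is a hypothetical helper, not part of Pipes.

```ruby
# Hypothetical markup: every #contents_text wraps a child whose id is
# also repeated, which is the situation suspected above.
SNIPPET = <<~HTML
  <div id="contents_text"><p id="inner_box">A</p></div>
  <div id="contents_text"><p id="inner_box">B</p></div>
  <div id="footer">C</div>
HTML

# Count how often each id value appears and keep those used more than once.
def duplicated_ids(html)
  counts = html.scan(/(?<=\s)id="([^"]*)"/).flatten.each_with_object(Hash.new(0)) do |id, acc|
    acc[id] += 1
  end
  counts.select { |_id, count| count > 1 }
end

p duplicated_ids(SNIPPET)
```

Any id reported here is one that a selector like #contents_text cannot reliably address.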
Anyway, since the whole code can't be fixed at once and other people are unlikely to find this problem, I'm going to close this. I've contacted the website author to see if he can fix it on his side.
This website incorrectly uses IDs instead of classes to style similar elements. Interestingly, Pipes lists all the item titles (#contents_title) correctly, but it gives an error when it tries to extract #contents_text. https://www.pipes.digital/editor/7N3d1EOy