madwort / link_scraper

Simple PHP link scraper for @rconstanzo

parser not searching whole page #4

Closed rconstanzo closed 8 years ago

rconstanzo commented 9 years ago

most evident on the long blog posts by checking links/vids towards the bottom of page.

for example : http://johnlely.co.uk/ near the bottom of http://www.rodrigoconstanzo.com/2015/04/making-decisions-in-time/

or http://www.lcasserley.co.uk/ at the bottom of this : http://www.rodrigoconstanzo.com/2015/06/cut-glove/

madwort commented 8 years ago

Ok, I've been doing some digging into this. What's happening is that there are errors in your HTML, and these are tripping up both of the open-source HTML parsers written in PHP that I've tried. This is a very fiddly area: browsers tend to make a "best guess" when confronted with malformed HTML and therefore often render something that looks ok, patching over any confusion under the hood, but programmatic parsers often aren't as highly developed.

I've been using the source for http://www.rodrigoconstanzo.com/2015/04/making-decisions-in-time/ as a test case, the problem that trips it up is on line 260 (when downloaded as a complete HTML file):

In the time between coming up with the system and creating the initial analyses and visualization 
tools I found that my ability to improvise had improved. Outlining these <em>decision streams</em>
created a mental framework, outside of data analysis, that allowed me to critically think and talk about 
my improvisation. This would materialize in a performative setting by allowing me to focus on the types 
of decisions I was making, and when they should happen, in a kind of meta-formal capacity/way.</p>

The </p> tag at the end is an erroneous extra tag that does not correspond to a <p> & causes the parsers I've tried to barf here.
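To make the failure mode concrete, here is a minimal sketch in Python (not the PHP used by this scraper) that reports end tags with no matching open element, roughly what the validator means by "end tag for element ... which is not open". The class and function names are hypothetical, and the open-tag stack is deliberately simple: it ignores HTML's implied-end-tag rules, so treat it as a rough check rather than a validator.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StrayEndTagFinder(HTMLParser):
    """Hypothetical helper: records end tags that close nothing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.stray = []       # (tag, line, column) for each unmatched end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens nothing.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop back to the matching open tag, discarding unclosed children.
            while self.open_tags.pop() != tag:
                pass
        else:
            line, column = self.getpos()
            self.stray.append((tag, line, column))


def find_stray_end_tags(html):
    finder = StrayEndTagFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


if __name__ == "__main__":
    sample = '<p>text1</p>\n<div class="pullquote">text</div>bork</p>'
    for tag, line, column in find_stray_end_tags(sample):
        print(f"line {line}, column {column}: stray </{tag}>")
```

Run against a saved copy of a page, this should point at the same kind of lines the W3C validator flags, without the noise from the other error classes.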

The best thing would be if you were able to go through all the pages that need to be scraped & fix the errors - you can find where they are by using the W3C validator, eg. https://validator.w3.org/check?uri=http%3A%2F%2Fwww.rodrigoconstanzo.com%2F2015%2F04%2Fmaking-decisions-in-time%2F&charset=%28detect+automatically%29&doctype=Inline&group=0

You can ignore some of these errors (there are lots of errors on line 46 of that file caused by unescaped JavaScript, but fixing that looks like it'll mean editing one of your plugins, which is probably going to get annoying). The ones I'm interested in first are the extra end tags, e.g.

Line 186, Column 2172: end tag for element "p" which is not open

Do you think that's possible...?

madwort commented 8 years ago

(This will fix the tag cloud parser too...)

madwort commented 8 years ago

Seems to be a pattern that whenever a paragraph starts with <div class="su-pullquote su-pullquote-align-right “”"><strong><em>... it's missing a <p> so we see this error.

Just done a manual fix-up on the decisions page just removing bad </p> tags & verified that my parser then picks up loads more links (including eg. http://johnlely.co.uk ).

Nb. this could also be fixed by adding <p> tags in the appropriate place... It would be like this:

<div class="su-pullquote su-pullquote-align-right “”">...</div> then add <p> then the paragraph that ends with an unmatched </p>
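The verification step mentioned in this comment (checking that links near the bottom of the page are picked up) could be sketched as follows. This is an illustration in Python, not the scraper's actual PHP code; `LinkCollector` and `extract_links` are hypothetical names. Python's `html.parser` is event-based and does not build a tree, so a stray `</p>` does not stop it from reaching later links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # A fragment shaped like the problem markup: a div followed by text
    # that ends with an unmatched </p>, then a later paragraph.
    page = (
        '<div class="pullquote">quote</div>text '
        '<a href="http://johnlely.co.uk/">John Lely</a></p>\n'
        '<p><a href="http://www.lcasserley.co.uk/">L. Casserley</a></p>'
    )
    links = extract_links(page)
    print("http://johnlely.co.uk/" in links)
```

Comparing the extracted list before and after a fix-up is a quick way to confirm that links from the lower part of a page are no longer being dropped.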

rconstanzo commented 8 years ago

Hmm, yeah I can fix that stuff. I only really do pullquotes on the two big chapters (makingdecisions and cutglove), but those are also the longest two.

Ok had a look through and I guess the issue is with the plugin itself. In the text view in wordpress there's no extra </p> tags. In fact the only <p> tags at all are in the quoted text from you where I have <p style="padding-left: 30px;">.

madwort commented 8 years ago

Ahh, hmm, maybe a bug in the plugin then. By default, if you have double line breaks in the text as viewed from the WordPress edit page, it will wrap each block of text between them in <p> tags when rendering the page.

So this

text1

text2

is rendered as

<p>text1</p>
<p>text2</p>
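As a rough model of that wrapping behaviour, the sketch below splits text on blank lines and wraps each chunk in <p> tags. It is written in Python for illustration and is a simplification under stated assumptions: it is not WordPress's real `wpautop` function, which handles many more cases (block-level elements, shortcodes, single line breaks). The function name `autop_sketch` is hypothetical.

```python
import re


def autop_sketch(text):
    """Wrap each blank-line-separated chunk of text in <p>...</p>.

    A deliberately naive model of automatic paragraph wrapping;
    it does no special handling of block elements or shortcodes.
    """
    chunks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(f"<p>{chunk.strip()}</p>" for chunk in chunks if chunk.strip())


if __name__ == "__main__":
    print(autop_sketch("text1\n\ntext2"))
```

With this naive model every opening <p> gets a matching </p>, so an unmatched </p> in the rendered page suggests something else (such as a shortcode expanding into a block element) is interfering with the wrapping.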

How are you triggering the plugin then? Looks like something like this is going on:

text1

[pullquote]text[/pullquote]bork

->

<p>text1</p>

<div pullquote>text</div>bork</p>

madwort commented 8 years ago

Maybe doing something like this in Wordpress text mode will work around it?

text1

[pullquote]text[/pullquote]
maybe-not-bork

?

rconstanzo commented 8 years ago

So an example paragraph with a pullquote in looks like this in the editor:

[su_pullquote align="right" class=“”]<em><strong>Conduction and Sound Painting are fundamentally modeled after a composer/performer hierarchy.</strong></em>[/su_pullquote]I view the compositional approach in <em>Cobra</em> as music by <em>improvisers</em> for improvisers. This is in contrast to the more top-down approach of something like <a href="https://en.wikipedia.org/wiki/Butch_Morris" target="_blank">Butch Morris’ Conduction</a> or <a href="https://en.wikipedia.org/wiki/Walter_Thompson_(composer)" target="_blank">Walter Thompson</a>’s <a href="https://en.wikipedia.org/wiki/Soundpainting" target="_blank">Sound Painting</a>. Both of those compositions—or rather, approaches to music—as they extend beyond a single composition, are fundamentally modeled after a composer/performer hierarchy. The conductors of these systems generally prescribe content, with varying degrees of specificity. Even built into the titles are the hierarchies of the <em>conductor</em> and the <em>painter</em>, of a single creator using the performers as their instruments/paints. I view this approach to composition as music by <em>composers</em> for improvisers.

madwort commented 8 years ago

Can you put a line-break between [/su_pullquote] and I? Does that fix the HTML? And does it break your visual formatting...?

madwort commented 8 years ago

(If the pullquote div is floated hopefully it will look the same)
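If there are many pullquotes to fix, the workaround suggested above (a line break between the closing shortcode and the text that follows) could be applied to exported post text with a small script. The sketch below is in Python and rests on assumptions: the function name is hypothetical, and whether a single newline is the right separator for a given WordPress setup should be checked on one post first.

```python
import re


def break_after_pullquote(text, closing_tag="[/su_pullquote]", separator="\n"):
    """Insert `separator` after each closing tag that is immediately
    followed by non-whitespace text. Closing tags already followed by
    whitespace (or by the end of the text) are left untouched.
    """
    pattern = re.escape(closing_tag) + r"(?=\S)"
    return re.sub(pattern, lambda match: match.group(0) + separator, text)


if __name__ == "__main__":
    before = '[su_pullquote align="right"]quote[/su_pullquote]I view the approach'
    print(break_after_pullquote(before))
```

Because already-separated pullquotes are skipped, running it twice over the same text makes no further changes.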

rconstanzo commented 8 years ago

Hmm, that seems to work. I did the first one on this page: http://www.rodrigoconstanzo.com/2015/04/making-decisions-in-time/

Starting on "Shortly after conceiving..."

rconstanzo commented 8 years ago

Yeah that appears to look exactly the same. I did it to all of them on this page (first) since it's much shorter: http://www.rodrigoconstanzo.com/2015/05/this-is-why/

rconstanzo commented 8 years ago

Ok I've done all of the Making Decisions page now. The Cut Glove one has the same pullquotes in, so if this fixes that, I'll do it to that page too.

madwort commented 8 years ago

Just tested "Making Decisions", pretty sure that it's getting all the links now. Could you do this pullquotes fixup to any other relevant pages, then I can re-run the parser & email you the results?

rconstanzo commented 8 years ago

Yeah done it to all the pages now!

madwort commented 8 years ago

Ok, just run it, try the attached for size!

Tom

On 19 Dec 02015, at 17:01, Rodrigo Constanzo notifications@github.com wrote:

Yeah done it to all the pages now!


rconstanzo commented 8 years ago

Looks like some of the ones towards the bottom of the cut glove blog post have gotten lumped together into superlinks (like this paper below). At this point really, it's not really important. The links are there and I need to do some clean up and remove redundancies etc...

So I think this issue is closed!

this paper -http://journals.cambridge.org/action/displayFulltext?type=1&fid=6357100&jid=SAM&volumeId=3&issueId=04&aid=6357092 -http://www.fastcodesign.com/3027564/asides/scientists-debunk-the-myth-that-10000-hours-of-practice-makes-you-an-expert -http://www.rodrigoconstanzo.com/the-party-van/ -http://samandreae.com/ -http://www.rodrigoconstanzo.com/2013/02/strikethrough-me-you-battle-pieces/ -http://www.rodrigoconstanzo.com/2011/05/towards-the-beat-of-a-different-drummer-a-journey-into-the-loss-of-fidelity-in-drums-and-electronics/ -http://www.rodrigoconstanzo.com/the-party-van/ -http://adamsliwinski.blogspot.co.uk/2013/07/read-memorize-cheat.html -http://www.perceptionweb.com/abstract.cgi?id=p7196 -http://en.wikipedia.org/wiki/Halo_(series) -http://line6.com/dl4/ -http://www.rodrigoconstanzo.com/karma/ -http://www.loopers-delight.com/tools/echoplex/echoplex.html -http://www.altruistmusic.com/EDP/ -http://www.rodrigoconstanzo.com/karma/ -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/replace_append.jpg -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/machinegun.jpg -http://samandreae.com/ -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/pitch_curve_original.jpg -http://en.wikipedia.org/wiki/Signedness -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/adaptive_smoothing.jpg -http://en.wikipedia.org/wiki/Halo_4 -http://halo.wikia.com/wiki/Sniper_Rifle_System_99-Series_5_Anti-Mat%C3%A9riel -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/zoom.jpg -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/analog_mapping.jpg -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/05/stutterMapping.jpg -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/06/dirt_fx.jpg -http://en.wikipedia.org/wiki/Street_Fighter -http://en.wikipedia.org/wiki/Fighting_game#Special_attacks -http://en.wikipedia.org/wiki/Street_Fighter#Hadouken -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/06/pattr_storage.jpg 
-http://www.rodrigoconstanzo.com/wp-content/uploads/2015/06/combo_abstraction.jpg -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/06/presets_bpatcher.jpg -http://line6.com/dl4/ -http://www.rodrigoconstanzo.com/wp-content/uploads/2015/06/pattern_recorder.jpg