lwindolf / liferea

Liferea (Linux Feed Reader), a news reader for GTK/GNOME
https://lzone.de/liferea
GNU General Public License v2.0

Auto article scraper doesn't appear to work with danielberlinger.github.io #1140

Closed: sjehuda closed this issue 1 year ago

sjehuda commented 1 year ago

For http://danielberlinger.github.io/ Liferea extracts only a single item:

Friday Nov. 18, 2011 Droids

Attention. These are not... oh to hell with it.

Page source:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>A Start</title>
    <meta name="description" content="Scribbles from dev etc.">
    <meta name="author" content="Daniel Berlinger">
    <link rel="stylesheet" href="main.css?v=1.0">
  </head>
  <body>
    <div id='page'>
      <div class='main'>
        <div class='blog'>
          <article class='post'>
            <div class='date'>
              <a href='/'>Mon. Feb. 13, 2012</a>
            </div>
            <h1><a href='https://gist.github.com/29387035fc5f03f889dc'>Links about feature switching/config and realtime graphing</a></h1>
            <div class='content'>
              <p>A brief bit of research. Links are <a href="https://gist.github.com/29387035fc5f03f889dc">here</a>.</p>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Tue. Dec. 20, 2011</a>
            </div>
            <h1><a href='http://danielberlinger.github.com/turnings-all-wordpress-2011-12-20.xml'>RSS backup of turnings</a></h1>
            <div class='content'>
              <p>Part of the usual end of year house cleaning.</p>
              <p>It can also be found <a href="http://tales.phrasewise.com/turnings/eoy-2011/2011-12-20.xml">here</a>.</p>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Mon. Dec. 19, 2011</a>
            </div>
            <h1><a href='https://github.com/danielberlinger/reconciler'>Reconciler</a></h1>
            <div class='content'>
              This project creates ActiveRecord models for the sake of reconciling two database tables.It uses Redis to do the resolve the sets, and then you could can either send a message to some other system or use AR to create matching records. It was originally written to compare two systems that got their data from different sources that should have been the same but weren't. <a href="https://github.com/danielberlinger/reconciler">Here.</a>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Mon. Dec. 19, 2011</a>
            </div>
            <h1><a href='https://gist.github.com/1499006'>How Homebrew starts Redis</a></h1>
            <div class='content'>
              <script src="https://gist.github.com/1499006.js?file=indirect_start_redis_server.sh"></script>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Wed. Dec. 14, 2011</a>
            </div>
            <h1><a href='https://gist.github.com/945080'>Extendable Metal Redirects</a></h1>
            <div class='content'>
              To long to embed. <a href="https://gist.github.com/945080">Here.</a>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Wed. Dec. 14, 2011</a>
            </div>
            <h1><a href='https://gist.github.com/47cd6a3a1dc198fd033b'>A hooks pattern in Ruby</a></h1>
            <div class='content'>
              To long to embed. <a href="https://gist.github.com/47cd6a3a1dc198fd033b">Here.</a>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Tues. Nov. 22, 2011</a>
            </div>
            <h1><a href='https://gist.github.com/1386036'>Simple RedisToGo connection in Ruby</a></h1>
            <div class='content'>
              <script src="https://gist.github.com/1386036.js"> </script>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Friday Nov. 18, 2011</a>
            </div>
            <h1><a href='http://danielberlinger.github.com/rsd'>Really Simple Discovery</a></h1>
            <div class='content'>
              <p>Really Simple Discovery (RSD) is an XML format and a publishing convention for making services exposed by blog, or other web software, discoverable by client software. It reduces the information required to set up editing/blogging software to a minimum.</p>
              <p><a href="http://archipelago.phrasewise.com/display?page=oldsite/1330.html">The original</a></p>
              <p><a href="http://tales.phrasewise.com/rfc/rsd.html">The copy Google points to most prominently as I write this.</a></p>
              <p><a href="http://danielberlinger.github.com/rsd">Re-archived here.</a></p>
              <p>Repository here: <a href="http://github.com/danielberlinger/rsd">Really Simple Discovery</a></p>
              <p><a href="http://en.wikipedia.org/wiki/Really_Simple_Discovery">WikiPedia Article </a></p>
            </div>
          </article>
          <article class='post'>
            <div class='date'>
              <a href='/'>Friday Nov. 18, 2011</a>
            </div>
            <h1><a href='/'>Droids</a></h1>
            <div class='content'>
              <p>Attention. These are not... oh to hell with it.</p>
            </div>
          </article>
        </div>
      </div>
    </div>
  </body>
</html>

lwindolf commented 1 year ago

The problem is that the date link in every post points to "/". Liferea needs to extract a link to construct the headline identity, and to do so it uses the first link it finds. Since that first link is "/" for every post, the identity check treats all headlines as the same item, and only one (the last one) is kept.

As headline identity is very important, I see no alternative to extracting the first link and using it.
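
The effect described above can be reproduced with a small sketch. This is not Liferea's actual code (Liferea is written in C); it is a minimal Python illustration that assumes each `<article>` is one item, that the first `<a href>` inside it serves as the identity, and that a later item with the same identity replaces an earlier one. The names `FirstLinkPerArticle`, `dedupe_by_first_link`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser


class FirstLinkPerArticle(HTMLParser):
    """Collect the first <a href> found inside each <article> element."""

    def __init__(self):
        super().__init__()
        self.first_links = []     # one entry per <article> that contains a link
        self._in_article = False
        self._found = False

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            # A new item starts: reset the "already found a link" flag.
            self._in_article = True
            self._found = False
        elif tag == "a" and self._in_article and not self._found:
            href = dict(attrs).get("href")
            if href is not None:
                self.first_links.append(href)
                self._found = True

    def handle_endtag(self, tag):
        if tag == "article":
            self._in_article = False


def dedupe_by_first_link(html):
    """Key items by their first link; a later item overwrites an earlier one
    with the same key (an assumption made for this illustration)."""
    parser = FirstLinkPerArticle()
    parser.feed(html)
    items = {}
    for position, link in enumerate(parser.first_links):
        items[link] = position
    return parser.first_links, items


# Two posts trimmed from the page source quoted in this issue.
SAMPLE = """
<article class='post'>
  <div class='date'><a href='/'>Mon. Dec. 19, 2011</a></div>
  <h1><a href='https://gist.github.com/1499006'>How Homebrew starts Redis</a></h1>
</article>
<article class='post'>
  <div class='date'><a href='/'>Friday Nov. 18, 2011</a></div>
  <h1><a href='/'>Droids</a></h1>
</article>
"""

first_links, items = dedupe_by_first_link(SAMPLE)
print(first_links)  # the date link "/" comes first in both posts
print(items)        # both posts collapse into a single entry keyed by "/"
```

Because the date link precedes the headline link in every post, both posts get the identity "/" and only the last one survives, which matches the single "Droids" item reported above.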