taroved / pol

RSS generator website
MIT License

Simpler Architecture (without database) #4

Closed: Xyrio closed this issue 6 years ago

Xyrio commented 6 years ago

You could make it work without a database at all. Just put everything into a link like:

http://politepol.com/en/setup?url=https%3A//github.com/rssowl/RSSOwl/issues&title=a.link-gray-dark&description=span.opened-by

That link can then be used in any RSS aggregator to fetch the feed, and you save yourself many potential problems with a database.
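
A minimal sketch (standard-library Python only; the parameter names just mirror the example link above) of how such a link could be decoded back into feed settings without any server-side storage:

# decode feed settings straight from the setup link; nothing is stored server-side
from urllib.parse import urlsplit, parse_qsl

setup_link = ("http://politepol.com/en/setup?url=https%3A//github.com/rssowl/RSSOwl/issues"
              "&title=a.link-gray-dark&description=span.opened-by")

settings = dict(parse_qsl(urlsplit(setup_link).query))
print(settings["url"])          # https://github.com/rssowl/RSSOwl/issues
print(settings["title"])        # a.link-gray-dark
print(settings["description"])  # span.opened-by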

taroved commented 6 years ago

Before you press the "Create" (feed) button, the database is not used. The database is used for saving feed properties.

If you have any efficiency or configuration problems with the database, please let me know.

Xyrio commented 6 years ago

Put the feed properties into the URL.

You can still track the resulting URL when it is requested on your server if you like, but having a database should be optional.

There is no problem with the database except that I don't want to configure, run, and maintain one.

taroved commented 6 years ago

This concept is interesting, but you will have potential problems with post creation times. Currently, every new post has a creation time, and that information is included in the feed. Without a database, this timestamp is not saved, and from time to time users will run into problems related to it: some feed readers use these timestamps to decide whether a post is new or old, so they may see the same post as new again and again.

I can create a databaseless mode that can be switched on if you don't care about the potential problems with timestamps.

Xyrio commented 6 years ago

The creation-time issue is a good point, but you could put the creation date into the link as a parameter too and just use it from there. This would also allow people to easily manipulate the creation time if needed. Simply use the link as storage instead of the database.

Regarding the update datetime (#3) for the simpler architecture, I would set it by default to be the same as the creation date, unless there is a selector to extract it; when that is missing or unparsable, fall back to the creation datetime (see the sketch below).
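
A minimal sketch of that fallback, assuming the "2017-06-05 14:03" date format used elsewhere in this thread (parse_datetime is a hypothetical stand-in for whatever date parser the project uses):

# sketch: per-item update date comes from a selector value if one is given and
# parses; otherwise it falls back to the creation date carried in the link.
from datetime import datetime

def parse_datetime(text):  # hypothetical stand-in for the project's date parser
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M")
    except (ValueError, AttributeError):
        return None

def item_update_date(created, updated_text=None):
    parsed = parse_datetime(updated_text) if updated_text else None
    return parsed if parsed is not None else created  # missing/unparsable: fall back

created = parse_datetime("2017-06-05 14:03")
print(item_update_date(created))                      # no selector: creation date
print(item_update_date(created, "2017-06-06 09:30"))  # individual update date
print(item_update_date(created, "yesterday"))         # unparsable: creation date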

taroved commented 6 years ago

I meant that every post (entry, record) has a creation date and time, not the feed. Every feed has a list of posts. When a new post appears, its creation time is saved to the DB and provided to the user. Should I clarify?

In any case, this databaseless mode is easy to implement. If the mode turns out to be popular, it will be implemented sooner. Maybe some contributor will find time for this.

Xyrio commented 6 years ago

Yes: every post gets the same creation date and the same update date by default. Only when a CSS selector is specified do you use an individual update date.

I tried it out and it works. It recognizes updates when either pubDate or lastBuildDate changes while the guid stays the same. I would set the guid to be the same as the link for each individual item.

Tested with RSSOwl.

Tested with one item, but it should work for many just the same.

I'm using different pubDate and lastBuildDate values in the test to see what ends up in the reader (pubDate is preferred when both change):

<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
  <guid>http://one.link</guid>
  <author>one.autor</author>
  <title>one.title</title>
  <link>http://one.link</link>
  <pubDate>06 Sep 2009 16:20:00 +0000</pubDate>
  <lastBuildDate>06 Sep 2010 00:01:00 +0000</lastBuildDate>
</item>
</channel>
</rss>
<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
  <guid>http://one.link</guid>
  <author>one.autor change</author>
  <title>one.title change</title>
  <link>http://one.link</link>
  <pubDate>06 Sep 2009 16:20:00 +0000</pubDate>
  <lastBuildDate>06 Sep 2010 00:02:00 +0000</lastBuildDate>
</item>
</channel>
</rss>
<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
  <guid>http://one.link</guid>
  <author>one.autor change</author>
  <title>one.title change</title>
  <link>http://one.link</link>
  <pubDate>06 Sep 2009 16:21:00 +0000</pubDate>
  <lastBuildDate>06 Sep 2010 00:01:00 +0000</lastBuildDate>
</item>
</channel>
</rss>
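
A minimal sketch of the detection logic described above (my reading of RSSOwl's observable behavior, not its actual code): an item counts as updated when its guid is already known but pubDate or lastBuildDate has changed.

# track (pubDate, lastBuildDate) per guid and classify each sighting
seen = {}  # guid -> (pubDate, lastBuildDate)

def classify(guid, pub_date, last_build_date):
    if guid not in seen:
        seen[guid] = (pub_date, last_build_date)
        return "new"
    if seen[guid] != (pub_date, last_build_date):
        seen[guid] = (pub_date, last_build_date)
        return "updated"
    return "unchanged"

# the three feed snapshots above, in order:
print(classify("http://one.link", "06 Sep 2009 16:20:00 +0000", "06 Sep 2010 00:01:00 +0000"))  # new
print(classify("http://one.link", "06 Sep 2009 16:20:00 +0000", "06 Sep 2010 00:02:00 +0000"))  # updated
print(classify("http://one.link", "06 Sep 2009 16:21:00 +0000", "06 Sep 2010 00:01:00 +0000"))  # updated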

If that's not enough to convince you, I give up :)

taroved commented 6 years ago

OK. In any case, this mode makes sense only with the full list of fields (pub date and such), so this functionality may not be implemented in the next update.

But if you do a pull request with the required changes, I will have to respect you ;)

taroved commented 6 years ago

You have a great opportunity to get the feature within a week if you really need it.

Just donate: https://www.bountysource.com/issues/52548318-simpler-architecture

Let me know what you think.

Xyrio commented 6 years ago

Found this for GitHub: https://api.github.com/repos/taroved/pol/issues

I wrote this prototype; feel free to use what you like:

#pip install beautifulsoup4
#(pyquery and lxml have faster selectors than bs4 but bs4 is pure python?)
#https://gist.github.com/MercuryRising/4061368
from bs4 import BeautifulSoup #parse websites, css selectors 
import urllib.request
import urllib.parse
import os.path
import re

def createRequestLink():
    paramsDict = {
        "url":"https://github.com/taroved/pol/issues", 
        "stitle":"a.link-gray-dark", #a* default #text
        "slink":"a.link-gray-dark", #a* default attr.href
        "sdescription":"span.opened-by", #a* default #text
        "created-feed":"2017-06-05 14:03", #a* default #text
        #"screated":"?", #a* default #text
        #"supdated":"relative-time", #"aupdated":"datetime", #a* default "#text"
    }
    params = urllib.parse.urlencode(paramsDict, quote_via=urllib.parse.quote_plus)
    return "http://foo.bar/?%s" % params

def getParams(urlin):
    parseResult = urllib.parse.urlsplit(urlin)
    print(parseResult)
    return urllib.parse.parse_qsl(parseResult.query)

def rssgen(params, html):
    print(params)
    url = params["url"]
    site = url.split("/")
    site = site[0]+"//"+site[2] #http://site.com

    soup = BeautifulSoup(html, "html.parser")

    created_feed = params.get("created-feed")

    sel_title       = params["stitle"]
    sel_link        = params.get("slink")
    sel_description = params.get("sdescription")
    sel_created     = params.get("screated")
    sel_updated     = params.get("supdated")

    attr_title       = params.get("atitle")
    attr_link        = params.get("alink")
    attr_description = params.get("adescription")
    attr_created     = params.get("acreated")
    attr_updated     = params.get("aupdated")

    def getAttr(tag, attr):
        return tag.attrs.get(attr)

    titles = soup.select(sel_title)
    if attr_title:
        titles = [ getAttr(e, attr_title) for e in titles ]
    else:
        titles = [ re.sub(r"\s+"," ","".join(e.strings)).strip() for e in titles ]

    links = sel_link if sel_link else sel_title
    if links:
        links = soup.select(links)
        if attr_link:
            links = [ getAttr(e, attr_link) for e in links ]
        else:
            links = [ e["href"] if e["href"].startswith("http") else site+e["href"] for e in links ]

    descriptions = sel_description
    if descriptions:
        descriptions = soup.select(descriptions)
        if attr_description:
            descriptions = [ getAttr(e, attr_description) for e in descriptions ]
        else:
        descriptions = [ re.sub(r"\s+"," ","".join(e.strings)).strip() for e in descriptions ]

    createds = sel_created
    if createds:
        createds = soup.select(createds)
        if attr_created:
            createds = [ getAttr(e, attr_created) for e in createds ]
        else:
            createds = [ "".join(e.strings).strip() for e in createds ]
    else:
        createds = [ created_feed for i in titles ]

    updateds = sel_updated
    if updateds:
        updateds = soup.select(updateds)
        if attr_updated:
            updateds = [ getAttr(e, attr_updated) for e in updateds ]
        else:
            updateds = [ "".join(e.strings).strip() for e in updateds ]
    else:
        updateds = [ created_feed for i in titles ]

    print(titles)
    print(links)
    print(descriptions)
    print(createds)
    print(updateds)

if __name__ == '__main__':
    urlin = createRequestLink() #for testing

    params = dict(getParams(urlin))
    url = params["url"]

    fname = "tmp.html"

    html = None
    if os.path.isfile(fname):
        with open(fname, "rb") as file:
            html = file.read()
    if html is None:
        html = urllib.request.urlopen(url).read()
        with open(fname, "wb") as file:
            file.write(html)

    rssgen(params, html)

taroved commented 6 years ago

Thank you for the response, but I'm not sure I understand what this snippet is doing.

Xyrio commented 6 years ago

It extracts data from a website using CSS selectors. The prints show the data, which in a further step can be mapped to an RSS feed structure (see the sketch below). No database required.

Other than that, I'm using list comprehensions for less code.
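
For that mapping step, a rough sketch that zips those lists into RSS 2.0 items (dummy data in the shape rssgen() prints; a real feed would also convert the dates to RFC 822 format, which this skips):

# zip the extracted lists into <item> elements; guid is set to the link,
# as suggested earlier in this thread
from xml.sax.saxutils import escape

def to_rss(feed_title, feed_link, titles, links, descriptions, createds, updateds):
    parts = ['<rss version="2.0"><channel>',
             '<title>%s</title><link>%s</link><description></description>'
             % (escape(feed_title), escape(feed_link))]
    for title, link, desc, created, updated in zip(titles, links, descriptions, createds, updateds):
        parts.append('<item><guid>%s</guid><title>%s</title><link>%s</link>'
                     '<description>%s</description><pubDate>%s</pubDate></item>'
                     % (escape(link), escape(title), escape(link),
                        escape(desc or ""), escape(updated or created or "")))
    parts.append('</channel></rss>')
    return "\n".join(parts)

print(to_rss("test", "https://github.com/taroved/pol/issues",
             ["one.title"], ["http://one.link"], ["one description"],
             ["2017-06-05 14:03"], ["2017-06-05 14:03"]))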

taroved commented 6 years ago

This code is too far from a solution that can be integrated into this project (PolitePol). But thank you for participating in the life of this project; it's very valuable for its future.

taroved commented 6 years ago

I extended the configuration of MySQL. This is not a databaseless solution, but maybe it is a good one. Check out the README for information.