Until you press the "Create" button (to create a feed), the database is not used. The database is used for saving feed properties.
If you have any efficiency or configuration problems with the database, please let me know.
Put the feed properties into the URL.
You can still track the resulting URL when it is requested on your server if you like, but having a database should be optional.
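Roughly what I mean, as a sketch (the host and parameter names are placeholders, not an existing API):

import urllib.parse

# Sketch: the whole feed definition lives in the URL's query string,
# so the server needs no database row to reproduce the feed.
config = {
    "url": "https://github.com/taroved/pol/issues",  # page to scrape
    "title": "a.link-gray-dark",                     # CSS selector for item titles
    "description": "span.opened-by",                 # CSS selector for item descriptions
}
print("http://politepol.com/feed?" + urllib.parse.urlencode(config))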
No problem with the database, except that I don't want to configure, run, and maintain one.
This concept is interesting, but you will have potential problems with post creation times. Currently every new post has a creation time, and this information is included in the feed. Without a database this timestamp will not be saved, and from time to time users will run into problems related to it: some feed readers use these timestamps to identify a post as new or old, so they may see the same post as new again and again.
I can create a databaseless mode which can be switched on if you don't care about the potential problems with timestamps.
The creation time issue is a good point, but you could put the creation date into the link as a parameter too and just use it from there. This would also allow people to easily manipulate the creation time if needed. Simply use the link as storage instead of the database.
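For example (the "created" parameter name is only an illustration):

import urllib.parse

# Sketch: the creation date travels inside the feed link itself and is read
# back on every request; editing the link edits the creation time.
link = "http://politepol.com/feed?url=https%3A%2F%2Fexample.com&created=2017-06-05T14%3A03%3A00"
params = dict(urllib.parse.parse_qsl(urllib.parse.urlsplit(link).query))
print(params["created"])  # 2017-06-05T14:03:00 -> use as pubDate for every item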
Regarding the update datetime (#3) for the simpler architecture: I would set it by default to be the same as the creation date, unless there is a selector to extract it; when the extracted value is missing or unparsable, fall back to the creation datetime.
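A sketch of that fallback (the helper and the date format are made up for illustration):

from datetime import datetime

def parse_datetime(text):
    # Illustrative parser; a real one would try several formats.
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M")
    except (ValueError, AttributeError):
        return None

def updated_datetime(created, extracted_text=None):
    # No selector configured -> update time equals creation time.
    if extracted_text is None:
        return created
    # Selector configured but value missing/unparsable -> fall back to creation time.
    parsed = parse_datetime(extracted_text)
    return parsed if parsed is not None else created

created = datetime(2017, 6, 5, 14, 3)
print(updated_datetime(created))                      # same as creation time
print(updated_datetime(created, "garbage"))           # falls back to creation time
print(updated_datetime(created, "2017-06-06 09:00"))  # individual update time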
I meant that every post (entry, record) has a creation date and time, not the feed. Every feed has a list of posts. When a new post appears, its creation time is saved to the DB and provided to the user. Should I clarify further?
In any case, this databaseless mode is easy to implement. If the mode turns out to be popular, it will be implemented sooner. Maybe some contributor will find time for it.
Yes, every post gets the same creation date and the same update date by default; only when a CSS selector is specified do you use an individual update date.
I tried it out and it works: it recognizes updates when either pubDate or lastBuildDate changes while the guid stays the same. I would set the guid to be the same as the link for each individual item.
Tested with RSSOwl.
With one item, but it should work the same with many.
I'm using different pubDate and lastBuildDate values in the test to see what ends up in the program (pubDate is preferred when both change):
<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
<guid>http://one.link</guid>
<author>one.autor</author>
<title>one.title</title>
<link>http://one.link</link>
<pubDate>06 Sep 2009 16:20:00 +0000</pubDate>
<lastBuildDate>06 Sep 2010 00:01:00 +0000 </lastBuildDate>
</item>
</channel>
</rss>
<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
<guid>http://one.link</guid>
<author>one.autor change</author>
<title>one.title change</title>
<link>http://one.link</link>
<pubDate>06 Sep 2009 16:20:00 +0000</pubDate>
<lastBuildDate>06 Sep 2010 00:02:00 +0000 </lastBuildDate>
</item>
</channel>
</rss>
<rss version="2.0">
<channel><title>test</title>
<link>http://towatch.notexist</link>
<description></description>
<item>
<guid>http://one.link</guid>
<author>one.autor change</author>
<title>one.title change</title>
<link>http://one.link</link>
<pubDate>06 Sep 2009 16:21:00 +0000</pubDate>
<lastBuildDate>06 Sep 2010 00:01:00 +0000 </lastBuildDate>
</item>
</channel>
</rss>
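Roughly the logic I'd expect a reader to apply to these three snapshots (a sketch, not RSSOwl's actual code):

seen = {}  # guid -> (pubDate, lastBuildDate)

def classify(guid, pub_date, last_build_date):
    # Unknown guid -> new item; known guid with changed dates -> updated item.
    if guid not in seen:
        seen[guid] = (pub_date, last_build_date)
        return "new"
    if seen[guid] != (pub_date, last_build_date):
        seen[guid] = (pub_date, last_build_date)
        return "updated"
    return "unchanged"

print(classify("http://one.link", "06 Sep 2009 16:20:00 +0000", "06 Sep 2010 00:01:00 +0000"))  # new
print(classify("http://one.link", "06 Sep 2009 16:20:00 +0000", "06 Sep 2010 00:02:00 +0000"))  # updated
print(classify("http://one.link", "06 Sep 2009 16:21:00 +0000", "06 Sep 2010 00:01:00 +0000"))  # updated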
If that's not enough to convince you, I give up :)
OK. In any case this mode only makes sense with a full list of fields (pub date and such), so this functionality may not be implemented in the next update.
But if you make a pull request with the required changes, I will have to respect you ;)
You have a great opportunity to get the feature within a week if you really need it.
Just donate: https://www.bountysource.com/issues/52548318-simpler-architecture
Let me know what you think.
Found this for GitHub: https://api.github.com/repos/taroved/pol/issues
I wrote this prototype, feel free to use what you like:
#pip install beautifulsoup4
#(pyquery and lxml have faster selectors than bs4, but bs4 is pure Python)
#https://gist.github.com/MercuryRising/4061368
from bs4 import BeautifulSoup  # parse websites, CSS selectors
import urllib.request
import urllib.parse
import os.path
import re

def createRequestLink():
    # Convention: s* parameters are CSS selectors, a* parameters name a tag
    # attribute to read; without an a* parameter the tag's text is used
    # (for slink, the href attribute).
    paramsDict = {
        "url": "https://github.com/taroved/pol/issues",
        "stitle": "a.link-gray-dark",
        "slink": "a.link-gray-dark",
        "sdescription": "span.opened-by",
        "created-feed": "2017-06-05 14:03",  # creation time of the feed itself
        #"screated": "?",                    # optional selector for per-post creation time
        #"supdated": "relative-time", #"aupdated": "datetime",
    }
    params = urllib.parse.urlencode(paramsDict, quote_via=urllib.parse.quote_plus)
    return "http://foo.bar/?%s" % params

def getParams(urlin):
    parseResult = urllib.parse.urlsplit(urlin)
    print(parseResult)
    return urllib.parse.parse_qsl(parseResult.query)

def rssgen(params, html):
    print(params)
    url = params["url"]
    site = url.split("/")
    site = site[0] + "//" + site[2]  # http://site.com
    soup = BeautifulSoup(html, "html.parser")
    created_feed = params.get("created-feed")
    sel_title = params["stitle"]
    sel_link = params.get("slink")
    sel_description = params.get("sdescription")
    sel_created = params.get("screated")
    sel_updated = params.get("supdated")
    attr_title = params.get("atitle")
    attr_link = params.get("alink")
    attr_description = params.get("adescription")
    attr_created = params.get("acreated")
    attr_updated = params.get("aupdated")

    def getAttr(tag, attr):
        return tag.attrs.get(attr)

    # Titles: attribute value if requested, else whitespace-normalized tag text.
    titles = soup.select(sel_title)
    if attr_title:
        titles = [getAttr(e, attr_title) for e in titles]
    else:
        titles = [re.sub(r"\s+", " ", "".join(e.strings)).strip() for e in titles]

    # Links: fall back to the title selector; make relative hrefs absolute.
    links = sel_link if sel_link else sel_title
    if links:
        links = soup.select(links)
        if attr_link:
            links = [getAttr(e, attr_link) for e in links]
        else:
            links = [e["href"] if e["href"].startswith("http") else site + e["href"] for e in links]

    descriptions = sel_description
    if descriptions:
        descriptions = soup.select(descriptions)
        if attr_description:
            descriptions = [getAttr(e, attr_description) for e in descriptions]
        else:
            descriptions = [re.sub(r"\s+", " ", "".join(e.strings)).strip() for e in descriptions]

    # Per-post creation time: selector if configured, else the feed creation time.
    createds = sel_created
    if createds:
        createds = soup.select(createds)
        if attr_created:
            createds = [getAttr(e, attr_created) for e in createds]
        else:
            createds = ["".join(e.strings).strip() for e in createds]
    else:
        createds = [created_feed for i in titles]

    # Per-post update time: selector if configured, else the feed creation time.
    updateds = sel_updated
    if updateds:
        updateds = soup.select(updateds)
        if attr_updated:
            updateds = [getAttr(e, attr_updated) for e in updateds]
        else:
            updateds = ["".join(e.strings).strip() for e in updateds]
    else:
        updateds = [created_feed for i in titles]

    print(titles)
    print(links)
    print(descriptions)
    print(createds)
    print(updateds)

if __name__ == '__main__':
    urlin = createRequestLink()  # for testing
    params = dict(getParams(urlin))
    url = params["url"]
    fname = "tmp.html"  # cache the downloaded page between test runs
    html = None
    if os.path.isfile(fname):
        with open(fname, "rb") as file:
            html = file.read()
    if html is None:
        html = urllib.request.urlopen(url).read()
        with open(fname, "wb") as file:
            file.write(html)
    rssgen(params, html)
Thank you for the response, but I'm not sure I understand what this snippet is doing.
It extracts data from a website using CSS selectors. The prints show the data, which in another step can be mapped to an RSS feed structure. No database required.
As for the style, I'm using list comprehensions to keep the code short.
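The missing mapping step could look roughly like this (standard library only; assumes rssgen is changed to return the lists instead of printing them):

from xml.sax.saxutils import escape

def build_rss(feed_title, feed_link, titles, links, descriptions, createds):
    # Zip the lists rssgen collects into RSS 2.0 items; guid = link, as discussed.
    parts = ['<rss version="2.0"><channel><title>%s</title><link>%s</link><description></description>'
             % (escape(feed_title), escape(feed_link))]
    for title, link, description, created in zip(titles, links, descriptions, createds):
        parts.append('<item><guid>%s</guid><title>%s</title><link>%s</link>'
                     '<description>%s</description><pubDate>%s</pubDate></item>'
                     % (escape(link), escape(title), escape(link), escape(description), escape(created)))
    parts.append('</channel></rss>')
    return ''.join(parts)

print(build_rss("test", "http://towatch.notexist",
                ["one.title"], ["http://one.link"], ["opened by someone"],
                ["06 Sep 2009 16:20:00 +0000"]))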
This code is too far from a solution that can be integrated into this project (PolitePol), but thank you for participating in the life of the project. It's very valuable for the project's future.
I have extended the MySQL configuration. This is not a databaseless solution, but maybe it is a good one. Check out the README for details.
You could make it work without a database at all. Just put everything into a link like:
http://politepol.com/en/setup?url=https%3A//github.com/rssowl/RSSOwl/issues&title=a.link-gray-dark&description=span.opened-by
The link can then be used in any RSS aggregator to get the RSS, and you save yourself many potential problems with a database.
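A minimal stateless endpoint could look like this (a sketch with Python's http.server; not how PolitePol is actually built, and build_feed_from is a placeholder):

import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_feed_from(params, html):
    # Placeholder: apply the CSS selectors from `params` to `html`
    # (e.g. with the BeautifulSoup prototype above) and emit real RSS.
    return '<rss version="2.0"><channel><title>%s</title></channel></rss>' % params.get("title", "")

class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every request carries the whole feed definition in its query string,
        # so nothing has to be stored between requests.
        params = dict(urllib.parse.parse_qsl(urllib.parse.urlsplit(self.path).query))
        html = urllib.request.urlopen(params["url"]).read()
        body = build_feed_from(params, html).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), FeedHandler).serve_forever()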