Open michael-spengler opened 3 years ago
scrapy.org
e.g. for the Song Recommender...
put the data in a private repo --> clearly emphasize the educational aspect = disclaimer.
IPFS https://ipfs.io/
holistic AI based Web Crawler for different platforms.
how to mitigate IP-based request blocking --> FaaS
using xpath as locator might be appropriate for "prepared" platforms: https://codecept.io/locators/#css-and-xpath
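A minimal sketch of the XPath-as-locator idea in Python, assuming a "prepared" platform that serves well-formed, stably structured markup. The page snippet, the `results` id, and the `hit` class are made up for illustration, and the standard library's `xml.etree.ElementTree` only supports a limited XPath subset, so a real crawler would likely use a fuller HTML/XPath library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a "prepared" platform page.
PAGE = """
<html>
  <body>
    <div id="results">
      <article class="hit"><h2>Web harvesting basics</h2><a href="/a/1">read</a></article>
      <article class="hit"><h2>NLP for crawlers</h2><a href="/a/2">read</a></article>
      <article class="ad"><h2>Sponsored</h2><a href="/ad/9">read</a></article>
    </div>
  </body>
</html>
"""


def extract_hits(markup):
    """Return (title, link) pairs for every result article, skipping ads."""
    root = ET.fromstring(markup.strip())
    hits = []
    # XPath locator: any <div id="results">, then its <article class="hit"> children.
    for article in root.findall(".//div[@id='results']/article[@class='hit']"):
        title = article.findtext("h2", default="").strip()
        link = article.find("a").get("href")
        hits.append((title, link))
    return hits


if __name__ == "__main__":
    for title, link in extract_hits(PAGE):
        print(title, link)
```

The locator breaks as soon as the platform changes its structure, which is why it mainly suits pages whose layout is known and stable.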
for those who like python based approaches more you might consider: https://www.youtube.com/watch?v=XVv6mJpFOb0
@michael-spengler we are currently planning our project and are faced with the question of complexity. Our plan is to crawl pages with scientific content and to file their headings and content, for use in later projects (for example, writing scientific reports). We limit ourselves to a few specified websites. The aim of the project is that a user enters a search text in an input form; the crawler then searches the specified websites for matching data and displays the results in the same UI.
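The plan above could be sketched roughly as follows. This is only an illustration under assumptions, not the group's implementation: fetching the pages is left out, the sample HTML is invented, and the names `HeadingContentParser` and `search_sections` are hypothetical.

```python
from html.parser import HTMLParser


class HeadingContentParser(HTMLParser):
    """Files a page as a list of [heading, content] sections."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []        # each entry: [heading text, content text]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # A new heading starts a new section.
            self._in_heading = True
            self.sections.append(["", ""])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first heading
        index = 0 if self._in_heading else 1
        current = self.sections[-1]
        current[index] = (current[index] + " " + text).strip()


def search_sections(html, query):
    """Return the (heading, content) sections that contain the search text."""
    parser = HeadingContentParser()
    parser.feed(html)
    parser.close()
    needle = query.lower()
    return [
        (heading, content)
        for heading, content in parser.sections
        if needle in heading.lower() or needle in content.lower()
    ]


# Invented sample page; a real run would download pages from the specified websites.
SAMPLE = """
<h1>Graph neural networks</h1><p>Message passing on graphs.</p>
<h2>Transformers</h2><p>Attention based sequence models.</p>
"""

if __name__ == "__main__":
    print(search_sections(SAMPLE, "attention"))
```

Restricting the crawl to a few known websites keeps the parser simple, because the heading structure of each site can be checked by hand first.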
@SimonScapan I find this a pretty interesting approach - combining NLP (NLU) with web harvesting... There are two students from WWI18DSA who might have developed something which might help you. They want to present it to me on Friday.
Feel free to create a corresponding Telegram Group and share its invitation link with me so I can connect you if you wish.
quality metrics for sources / articles ... hidden assumptions decreasing the score...
This is a private repo ... but I've invited you as a collaborator :)
example for gh action with write access to repo files.
UPDATE 15.06.2021:
Foodpath over Google Search API implemented
Google Scholar crawling over proxy implemented
Website can be started by running api.py
Pending: Reset table on multiple inputs
On Friday:
UPDATE 17.06.2021:
@michael-spengler our work is now complete. If you are staying inside on these hot days like we are, you may have some time to have a look at the project. We will present the results to you on Friday. If small changes are needed we will make them, but all in all the project is complete and we are very happy with the results.
Excursus on Python in the browser:
long term backend code deployment - e.g. on hetzner.de server
Long Term Feature Enhancement Proposal: Scientific Purpose Harvester + Free text based NLP training --> Q & A Pair generation --> FFC Content Generation.
Simplification of literature research.
Use of scraper API
harvesting on https://scholar.google.de/ seems like a great idea
bibme.org
very valuable feature: how often was each source cited?... --> future: in which papers was each source cited... / and where were these papers in turn published --> as additional quality indicators...
network display who cited what? ....
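One possible way to model the "who cited what" idea is a small citation graph. The sketch below is an assumption-laden illustration: the paper and source names are placeholders, not harvested data, and the helpers `build_cited_by` and `citation_counts` are hypothetical.

```python
from collections import defaultdict

# Placeholder (citing paper, cited source) pairs; real pairs would come from the harvester.
CITATIONS = [
    ("paper_a", "source_x"),
    ("paper_b", "source_x"),
    ("paper_b", "source_y"),
    ("paper_c", "source_x"),
]


def build_cited_by(pairs):
    """Map each cited source to the set of papers that cite it."""
    cited_by = defaultdict(set)
    for citing, cited in pairs:
        cited_by[cited].add(citing)
    return cited_by


def citation_counts(cited_by):
    """Return (source, number of citing papers), most cited first."""
    counts = [(source, len(citing)) for source, citing in cited_by.items()]
    # Sort by descending count, then by name so ties are stable.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    graph = build_cited_by(CITATIONS)
    for source, count in citation_counts(graph):
        print(source, count, sorted(graph[source]))
```

The same mapping can feed a network display: every key is a node, and every member of its set is an incoming edge.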
Feedback from 25.6.21
Top: Still a top use case...
Optimize:
Optimize README.md...
Long Term Deployment (frontend via svelte --> github pages based deployment)
Frontend implemented via Svelte! :D
long term availability: http://85.214.28.167:5001/
Changes committed to repo:
Group: Data Magic (Andreas Bernrieder, Simon Scapan, Jan Brebeck, Thorsten Hilbradt, Niklas Wichter)