Web Harvesting using testing frameworks like CodeceptJS

Phantomias3782 commented 3 years ago

Group: Data Magic (Andreas Bernrieder, Simon Scapan, Jan Brebeck, Thorsten Hilbradt, Niklas Wichter)

michael-spengler commented 3 years ago

https://codecept.io/helpers/Puppeteer/#grabhtmlfrom

michael-spengler commented 3 years ago

scrapy.org

michael-spengler commented 3 years ago

e.g. for the Song Recommender...

michael-spengler commented 3 years ago

Put the data in a private repo --> clearly emphasize the educational aspect = disclaimer.

IPFS https://ipfs.io/

michael-spengler commented 3 years ago

holistic AI based Web Crawler for different platforms.

how to mitigate ip based request blocking --> FaaS

michael-spengler commented 3 years ago

using xpath as locator might be appropriate for "prepared" platforms: https://codecept.io/locators/#css-and-xpath
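
A minimal sketch of the XPath idea in Python, using only the standard library. The page fragment, the `grab_text` helper and the class names are hypothetical; `xml.etree.ElementTree` supports only a limited XPath subset and needs well-formed markup, so this fits "prepared" pages at best and is not how CodeceptJS resolves locators.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (assumption for illustration only).
# Real-world HTML is often not valid XML and would need a tolerant parser.
PAGE = """
<html>
  <body>
    <div class="article">
      <h2>Web Harvesting</h2>
      <p>First paragraph.</p>
    </div>
    <div class="sidebar">
      <h2>Related</h2>
    </div>
  </body>
</html>
"""


def grab_text(markup, xpath):
    """Return the stripped text of every element matching the XPath expression."""
    root = ET.fromstring(markup)
    return [(element.text or "").strip() for element in root.findall(xpath)]


# Select only headings inside the div whose class attribute is "article".
headings = grab_text(PAGE, ".//div[@class='article']/h2")
print(headings)  # ['Web Harvesting']
```

The attribute predicate is what makes XPath attractive here: the same expression keeps working as long as the platform keeps its class names stable.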

michael-spengler commented 3 years ago

for those who like python based approaches more you might consider: https://www.youtube.com/watch?v=XVv6mJpFOb0

SimonScapan commented 3 years ago

@michael-spengler we are currently planning our project and are faced with the question of complexity. Our plan is to crawl pages with scientific content and store their headings and content for use in later projects (for example, writing scientific reports). We limit ourselves to a few specified websites. The aim of the project is to enter a search text in an input form; the crawler then searches the specified websites for matching data and displays the results in the same UI.
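
The planned flow (search text in, matching headings and content out) could be sketched roughly as follows. This is only an illustration under assumptions: `SectionParser`, `search_sections` and the sample HTML are hypothetical names and data, the page is passed in as a string instead of being fetched, and matching is a plain case-insensitive substring test.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collects [heading, content] pairs: each h1-h3 heading plus the paragraphs after it."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []    # list of [heading, content] pairs
        self._target = None   # "heading", "content" or None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", ""])
            self._target = "heading"
        elif tag == "p" and self.sections:
            self._target = "content"

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            if self._target == "content":
                self.sections[-1][1] += " "  # separate consecutive paragraphs
            self._target = None

    def handle_data(self, data):
        if self._target == "heading":
            self.sections[-1][0] += data
        elif self._target == "content":
            self.sections[-1][1] += data


def search_sections(html, query):
    """Return (heading, content) tuples whose heading or content contains the query."""
    parser = SectionParser()
    parser.feed(html)
    needle = query.lower()
    return [
        (heading.strip(), content.strip())
        for heading, content in parser.sections
        if needle in heading.lower() or needle in content.lower()
    ]


# Hypothetical page content standing in for one of the specified websites.
SAMPLE = (
    "<h1>Neural Networks</h1><p>Layers of weighted units.</p>"
    "<h2>Decision Trees</h2><p>Split data by feature thresholds.</p>"
    "<p>Easy to interpret.</p>"
)

results = search_sections(SAMPLE, "trees")
print(results)
# [('Decision Trees', 'Split data by feature thresholds. Easy to interpret.')]
```

Restricting the crawler to a few known websites keeps this kind of parser simple, because the heading and paragraph structure can be checked per site in advance.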

michael-spengler commented 3 years ago

@SimonScapan I find this a pretty interesting approach - combining NLP (NLU) with web harvesting... There are two students from WWI18DSA who might have developed something which might help you. They want to present it to me on Friday.

Feel free to create a corresponding Telegram Group and share its invitation link with me so I can connect you if you wish.

michael-spengler commented 3 years ago

quality metrics for sources / articles ... hidden assumptions decreasing the score...

SimonScapan commented 3 years ago

This is a private repo ... but I've invited you as collaborator :)

https://github.com/SimonScapan/scientific-purpose-harvester

michael-spengler commented 3 years ago

example for gh action with write access to repo files.

https://github.com/cla-assistant/github-action#configure-contributor-license-agreement-within-two-minutes

Phantomias3782 commented 3 years ago

UPDATE 15.06.2021:

Foodpath over Google search API implemented
Google Scholar crawling over proxy implemented
Website can be started by running api.py

Pending: Reset table on multiple inputs

On Friday:

show Results

SimonScapan commented 3 years ago

UPDATE 17.06.2021:

@michael-spengler our work is now complete. If you are spending these hot days inside like us, you may have some time to take a look at the project. We will present the results to you on Friday. If small changes are needed we will fix them, but all in all the project is completed and we are very happy with the results.

michael-spengler commented 3 years ago

Digression regarding Python in the browser:

skulpt.org
brython.info

long term backend code deployment - e.g. on hetzner.de server

michael-spengler commented 3 years ago

Long Term Feature Enhancement Proposal: Scientific Purpose Harvester + Free text based NLP training --> Q & A Pair generation --> FFC Content Generation.

Simplifying literature research.

Using a scraper API for harvesting on https://scholar.google.de/ seems a great idea 👍
bibme.org

michael-spengler commented 3 years ago

Very valuable feature: how often was each source cited? --> future: in which papers was each source cited, and where were those papers published in turn --> as additional quality indicators...

michael-spengler commented 3 years ago

Network display: who cited what? ...
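
A rough sketch of the data behind such a view, assuming the harvester can emit (citing paper, cited source) pairs. All identifiers and pairs below are hypothetical placeholders, not harvested results; the same mapping gives both the "who cited what" edges and the citation count per source.

```python
from collections import defaultdict

# Hypothetical (citing paper, cited source) pairs; a real run would fill
# these from the harvested data.
CITATIONS = [
    ("paper_a", "source_x"),
    ("paper_b", "source_x"),
    ("paper_b", "source_y"),
    ("paper_c", "source_x"),
]


def build_cited_by(pairs):
    """Map each cited source to the set of papers that cite it."""
    cited_by = defaultdict(set)
    for citing, cited in pairs:
        cited_by[cited].add(citing)
    return cited_by


def rank_by_citations(cited_by):
    """Return (source, citation count) sorted by count descending, then by name."""
    counts = ((source, len(citing)) for source, citing in cited_by.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))


cited_by = build_cited_by(CITATIONS)
ranking = rank_by_citations(cited_by)
print(ranking)  # [('source_x', 3), ('source_y', 1)]
```

Storing citing papers as a set means a duplicate pair does not inflate the count, and the mapping can be handed to any graph library for the actual network display.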

michael-spengler commented 3 years ago

Feedback from 25.6.21

Top: Still a top use case...

Optimize: Improve README.md...
Long Term Deployment (frontend via svelte --> github pages based deployment)

Brebeck-Jan commented 3 years ago

Frontend implemented via Svelte! :D

michael-spengler commented 3 years ago

long term availability: http://85.214.28.167:5001/

SimonScapan commented 3 years ago

Changes committed to repo:

Title now "SPH"
All functionalities checked
Project long term availability is stable

michael-spengler / wwi18dsb-semester-6

Web Harvesting using testing frameworks like CodeceptJS #5