Based on what we discussed on the Discord call, I will be looking more into dataset improvements. The scraper already works well; as mentioned in the TODOs, it just needs to be expanded. This also relates to other tasks, like the non-LangChain library we want to use for this repo, which will help with the filtering tasks below.
Some thoughts as well: is it better for this focus to enrich our arxiv dataset first (filtering, search, etc.), or to expand our dataset as much as possible (e.g. with non-arxiv papers)?
I'm open to any ideas regarding the direction to take with the dataset section.
These are the tasks I will start looking into:
Taken from the main page:
[ ] Need to write a crawler/scraper to get data directly from arxiv (@mnm-matin will push a small prototype script)
[ ] Need to be able to search/filter for AI Safety papers on arxiv
[ ] Download papers and extract the abstract and any relevant tags from them
[ ] Cron job (or GitHub Actions) to download newly published papers
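For the search/filter and abstract-extraction tasks above, a minimal sketch of what querying arxiv could look like, assuming we use the public arxiv Atom API (`http://export.arxiv.org/api/query`). The function names and the keyword-based filter are hypothetical placeholders, not part of the existing scraper:

```python
# Sketch: query the public arxiv Atom API for papers matching search terms,
# then pull out id/title/abstract from the returned feed. Function names
# (build_query_url, parse_entries) are illustrative placeholders.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed


def build_query_url(terms, max_results=25):
    """Build an arxiv API URL that searches all fields for every term."""
    query = " AND ".join(f'all:"{t}"' for t in terms)
    params = {"search_query": query, "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(atom_xml):
    """Extract id, title, and abstract from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            # Normalize whitespace: arxiv titles/abstracts contain newlines.
            "id": (entry.findtext(f"{ATOM}id") or "").strip(),
            "title": " ".join((entry.findtext(f"{ATOM}title") or "").split()),
            "abstract": " ".join((entry.findtext(f"{ATOM}summary") or "").split()),
        })
    return papers
```

The actual fetch (e.g. `urllib.request.urlopen(build_query_url(["AI safety"]))`) is left out so the parsing logic stays testable offline; the same `parse_entries` step could feed the keyword filtering discussed above.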
Best, Tobi