urlstechie / urlchecker-analysis

An automated analysis to assess url correctness in a subset of repositories (under development)

List of projects that are good candidates for study #1

Open vsoch opened 4 years ago

vsoch commented 4 years ago

From issue urlstechie/urlchecker-python#13:

But here is an idea - what about some kind of fun project where we involve the research community, but do our own small analysis, and then that serves as a writeup that we can share on social media to encourage folks to use the checker? What I'd want to do is assemble a list of documentation served from repositories that belong to research-oriented projects or groups, and then programmatically run the checker on all of them to calculate the total number of links, the number of broken links, etc. Actually, if we add the ability to specify an output file, we could even put the entire results into a data repository, include the scripts for running, and heck, if we can make an argument for a research tool for documentation (that has shown purpose and we have some hypothesis / conclusions about links) we might even have enough to at least submit a paper to JOSS (and then to arXiv if it's rejected). What do you think? So my thinking for moving forward - after results can be saved to file:

This sounds like fun! I'm totally willing to take on the bulk of work stated above, I haven't done a little fun project like this in a while. Let me know your thoughts!


SuperKogito commented 4 years ago
vsoch commented 4 years ago

I am not sure if I understand your idea of an automated CI job to test over longer periods of time? Isn't that the job of the GitHub action?

@SuperKogito let's say that we have a list of repos - we would have some repository, let's call it "urlchecker-analysis" that uses the GitHub action:

So you can imagine we would have a results structure something like this:

# urlchecker-analysis
results/
    repo-checked-1     # this might be the research meeting list repo, for example
        results-<date-1>.csv
        results-<date-2>.csv
        ...
    repo-checked-2
    ...
    repo-checked-n
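
As a rough sketch of that layout, the paths could be built like this in Python. The helper names `result_path` and `list_runs`, the ISO date format, and flattening `owner/name` into a single directory name are all assumptions for illustration, not something decided in this thread.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path


def result_path(root: Path, repo: str, run_date: date) -> Path:
    """Return <root>/results/<repo>/results-<YYYY-MM-DD>.csv."""
    # Assumption: "owner/name" is flattened to "owner-name" so that each
    # checked repository gets exactly one directory under results/.
    folder = repo.replace("/", "-")
    return root / "results" / folder / f"results-{run_date.isoformat()}.csv"


def list_runs(root: Path, repo: str) -> list[Path]:
    """List the saved result files for one repository, oldest first."""
    folder = root / "results" / repo.replace("/", "-")
    # ISO dates sort chronologically when sorted as plain strings.
    return sorted(folder.glob("results-*.csv"))
```

Keeping one file per run, rather than appending to a single file, would make it easy to compare any two dates later.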

And then you can imagine having an analysis script that can be run over any specific repository checked, and say things like "The percentage of urls broken on average is... the change from week to week is..." and more importantly, if we get enough repos, we might even be able to say things in a larger sense like "We found repos associated with this domain, or repos that were updated only this many times, had significantly more broken links." And of course that requires having metadata about the repos, which is something else we can get from the GitHub API, etc. But that's a later step, we can focus on first:

  1. collecting a list of urls
  2. creating the urlchecker-analysis with a GitHub workflow to use urlchecker-action, once per repository, to do the checks
  3. and then automating it to run weekly to save results

And then we can play around with developing the analysis bit once there is a tiny bit of data. I suspect that most repos won't have huge changes from day to day, which is why I'm thinking a monthly rate might also be a good start.
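
To make the analysis bit a little more concrete, here is a minimal Python sketch of the "percentage of urls broken" and "change between runs" numbers. It assumes each results file is a CSV with `URL` and `RESULT` columns, where `RESULT` is `passed` or `failed`. That file format and the function names are hypothetical, since saving results to file does not exist yet at this point in the discussion.

```python
from __future__ import annotations

import csv
from pathlib import Path


def broken_percentage(csv_file: Path) -> float:
    """Percentage of checked URLs marked as failed in one results file."""
    with csv_file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    # Assumed columns: "URL" and "RESULT" ("passed" or "failed").
    failed = sum(1 for row in rows if row["RESULT"].strip().lower() == "failed")
    return 100.0 * failed / len(rows)


def run_to_run_change(csv_files: list[Path]) -> list[float]:
    """Difference in broken percentage between consecutive runs."""
    # Sorting by path puts dated files from one directory in run order.
    rates = [broken_percentage(path) for path in sorted(csv_files)]
    return [later - earlier for earlier, later in zip(rates, rates[1:])]
```

Averaging `broken_percentage` over all files of a repository would give the "on average" figure, and repository metadata could be joined in later.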

And then once we have this analysis, we can write it up, make pretty plots, and give good reason to do the checks in the first place!

vsoch commented 4 years ago

For the badges - definitely give it a go! Please again open feature branches for review first. I've made custom badges (I think with shields.io?) Here are a few purple ones I designed for the needs-love project :) https://github.com/rseng/needs-love

SuperKogito commented 4 years ago

urlchecker-analysis, I love it. The whole concept, that's a genius idea <3 I will see which repository urls we can use :)

vsoch commented 4 years ago

awesome! If you want to put together a first shot at a list, I can put together the skeleton of the repo (I've already thought about it a bit).

SuperKogito commented 4 years ago

Go ahead with the repo and I will add a list to it? Or maybe it's better to put it here? I will try to have it ready by tomorrow at the latest.

vsoch commented 4 years ago

Just put it here since we have the nice issue :)

vsoch commented 4 years ago

Actually even better - I can make the repo and transfer the issue! <3

vsoch commented 4 years ago

Done!

SuperKogito commented 4 years ago

So after searching a bit and checking some projects, I came up with the list below. The projects were not chosen according to any specific criteria. I just tried to diversify the repositories (Python, JS, HTML), but the list is still missing other languages (C, C++, etc.). I also tried to include projects that are currently maintained and contain many links.

Python projects

JS projects and websites

This is a list of various active projects of interest with many links.

Curated list repositories are very interesting for us

This will help us test urlchecker with .md files

Academic projects and courses

HTML documentation

SuperKogito commented 4 years ago

Let me know what you think of it and which ones we should add ;)

vsoch commented 4 years ago

These are great! I don't see why we shouldn't add all of them? It's a very nice range of types of repos.

vsoch commented 4 years ago

I need to finish up working on an API, but after that I should be able to put some time into this! If not today, definitely this week.