openaustralia / theyvoteforyou

Making parliamentary voting information accessible, understandable, and easy to use so that you can hold your elected representatives to account.
https://theyvoteforyou.org.au/

Link validation rake task #1309

Closed MutazAshhab closed 2 years ago

MutazAshhab commented 2 years ago

Basic proof of concept to check for old php links. This PR targets issue #1304.

It's currently implemented to search for URLs using a regex rather than redcarpet, the markdown parser.

Tested locally; it's able to find URLs containing php in them.

Current output of this rake task:

[screenshot of rake task output]

Note that https://theyvoteforyou.org.au/divisions/senate/2014-02-13/1 does not show this issue because I ran the task locally without production data.

mlandauer commented 2 years ago

@Taz17 thanks for looking at this.

Perhaps I didn't make it clear enough in the original issue #1304, but I want the links to be validated by making actual web requests to them to see whether they work or error. So I would expect the validator to extract each URL and make a web request to it. If it returns a 200 status (success), the link is good. If it redirects, the redirect should be followed and the resulting URL tested.

Also, where the same link comes up multiple times (in different divisions, for example), I want only one web request to be made. Some links are used many, many times, and I don't want to generate an unnecessary burst of traffic from us.
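A deduplicating checker along these lines could look like the following. This is a minimal sketch, not code from this PR; `LinkChecker` and its method names are illustrative, and the redirect limit of 5 is an arbitrary assumption.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: caches the result of checking each URL, so a link
# that appears in many divisions is only ever requested once.
class LinkChecker
  def initialize
    @results = {}
  end

  # Returns true if the URL (following redirects) eventually responds
  # with 200 OK. Results are memoised per URL.
  def working?(url)
    @results.fetch(url) { @results[url] = check(url) }
  end

  private

  def check(url, limit = 5)
    return false if limit.zero? # give up on redirect loops

    response = Net::HTTP.get_response(URI(url))
    case response
    when Net::HTTPSuccess     then true
    when Net::HTTPRedirection then check(response["location"], limit - 1)
    else false
    end
  end
end
```

Memoising in a plain hash keeps the "one request per unique URL" guarantee without any extra infrastructure, which seems in keeping with a rake task.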

Using the markdown renderer is necessary because these formats can be very flaky to parse, and regular expressions should be avoided for this unless really necessary. See for example https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

What I suggest doing instead is using the markdown renderer to convert the text to html and then parsing the resulting html using nokogiri. Then you can search for links using nokogiri and extract the urls. That way of doing things should be very reliable and easy to understand. It will be somewhat computationally wasteful, but I wouldn't worry about that much here. Clarity, reliability and maintainability trump other things.

mlandauer commented 2 years ago

@Taz17 in my eagerness to give you feedback and keep you moving I didn't say a couple of important things.

Firstly, thank you so much for diving in there and getting something up and running quickly.

Secondly, I also really like that you made a "draft" PR to share your work in progress even though you knew it probably wasn't complete. That was so helpful because I could give you some big-picture feedback quickly, and it gave me a really good idea of where you're up to. So thank you for that!

MutazAshhab commented 2 years ago

Run bundle exec rake application:links_valid:divisions to check link validity in division summaries.

Output is only produced when a broken link is found.

Output looks like this:

[screenshot of rake task output]

To be done:

  1. Split the URL into domain and path, since Net::HTTP.get is currently only requesting the homepage of each domain.
  2. Check for redirection and follow the redirected URL.
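For both to-do items, Net::HTTP.get_response may help: it accepts a full URI (so the path is requested, not just the host) and returns the response object, which makes redirects detectable and followable. A minimal sketch, with `final_response` as an illustrative name and a redirect limit of 5 as an assumption:

```ruby
require "net/http"
require "uri"

# Request a URL and follow redirects until a non-redirect response is
# reached, returning that final response.
def final_response(url, limit = 5)
  raise "too many redirects for #{url}" if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  if response.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the
    # original URL before following it.
    final_response(URI.join(url, response["location"]).to_s, limit - 1)
  else
    response
  end
end
```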

MutazAshhab commented 2 years ago

As of the most recent commits, the rake task now checks for all broken links in the markdown found in division descriptions. Only one request is made per URL, even if it appears more than once.

mlandauer commented 2 years ago

@Taz17 before I look through your changes, can you make sure that all the CI tests pass? I can see there are some failures caused by rubocop, which is a ruby style / best practices checker. You can run it locally with bundle exec rubocop.

MutazAshhab commented 2 years ago

@mlandauer Ah yes, it was rubocop that made the build fail. You should be able to take a look at it now.

mlandauer commented 2 years ago

@Taz17 is this ready to review again?

MutazAshhab commented 2 years ago

@mlandauer Yes!