urlstechie / urlchecker-python

:snake: :link: Python module and client for checking URLs
https://urlchecker-python.readthedocs.io
MIT License
20 stars 13 forks source link

adding fix for trailing } on urls #43

Closed vsoch closed 4 years ago

vsoch commented 4 years ago

This will fix #42 by way of adding back the original URL regex, to be run across urls individually after they are cleaned for }. The bug is that this cleaning was removed but not added back after the custom parsing. The FINAL_REGEX could be made to include a combination of the domain_extensions, but this is the quickest / easiest solution for now (representing the entire string for the regex).

Signed-off-by: vsoch vsochat@stanford.edu

codecov[bot] commented 4 years ago

Codecov Report

Merging #43 into master will increase coverage by 0.05%. The diff coverage is 84.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   77.67%   77.72%   +0.05%     
==========================================
  Files          20       20              
  Lines         645      651       +6     
==========================================
+ Hits          501      506       +5     
- Misses        144      145       +1     
Impacted Files Coverage Δ
urlchecker/core/check.py 83.33% <0.00%> (ø)
urlchecker/core/fileproc.py 91.30% <85.71%> (+1.06%) :arrow_up:
tests/test_core_urlproc.py 100.00% <100.00%> (ø)
urlchecker/core/urlmarker.py 100.00% <100.00%> (ø)
urlchecker/version.py 100.00% <100.00%> (ø)
urlchecker/core/urlproc.py 92.47% <0.00%> (-1.08%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update af9d7ac...7cb765a. Read the comment docs.

SuperKogito commented 4 years ago

This seems like an okay fix for now but I will need to optimize this as it seems redundant. I couldn't test the regex online so can you provide a link for your test-project so I can check it manually.

vsoch commented 4 years ago

Sure! We can definitely "optimize the string" if you think this is important - I think it's actually okay, since it's all clearly printed there for the developer to see, and we can be sure it's exactly what we had before. The test file I used was a jobs.yml file, here:

- {expires: 2020-06-10, location: 'Pasadena, CA', name: Research Software Engineer
    at Caltech, url: 'https://www.linkedin.com/jobs/view/1689896027/'}
- {expires: 2020-06-10, location: 'Mountain View, CA', name: Research Software Engineer
    at Google, url: 'https://www.linkedin.com/jobs/view/1817277202/'}
- {expires: 2020-06-10, location: 'Atlanta, GA', name: 'Architect, PACE team at Georgia
    Institute of Technology', url: 'https://pace.gatech.edu/xsedeosg-architect'}
- {expires: 2020-06-10, location: 'Atlanta, GA', name: 'Scheduler Architect, PACE
    team at Georgia Institute of Technology', url: 'https://pace.gatech.edu/scheduler-architect'}
- {expires: 2020-06-10, location: 'Socorro, NM', name: Staff Scientist/Software Engineer
    at New Mexico Tech, url: 'https://www.nmt.edu/hr/docs/hr/jobs/StaffSciSoftEngIRIS20-030.pdf'}
- {expires: 2020-04-10, location: 'Boston, MA', name: Senior Research Computing Systems
    Engineer at Boston University, url: 'http://www.bu.edu/tech/support/research/rcs/jobs/'}
- {expires: 2020-03-10, location: Caltech, name: 'Software Engineer/Research Software
    Engineer: CliMA project', url: 'https://clima.caltech.edu/join-our-team/'}
- {expires: 2020-02-15, location: NumFOCUS, name: Research Software Engineering Fellow,
  url: 'https://numfocus.org/blog/now-hiring-matplotlib-research-software-engineering-fellow'}
- {expires: 2020-02-01, location: Pittsburgh Supercomputing Center, name: Research
    Software Specialist, url: 'https://www.psc.edu/about-psc/employment/3116-research-software-specialist-2014428'}
- {expires: 2020-03-01, location: University of Alabama, name: Research Software Engineer,
  url: 'http://carver.cs.ua.edu/RSE.htm'}
- {expires: 2020-02-01, location: University of Illinois at Urbana-Champaign, name: Visiting
    Research Scientist, url: 'https://jobs.illinois.edu/academic-job-board/job-details?jobID=123477&job=visiting-research-scientist-school-of-information-sciences-123477'}
- {expires: 2019-10-01, location: 'Boston, MA', name: Senior Scientific Programmer
    at Boston University, url: 'http://www.bu.edu/tech/support/research/rcs/jobs/'}
- {expires: 2020-06-30, location: 'Cambridge, MA', name: Senior Research Software
    Engineer (Data Engineer), url: 'http://bit.ly/fasrc_senior_rse'}
- {expires: 2019-09-01, location: University Park Campus, name: Software Consultants,
  url: 'https://psu.jobs/job/88890'}
- {expires: 2019-10-01, location: University of Arizona, name: Research Programmer,
  url: 'https://uacareers.com/postings/40792'}
- {expires: 2019-10-01, location: Indiana University Bloomington, name: Research Software
    Engineer / Application Support Engineer, url: 'https://brainlife.io/docs/careers/jobs/'}
- {expires: 2019-10-01, location: Indiana University Bloomington, name: Research Software
    Engineer / UI Engineer, url: 'https://brainlife.io/docs/careers/jobs/'}
- {expires: 2019-10-01, location: Indiana University Bloomington, name: Community
    Developer / Outreach Coordinator, url: 'https://brainlife.io/docs/careers/jobs/'}
- {expires: 2019-10-01, location: Indiana University Bloomington, name: DevOps / Software
    Engineers, url: 'https://brainlife.io/docs/careers/jobs/'}
- {expires: 2019-10-01, location: Indiana University Bloomington, name: Cloud and
    Cluster Administrator, url: 'https://brainlife.io/docs/careers/jobs/'}
- {expires: 2019-11-01, location: Brown University, name: Research Software Engineer/Senior
    Research Software Engineer, url: 'https://brown.wd5.myworkdayjobs.com/en-US/staff-careers-brown/job/180-George-Street/Research-Software-Engineer-Senior-Research-Software-Engineer_REQ162388'}
- {expires: 2020-01-15, location: 'Princess Margaret Cancer Centre, Toronto, ON, Canada',
  name: 'Technical Software Developer, Data and Distributed Computing, CanDIG', url: 'https://www.recruitingsite.com/csbsites/uhncareers/JobDescription.asp?SiteID=10031&JobNumber=852516'}

Then ran like

urlchecker check --files jobs.yml --file-types .yml  .

And something like this should be added to the test repo to ensure it doesn't trigger again (you said you were working on this, or not anymore?)

SuperKogito commented 4 years ago

Sorry it took me some time to test this, I thought I can find a nice hack to fix this by editing urls = [url for url in urls if not re.search("(\\{[a-z0-9.]*})", url)] but it seems that I will need some time to test that better. For now since this is a bug, I think you wrote a nice fix and as you said -we know that this works- so let's use this for now. Concerning the testing, I am still on it. Hopefully this weekend, I will do it right this time. Unfortunately I cannot do much during the week days, since I have a lot to do at work atm and when I come home, I am basically done with coding.

vsoch commented 4 years ago

Thank you for the speedy review! No worries on the tests, I'll link the comment here to the issue so you can easily find the file and copy paste.

I'm going to draft a new release for urlchecker-action now with the fix.