Open imwatsi opened 5 years ago
Hi @imwatsi, there seems to be an issue with the Wikipedia links not using common names when needed (e.g. Sputnik 1), so quite a few of the links are missing. Also, I believe the code hung on 43137? There are other problems where it can't find the title and line 70 throws an error.
Could you take one last look and see if you can resolve?
Appreciate the work so far!
-MainNebula
Hi @mainnebula
Okay, I will debug. In the meantime, could you assist in adding me to the bounties? I would also appreciate a review of the categorize.py script. Did you run into any errors while running it? Or is there something you want changed?
Thanks.
@imwatsi you should be able to add yourself to it now:
Check out this bounty that pays out 1.5 ETH https://gitcoin.co/issue/bWFpbm5lYnVsYVg5NmdSQVZ2d3g1MnVTNnc0UVlDVUhSZlIzT2FvQjE2NjAy #python #sql #backend
I'm running the code through the entire list to take note of the errors above. I'll update you on my progress when I'm done resolving the issues.
@mainnebula Here's an improved version...
About that line 70 error that occurred when it couldn't find a title: I put a print statement for the exception so that we can get more details if the error occurs again. I think the changes I made prevent that error from coming up, though.
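For reference, the guard looks roughly like the sketch below. The function and variable names (get_wikipedia_title, sat_id) are placeholders for illustration, not the exact identifiers in the script.

import requests
from bs4 import BeautifulSoup

def get_wikipedia_title(url, sat_id):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        title_tag = soup.find("h1", id="firstHeading")
        if title_tag is None:
            raise ValueError("no title element found on page")
        return title_tag.get_text(strip=True)
    except Exception as exc:
        # Print the full exception so the failure can be traced if it happens again
        print(f"Title lookup failed for {sat_id} at {url}: {exc!r}")
        return None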
Hi @imwatsi,
2,404 of the generated links route to: https://en.wikipedia.org/wiki/Deb
Do you know why? Other than that it looks good.
Thanks!
Hi @mainnebula
FIXED.
I isolated specific entries that had this issue and saw that the links were generated because of a faulty method used to extract the satellite name and id_number. It would take "DEB" as the satellite name, and sometimes "NEEDLES", which also sent links to https://en.wikipedia.org/wiki/Needle. This happened in entries that had no flag values, e.g. *D or D. It was an offset error.
My apologies for this.
I replaced it with a simpler, more reliable method. You can try it out now.
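To illustrate the kind of offset error this was (the rows and column positions below are made up, not the actual catalogue layout): joining whitespace-split tokens from a fixed index only works when the optional flag column is present, so a word like "DEB" from the object name gets read as the whole name when the flag is empty. Slicing the name from fixed character positions avoids the shift.

# Hypothetical rows; the real catalogue layout differs from this sketch.
row_with_flag    = "1993-036A   22675 *D  EXAMPLE SAT 1"
row_without_flag = "1975-052C    7946     SL-8 DEB"

# Faulty approach: joining tokens from a fixed index only works when the flag token exists.
" ".join(row_with_flag.split()[3:])     # -> "EXAMPLE SAT 1"
" ".join(row_without_flag.split()[3:])  # -> "DEB"  (wrong: empty flag column shifted the tokens)

# Simpler approach: slice the name from fixed character positions instead.
row_with_flag[22:].strip()              # -> "EXAMPLE SAT 1"
row_without_flag[22:].strip()           # -> "SL-8 DEB"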
Any other issues/modifications for this script? @mainnebula
Issue #1
My script scrapes and validates links to the first three sources stated in the issue. Where a valid link is found, an entry is made and saved to DB.
I could not find a methodical approach to extracting links from the 4th source, because there wasn't a consistent way to look up objects.
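To show roughly what "validates" means here: a candidate URL is only saved if the request resolves successfully. The function names and the Wikipedia URL pattern below are illustrative, not the script's exact code.

import requests

def validate_link(url):
    # A link counts as valid only if the request succeeds with HTTP 200.
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def wikipedia_link_for(name):
    # Illustrative candidate URL; the real script builds and checks links for 3 sources.
    candidate = "https://en.wikipedia.org/wiki/" + name.replace(" ", "_")
    return candidate if validate_link(candidate) else None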
Configuration
SQL_CONFIG
RATE_LIMIT_DELAY is in seconds; it's a pause at each initiation of a new thread that makes 3 HTTP calls to the 3 sources. A RATE_LIMIT_DELAY of 0.1 means only 10 new threads are created per second, with a total of 30 different HTTP calls (see the sketch below).
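A minimal sketch of how that pacing works, assuming one worker thread per satellite; fetch_links_for and crawl are placeholder names, not the script's actual functions.

import threading
import time

RATE_LIMIT_DELAY = 0.1  # seconds to wait before starting each new thread

def fetch_links_for(sat):
    # Placeholder for the worker that makes the 3 HTTP calls for one satellite.
    pass

def crawl(satellites):
    threads = []
    for sat in satellites:
        t = threading.Thread(target=fetch_links_for, args=(sat,))
        t.start()
        threads.append(t)
        time.sleep(RATE_LIMIT_DELAY)  # at 0.1 s, thread creation is capped at ~10 per second
    for t in threads:
        t.join()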
Dependencies
pip install mysql-connector-python
pip install beautifulsoup4
pip install requests
DB Schema
satellites
links
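The column details aren't listed above, so the sketch below is only a guess at the shape of the two tables, created through mysql-connector as in the Dependencies section; the actual column names and types in satellites and links may differ.

import mysql.connector

# Hypothetical schema; the real satellites and links tables may use different columns.
SCHEMA = [
    """CREATE TABLE IF NOT EXISTS satellites (
           norad_id INT PRIMARY KEY,
           name     VARCHAR(255)
       )""",
    """CREATE TABLE IF NOT EXISTS links (
           norad_id INT,
           source   VARCHAR(64),
           url      VARCHAR(512),
           FOREIGN KEY (norad_id) REFERENCES satellites (norad_id)
       )""",
]

def create_tables(sql_config):
    # sql_config would be the SQL_CONFIG dict from the Configuration section.
    conn = mysql.connector.connect(**sql_config)
    cur = conn.cursor()
    for stmt in SCHEMA:
        cur.execute(stmt)
    conn.commit()
    conn.close()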