Open imwatsi opened 5 years ago
Hi @imwatsi, there seems to be an issue with the Wikipedia links not using common names when needed (e.g. Sputnik 1), so quite a few of the links are missing. Also, I believe the code hung on 43137? There are other problems where it can't find the title and line 70 throws an error.
Could you take one last look and see if you can resolve?
Appreciate the work so far!
-MainNebula
Hi @mainnebula
Okay, I will debug. In the meantime, could you assist in adding me to the bounties? I would also appreciate a review of the categorize.py script. Did you run into any errors while running it? Or is there something you want changed?
Thanks.
@imwatsi you should be able to add yourself to it now:
Check out this bounty that pays out 1.5 ETH https://gitcoin.co/issue/bWFpbm5lYnVsYVg5NmdSQVZ2d3g1MnVTNnc0UVlDVUhSZlIzT2FvQjE2NjAy #python #sql #backend
I'm running the code through the entire list to take note of the errors above. I'll update you on my progress when I'm done resolving the issues.
@mainnebula Here's an improved version...
About that line 70 error that occurred when it couldn't find a title: I put a print statement for the exception so that we can get more details if the error occurs again. I think the changes I made prevent that error from coming up, though.
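For reference, the guard looks roughly like the sketch below. The function and variable names (get_wikipedia_title, sat_id) are placeholders for illustration, not the exact identifiers in the script.

import requests
from bs4 import BeautifulSoup

def get_wikipedia_title(url, sat_id):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        title_tag = soup.find("h1", id="firstHeading")
        if title_tag is None:
            raise ValueError("no title element found on page")
        return title_tag.get_text(strip=True)
    except Exception as exc:
        # Print the full exception so the failure can be traced if it happens again
        print(f"Title lookup failed for {sat_id} at {url}: {exc!r}")
        return None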
Hi @imwatsi,
2,404 of the generated links route to: https://en.wikipedia.org/wiki/Deb
Do you know why? Other than that it looks good.
Thanks!
Hi @mainnebula
FIXED.
I isolated specific entries that had this issue and saw that the links were generated because of a faulty method used to extract the satellite name and id_number. It would take "DEB" as the satellite name, and sometimes "NEEDLES", which also sent links to https://en.wikipedia.org/wiki/Needle. This happened in entries that had no flag values, e.g. *D or D. It was an offset error.
My apologies for this.
I replaced it with a simpler, more reliable method. You can try it out now.
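To illustrate the kind of offset error this was (the rows and column positions below are made up, not the actual catalogue layout): joining whitespace-split tokens from a fixed index only works when the optional flag column is present, so a word like "DEB" from the object name gets read as the whole name when the flag is empty. Slicing the name from fixed character positions avoids the shift.

# Hypothetical rows; the real catalogue layout differs from this sketch.
row_with_flag    = "1993-036A   22675 *D  EXAMPLE SAT 1"
row_without_flag = "1975-052C    7946     SL-8 DEB"

# Faulty approach: joining tokens from a fixed index only works when the flag token exists.
" ".join(row_with_flag.split()[3:])     # -> "EXAMPLE SAT 1"
" ".join(row_without_flag.split()[3:])  # -> "DEB"  (wrong: empty flag column shifted the tokens)

# Simpler approach: slice the name from fixed character positions instead.
row_with_flag[22:].strip()              # -> "EXAMPLE SAT 1"
row_without_flag[22:].strip()           # -> "SL-8 DEB"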
Any other issues/modifications for this script? @mainnebula
Issue #1
My script scrapes and validates links to the first three sources stated in the issue. Where a valid link is found, an entry is made and saved to DB.
I could not find a methodical approach to extracting links from the 4th source, because there wasn't a consistent way to look up objects.
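To show roughly what "validates" means here: a candidate URL is only saved if the request resolves successfully. The function names and the Wikipedia URL pattern below are illustrative, not the script's exact code.

import requests

def validate_link(url):
    # A link counts as valid only if the request succeeds with HTTP 200.
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def wikipedia_link_for(name):
    # Illustrative candidate URL; the real script builds and checks links for 3 sources.
    candidate = "https://en.wikipedia.org/wiki/" + name.replace(" ", "_")
    return candidate if validate_link(candidate) else None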
Configuration
SQL_CONFIG
RATE_LIMIT_DELAY is in seconds; it's a pause at each initiation of a new thread that makes 3 HTTP calls to the 3 sources. A RATE_LIMIT_DELAY of 0.1 means only 10 new threads are created per second, with a total of 30 different HTTP calls (see the sketch below).
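A minimal sketch of how that pacing works, assuming one worker thread per satellite; fetch_links_for and crawl are placeholder names, not the script's actual functions.

import threading
import time

RATE_LIMIT_DELAY = 0.1  # seconds to wait before starting each new thread

def fetch_links_for(sat):
    # Placeholder for the worker that makes the 3 HTTP calls for one satellite.
    pass

def crawl(satellites):
    threads = []
    for sat in satellites:
        t = threading.Thread(target=fetch_links_for, args=(sat,))
        t.start()
        threads.append(t)
        time.sleep(RATE_LIMIT_DELAY)  # at 0.1 s, thread creation is capped at ~10 per second
    for t in threads:
        t.join()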
Dependencies
pip install mysql-connector-python
pip install beautifulsoup4
pip install requests
DB Schema
satellites
links
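The column details aren't listed above, so the sketch below is only a guess at the shape of the two tables, created through mysql-connector as in the Dependencies section; the actual column names and types in satellites and links may differ.

import mysql.connector

# Hypothetical schema; the real satellites and links tables may use different columns.
SCHEMA = [
    """CREATE TABLE IF NOT EXISTS satellites (
           norad_id INT PRIMARY KEY,
           name     VARCHAR(255)
       )""",
    """CREATE TABLE IF NOT EXISTS links (
           norad_id INT,
           source   VARCHAR(64),
           url      VARCHAR(512),
           FOREIGN KEY (norad_id) REFERENCES satellites (norad_id)
       )""",
]

def create_tables(sql_config):
    # sql_config would be the SQL_CONFIG dict from the Configuration section.
    conn = mysql.connector.connect(**sql_config)
    cur = conn.cursor()
    for stmt in SCHEMA:
        cur.execute(stmt)
    conn.commit()
    conn.close()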