mainnebula / SPACE_TASKS


Verified links generator script #9

Open imwatsi opened 5 years ago

imwatsi commented 5 years ago

Issue #1

My script scrapes and validates links for the first three sources listed in the issue. Where a valid link is found, an entry is created and saved to the DB.
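A minimal sketch of the "validate, then keep" step described above. It is not the submitted script: it uses the standard library's `urllib` instead of `requests` so it runs without extra packages, and the names `http_status` and `validated_links` are hypothetical. Treating only HTTP 200 as "valid" is also an assumption.

```python
from typing import Callable, Dict, Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails."""
    request = Request(url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # Network failure, timeout, or a malformed URL.
        return None


def validated_links(
    candidates: Dict[str, str],
    get_status: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, str]:
    """Keep only the candidate links that answer with HTTP 200 (assumed criterion)."""
    return {source: url for source, url in candidates.items() if get_status(url) == 200}
```

Passing `get_status` as a parameter lets the filter be exercised with a stub instead of live requests.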

  1. Gunters Satellite Page (ordered scraping may not be possible)

I could not find a methodical approach to extract links from the 4th source, because there wasn't a consistent way to look up objects.

Configuration

Dependencies

pip install mysql-connector-python
pip install beautifulsoup4
pip install requests

DB Schema


links
    obj_no (NOT NULL)
    nssdc
    celestrak
    wikipedia
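
For readers who want to try the schema without a MySQL server, here is a rough sketch using the standard library's `sqlite3` as a stand-in. The column types, the primary key on `obj_no`, the `save_links` helper, and the sample row are all assumptions; the schema above only states the column names and that `obj_no` is NOT NULL.

```python
import sqlite3

# Column types and the PRIMARY KEY are assumptions; only the names and
# the NOT NULL constraint on obj_no come from the schema above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    obj_no    INTEGER NOT NULL PRIMARY KEY,
    nssdc     TEXT,
    celestrak TEXT,
    wikipedia TEXT
)
"""


def save_links(conn, obj_no, nssdc=None, celestrak=None, wikipedia=None):
    """Insert or replace the row for one catalogue object (hypothetical helper)."""
    conn.execute(
        "INSERT OR REPLACE INTO links (obj_no, nssdc, celestrak, wikipedia) "
        "VALUES (?, ?, ?, ?)",
        (obj_no, nssdc, celestrak, wikipedia),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Illustrative values only: sources with no valid link stay NULL.
save_links(conn, 2, wikipedia="https://en.wikipedia.org/wiki/Sputnik_1")
print(conn.execute("SELECT * FROM links").fetchall())
```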

mainnebula commented 5 years ago

Hi @imwatsi, there seems to be an issue with Wikipedia links not using common names when needed (e.g. Sputnik 1), so quite a few of the links are missing. Also, I believe the code hung on 43137? There are other cases where it can’t find the title and line 70 throws an error.

Could you take one last look and see if you can resolve?

Appreciate the work so far!

-MainNebula

imwatsi commented 5 years ago

Hi @mainnebula

Okay, I will debug. In the meantime, could you assist in adding me to the bounties? I would also appreciate a review of the categorize.py script. Did you run into any errors while running it? Or is there something you want changed?

Thanks.

mainnebula commented 5 years ago

@imwatsi you should be able to add yourself to it now:

Check out this bounty that pays out 1.5 ETH https://gitcoin.co/issue/bWFpbm5lYnVsYVg5NmdSQVZ2d3g1MnVTNnc0UVlDVUhSZlIzT2FvQjE2NjAy #python #sql #backend

imwatsi commented 5 years ago

I'm running the code over the entire list to take note of the errors above. I'll update you on my progress when I'm done resolving the issues.

imwatsi commented 5 years ago

@mainnebula Here's an improved version...

imwatsi commented 5 years ago

About that line 70 error that occurred when it couldn't find a title: I added a print statement for the exception so that we can get more details if the error occurs again. I think the changes I made prevent that error from coming up, though.
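
As a rough illustration of the idea (not the actual change in the script), a title lookup can report the failing object and return nothing instead of raising. This sketch uses the standard library's `html.parser` rather than BeautifulSoup, and `page_title` and `_TitleParser` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_title(html: str, obj_no: int) -> Optional[str]:
    """Return the page title, or None (with a printed note) if there is none."""
    parser = _TitleParser()
    parser.feed(html)
    title = "".join(parser.title_parts).strip()
    if not title:
        # Report which object failed so the run can continue and be reviewed later.
        print(f"obj_no {obj_no}: no <title> found, skipping")
        return None
    return title
```

The caller can then skip the entry when `page_title` returns `None` rather than letting the whole run stop.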

mainnebula commented 5 years ago

Hi @imwatsi,

2,404 of the generated links route to: https://en.wikipedia.org/wiki/Deb

Do you know why? Other than that it looks good.

Thanks!

imwatsi commented 5 years ago

Hi @mainnebula

FIXED.

I isolated the specific entries that had this issue and saw that the links were generated by a faulty method for extracting the satellite name and id_number. It would take "DEB" as the satellite name, and sometimes "NEEDLES", which also sent links to https://en.wikipedia.org/wiki/Needle. This happened in entries that had no flag values, e.g. *D or D. It was an offset error.

My apologies for this.

I replaced it with a simpler, more reliable method. You can try it out now.
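
To show how an offset error of this kind can arise, here is a small sketch. It is not the code from the script. The fixed-width layout, the `make_line` helper, and the sample records are all hypothetical: splitting on whitespace shifts every token by one when the optional flag column is blank, while slicing by column position does not.

```python
# Hypothetical fixed-width layout, for illustration only:
#   columns 0-10 designator, 13-17 catalogue number, 19-21 flags, 23-46 name
def make_line(designator, number, flags, name):
    """Build one made-up catalogue record in the layout above."""
    return f"{designator:<11}  {number:05d} {flags:<3} {name:<24}"


def parse_by_split(line):
    """Fragile: assumes a flags token is always present at index 2."""
    tokens = line.split()
    return tokens[1], " ".join(tokens[3:])


def parse_by_columns(line):
    """Stable: reads each field from its fixed column range."""
    return line[13:18].strip(), line[23:47].strip()


with_flag = make_line("2000-001A", 99001, "D", "EXAMPLE DEB")
no_flag = make_line("2000-001B", 99002, "", "EXAMPLE DEB")

print(parse_by_split(with_flag))    # name is read correctly
print(parse_by_split(no_flag))      # name collapses to "DEB"
print(parse_by_columns(no_flag))    # name is read correctly
```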

imwatsi commented 5 years ago

Any other issues/modifications for this script? @mainnebula