olivierhagolle / LANDSAT-Download

Automated download of LANDSAT data from USGS website
http://olivierhagolle.github.io/LANDSAT-Download
GNU General Public License v3.0

Added handling for csrftoken #24

Closed: mkmitchell closed this 8 years ago

mkmitchell commented 8 years ago

First pull request!

For some reason there were extra spaces in some of the code, so I removed them in the area I was working in. I'm not sure whether this works with a proxy, but it worked fine with no_proxy.

olivierhagolle commented 8 years ago

Hi, sorry, but I have not been able to make it work. No file is downloaded; I just get an HTML dump, which is a new login screen. But my account works when I log in on the website.

Olivier

mkmitchell commented 8 years ago

Hey, I've been getting that HTML dump as well, even before this error started happening. I had to add a sys.exit() after a product wasn't found; otherwise it would just keep looping and printing the dump.

Does it look something like this?

python F:\Landsat\download_landsat_scene.py -o catalog -z -b LC8 -u usgs.txt -d 20160201 -f 20160331 --output F:\Landsat\Download --outputcatalogs=F:\Landsat\Download -c 50 -s 028035
None None
Verifying catalog metadata files...
Searching for images in catalog...
erreur : le fichier est au format html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

EROS Registration System (ERS)
<body>
    <!-- BEGIN USGS Header Template -->
    <div id="titleBar">
        EROS Registration System (ERS)
    </div>
    <div id="pageContent">
        <script type="text/javascript">
$(document).ready(function()
{
    $('#loginButton').button();
    $('.oAuthType').click(function()
    {
        window.location = 'https://ers.cr.usgs.gov/login/oauth/' + $(this).attr('data-serviceKey');
    });
    $('#registerButton').button().click(function()
    {
        document.location = 'https://ers.cr.usgs.gov/register';
    });
    $('#username').focus();
});

ERS consolidates user profile and authentication for all EROS web services into a single independent application.

Sign In

sign in with your existing USGS registered username and password

Don't have an account?

OMB number 1028-0119
OMB expiration date 06/30/2019

Privacy and Paperwork Reduction Act statements: 16 U.S.C. 1a7 authorized collection of this information. This information will be used by the U.S. Geological Survey to better serve the public. Response to this request is voluntary. No action may be taken against you for refusing to supply the information requested. The time required to complete this information collection is estimated to average 5 minutes per response. We will not distribute responses associated with you as an individual. We ask you for some basic organizational and contact information to help us interpret the results and, if needed, to contact you for clarification. Comments on this collection should be sent to custserv@usgs.gov.
    </div>
    <!-- BEGIN USGS Footer Template -->

product LC80280352016035LGN00 not found

mkmitchell commented 8 years ago

I tested one that I knew should work and got the same dump. I'll work on this in a bit. A few meetings first.

olivierhagolle commented 8 years ago

Yes, the tool prints the dump if it gets an HTML file instead of a LANDSAT file. It means the download is not working.

I have replaced the dump with a log of the HTML file written to an error file, which is more convenient. You might want to replace the block starting at if (req.info().gettype()=='text/html'): with:

if (req.info().gettype()=='text/html'):
    print "error : file is in html format, and shouldn't be"
    lignes=req.read()
    # The server answers with this text when the product does not exist
    if 'Download Not Found' in lignes:
        raise TypeError
    else:
        # Any other HTML page (e.g. a login screen) is saved for inspection
        with open("error_output.html","w") as f:
            f.write(lignes)
        print "result saved in ./error_output.html"
        sys.exit(-1)
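
For readers on Python 3, the same check can be sketched as a small standalone helper. This is only an illustration under assumptions: the function name `check_html_response` and its parameters are hypothetical and not part of the script, and the caller is left to decide whether to exit.

```python
def check_html_response(content_type, body, error_path="error_output.html"):
    """Return True when the response does not look like an HTML page.

    Raises TypeError when the page reports 'Download Not Found';
    any other HTML page (for example a login screen) is written to
    error_path for inspection and False is returned.
    """
    if not content_type.startswith("text/html"):
        return True
    if "Download Not Found" in body:
        raise TypeError("product not found")
    with open(error_path, "w", encoding="utf-8") as f:
        f.write(body)
    return False
```

A caller would pass the response's content type and decoded body, and stop the download loop when the helper returns False.
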
mkmitchell commented 8 years ago

I'm pretty new at all this fun web stuff. It appears I'm not getting the token correctly. I'll keep working on it.

mkmitchell commented 8 years ago

I'm not sure if this will help you, but I successfully requested the website, got a PHPSESSID cookie, then sent another request, but I still get the 403 error. I pulled the other cookie information from Fiddler and copied it in, with no luck.

`
log_data = {'username': usgs['account'], 'password': usgs['passwd']}
post_data = urllib.urlencode(log_data)

cookjar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookjar))
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0',
        'Host': 'ers.cr.usgs.gov:443',
        'Connection': 'keep-alive'}
req = urllib2.Request("https://ers.cr.usgs.gov/login")
f = opener.open(req)
data = dict((cookie.name, cookie.value) for cookie in cookjar)
headers = {'Host': 'ers.cr.usgs.gov',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2780.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cookie:': 'PHPSESSID='+data["PHPSESSID"]}
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0',
    'Host': 'ers.cr.usgs.gov',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Cookie:': 'PHPSESSID='+data["PHPSESSID"] + '; ee-system-notices=%5B%225261%22%2C%225141%22%5D; _ga=GA1.2.1349842508.1469734657; _gat_lta=1'}
req = urllib2.Request("https://ers.cr.usgs.gov/login", post_data, headers=headers)
f = opener.open(req)

`
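
A side note on the cookie handling in the snippet above: an opener built with a cookie processor stores the cookies it receives and resends them on later requests to the same host, so a hand-written cookie header should not be needed. The header key 'Cookie:' (with the trailing colon) is also unlikely to be accepted as a valid header name. A minimal Python 3 sketch of the session setup, where `make_session` and `cookies_as_dict` are hypothetical helper names:

```python
import http.cookiejar
import urllib.request


def make_session():
    """Build an opener that stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def cookies_as_dict(jar):
    """Return the cookies currently held in the jar as a name -> value dict."""
    return {cookie.name: cookie.value for cookie in jar}
```

Every request sent through the returned opener shares the same jar, so a PHPSESSID received on the first request is attached to the next one.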

mkmitchell commented 8 years ago

Alright. I fixed it for real this time.

mkmitchell commented 8 years ago

Alright. I had to add a dependency on BeautifulSoup because the way I got the csrf_token was to parse the HTML. Please hack away at it to make it meet your needs. I was exploring grabbing headers from the site and passing them in. The odd thing is that the headers I'm passing are completely empty, but if I remove them from the request it won't work.
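
For illustration, here is a rough sketch of pulling a hidden form field out of a login page. It is not the code from this pull request: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs, and it assumes the token lives in a hidden input named `csrf_token`. The names `HiddenInputParser` and `extract_csrf_token` are hypothetical.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def extract_csrf_token(html, field_name="csrf_token"):
    """Return the value of the hidden field, or None if it is absent."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden.get(field_name)
```

The token returned here would then be sent back along with the username and password when posting the login form.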

mkmitchell commented 8 years ago

I committed again and removed the header stuff and just send headers={}. This removed the need for itertools.
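
As a hedged Python 3 sketch of what the final login request might look like: the `username` and `password` field names come from the log_data dictionary earlier in this thread, while the `csrf_token` field name and the helper name `build_login_request` are assumptions made for this example.

```python
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password, csrf_token):
    """Build a POST request carrying the credentials and the CSRF token."""
    form = {
        "username": username,
        "password": password,
        "csrf_token": csrf_token,  # assumed field name
    }
    data = urllib.parse.urlencode(form).encode("ascii")
    # An empty headers dict is passed explicitly, as described above
    return urllib.request.Request(login_url, data=data, headers={})
```

The request would be sent through the same cookie-aware opener that fetched the login page, so that the session cookie and the token belong to the same session.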

olivierhagolle commented 8 years ago

Great, it works! The only drawback is that we have to download BeautifulSoup. Thanks a lot, Mike.

Olivier