lsst-epo / citizen-science-notebooks

A collection of Jupyter notebooks that can be used to associate Rubin Science Platform data with a Zooniverse citizen science project.

Downloading data from Zooniverse; classification_export.status_code == 403 error #38

Open beckynevin opened 1 year ago

beckynevin commented 1 year ago

Describe the bug The last cell of the citizen science notebook (the one that fetches the classifications from Zooniverse using the Panoptes client) fails roughly one run in ten.

To Reproduce Steps to reproduce the behavior:

  1. Restart the kernel
  2. Scroll down to the last cell in the citizen science notebook
  3. Run the cell
  4. Follow the directions to log in with your Zooniverse credentials.
  5. The cell usually succeeds; keep restarting the kernel and rerunning it until you see the error.

Expected behavior The classifications download without error. In other words, classification_export.status_code == 200 and classification_export.ok == True.

Actual behavior Roughly one run in ten, classification_export.status_code == 403.

Screenshots

[Screenshot: EDC Output]

INPUT
# This cell is set up to run independently from all of the above cells
import panoptes_client, utils
panoptes_client.Panoptes.connect(login="interactive")
# This project_id is found on Zooniverse by selecting 'build a project' and then selecting the project
# You don't need to be the project owner.
project_id = 19539
classification_export = panoptes_client.Project(project_id).get_export('classifications')
list_rows = []
counter = 0
# If the following line throws an error, restart the kernel and rerun the cell.
for row in classification_export.csv_reader():
    if counter == 0:
        header = row
    else:
        list_rows.append(row)
    counter += 1
df = utils.pandas.DataFrame(list_rows, columns = header)
df

SAMPLE OUTPUT
Enter your Zooniverse credentials...
Username:  rebecca.nevin
 ········
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Input In [1], in <cell line: 14>()
     10 counter = 0
     11 # I get a weird error if I run the rest of this notebook first and don't rerun the import and call
     12 # to panoptes_client above: 
     13 # Error: iterator should return strings, not bytes (the file should be opened in text mode)
---> 14 for row in classification_export.csv_reader():
     16     if counter == 0:
     17         header = row

Error: iterator should return strings, not bytes (the file should be opened in text mode)

Additional context Here is the code we wrote that bypasses this issue. We are not including this in the alpha version of the code release, but we'd like to include it down the road. Currently, we just have one comment that recommends re-running the cell if it fails.

# This cell is set up to run independently of all the cells above.
import panoptes_client, utils
panoptes_client.Panoptes.connect(login="interactive")
# This project_id is found on Zooniverse by selecting 'build a project' and then selecting the project.
# I also don't think you need to be the project owner, but I'm not sure.
project_id = 19539
classification_export = panoptes_client.Project(project_id).get_export('classifications')
list_rows = []
counter = 0
# If the rest of this notebook was run first, the import and the call to
# panoptes_client above must be rerun; otherwise this error appears:
# Error: iterator should return strings, not bytes (the file should be opened in text mode)
if classification_export.status_code == 200 and classification_export.ok:
    for row in classification_export.csv_reader():
        if counter == 0:
            # The first row of the export is the header.
            header = row
        else:
            list_rows.append(row)
        counter += 1

    df = utils.pandas.DataFrame(list_rows, columns=header)
    print(df)
elif classification_export.status_code == 403:
    # 403 is the intermittent failure described in this issue.
    print("There was an issue with the request, please try again in a minute.")
else:
    # Any other status: print what came back to help with debugging.
    print(classification_export.status_code)
    print(classification_export.text)

clareh commented 1 year ago

I had a discussion with someone who is keen to use our pipeline down the road, and they raised a concern about the delay in getting results. They think the ~24-hour wait to get classifications will impact their ability to do science. Worth discussing in this context, perhaps?

bnord commented 1 year ago

@clareh What specific concerns did they have about the delay? Why does a 24-hour delay affect their science capacity?

beckynevin commented 1 year ago

Maybe the above two comments should be moved to a separate discussion? They don't seem related to this issue/bug, but rather to the general topic of how to fetch data.

bnord commented 1 year ago

@clareh Could you start an issue or a new discussion on this?

eatyourgreens commented 1 year ago

Hi! I've added myself to this as the Zooniverse contact.

My first thought is that perhaps the failed requests are using expired Authorization headers but I will investigate.

ericdrosas87 commented 1 year ago

Thank you @eatyourgreens !

eatyourgreens commented 1 year ago

Hi again,

Do you know if the classification export is being requested after its signed URL has expired? Here's an example of an expired link: https://panoptesuploads.blob.core.windows.net/private/project_classifications_export/2659a7c3-043d-45c7-8cef-c0fbae185cc5.csv?sp=r&sv=2018-11-09&se=2023-06-07T22%3A08%3A14Z&sr=b&sig=rnOa82WJhSROjG61If1qZ0QLIGcHT3KADJptlQB%2BoAE%3D

The URLs expire 3 minutes after they're generated, so maybe that's the cause of the problem?
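
One way to check this from the notebook side is to read the expiry time out of the link itself. The following is a minimal sketch, assuming the `se` query parameter is the link's expiry timestamp in UTC (as the `se=2023-06-07T22%3A08%3A14Z` value in the example link suggests). The helper names `signed_url_expiry` and `is_expired` are hypothetical and not part of panoptes_client.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the time encoded in the URL's `se` query parameter, as UTC."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes values, so "%3A" becomes ":".
    raw = query["se"][0]
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_expired(url, now=None):
    """True if the time in `se` has already passed (assumes `se` is the expiry)."""
    now = now or datetime.now(timezone.utc)
    return now >= signed_url_expiry(url)


# Copied from the expired link quoted above, with the signature omitted.
example = (
    "https://panoptesuploads.blob.core.windows.net/private/"
    "project_classifications_export/2659a7c3-043d-45c7-8cef-c0fbae185cc5.csv"
    "?sp=r&sv=2018-11-09&se=2023-06-07T22%3A08%3A14Z&sr=b"
)
print(signed_url_expiry(example))  # 2023-06-07 22:08:14+00:00
print(is_expired(example))
```

If the response object returned by get_export exposes the URL it was fetched from, that URL could be passed to `is_expired` before reading the CSV; whether it does is not confirmed in this thread.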

If the signed URL has expired, I think that you need to retry and generate a new URL.
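
A retry along those lines could be sketched as below. This is only an illustration, not a tested fix: `fetch_with_retry` is a hypothetical helper, and the attempt count and wait time are arbitrary. It takes any zero-argument callable that returns an object with a `status_code` attribute; in the notebook that callable could wrap the `panoptes_client.Project(project_id).get_export('classifications')` call from the original cell.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, wait_seconds=60, sleep=time.sleep):
    """Call fetch() until it returns a response with status_code 200.

    A 403 is treated as the transient failure described in this issue and is
    retried after a pause; any other non-200 status stops the loop at once.
    """
    last = None
    for attempt in range(1, max_attempts + 1):
        last = fetch()
        if last.status_code == 200:
            return last
        if last.status_code != 403:
            break
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError(f"export request failed with status {last.status_code}")


# Stand-in for the real request so the sketch runs without network access.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([403, 403, 200])
pauses = []
result = fetch_with_retry(lambda: FakeResponse(next(statuses)), sleep=pauses.append)
print(result.status_code, pauses)  # 200 [60, 60]
```

Whether a second request actually yields a fresh URL depends on how Panoptes generates and caches the links, so the wait between attempts may need to be longer than a few seconds.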

eatyourgreens commented 1 year ago

https://github.com/zooniverse/panoptes/pull/4209 might fix this, once it’s deployed to Panoptes production.

Credit to @yuenmichelle1 for figuring out the caching problem: those classification links are good for 3 minutes but Panoptes caches them for 5 minutes, so there's a 2-minute window in which Panoptes can give you an expired link.
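
The timing can be written out as a small sketch. It assumes the link is generated at the moment it enters the cache and that the cache entry is replaced with a fresh link once it is 5 minutes old; neither detail is confirmed here. The function name is hypothetical.

```python
LINK_LIFETIME_S = 3 * 60   # a signed link is valid for 3 minutes
CACHE_LIFETIME_S = 5 * 60  # Panoptes serves the cached link for 5 minutes


def cached_link_is_expired(age_s):
    """True if a cached link of this age is still served but no longer valid."""
    return LINK_LIFETIME_S <= age_s < CACHE_LIFETIME_S


# Age of the cached link in seconds, and whether a request at that age
# would receive an expired link.
for age in (0, 179, 180, 299, 300):
    print(age, cached_link_is_expired(age))
```

Under these assumptions, a request that arrives between 180 and 299 seconds after the link was cached receives a link that has already expired.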

ericdrosas87 commented 1 year ago

Thank you for the update @eatyourgreens, we'll retest soon