juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

Add subject Zooniverse URL to output #56

Open denslowm opened 5 years ago

denslowm commented 5 years ago

Here is an example URL:

https://panoptes-uploads.zooniverse.org/production/subject_location/1aed06fa-1ade-46ca-9f01-5edbb16ee1a1.jpeg

It would be great to have it in the raw and reconciled files.

PmasonFF commented 5 years ago

Unfortunately the Zooniverse URL for a subject does not appear anywhere in the classification export from Zooniverse. The options are:

- use panoptes_client to request the URLs directly from Zooniverse as an add-on to the reconcile step;
- require a subject export (which has the URLs) to be made available alongside the data export so an add-on could obtain them from there; or
- run a secondary script to obtain and add the URLs to the raw and reconciled files as post-processing.

The last option is relatively simple. I have used reconcile.py for a number of non-NfN projects where we use the CSV format option on a flattened data file, with the subject metadata pulled out of the subject data column into their own fields. These flow through and appear in the raw and reconciled files in their own columns. It would be relatively trivial to add the URL, obtained in either of the aforementioned ways, to this approach.

denslowm commented 5 years ago

OK, I am definitely not sure of the best way to do this at this point. Could you share more about how the secondary-script approach works?

PmasonFF commented 5 years ago

Once you have run reconcile.py you have an HTML summary file and a second file with the best-guess reconciled results. I am suggesting you would run a second script that would pull the URL either from a subject export for the project in question or by querying Zooniverse directly online using the panoptes-client, and add it as a column to the two files from reconcile.py.

This keeps reconcile.py clean and unaffected, though the second script could be a routine added into reconcile.py, possibly as an option. I am not thrilled with the suggestion to incorporate it into reconcile.py for fear it will make other uses of the reconciler more difficult for others. I am sure I could write a simple script for the CSV file of reconciled results, but I have no HTML experience to add it to the summary.html file.

I am a little surprised that none of the metadata uploaded with the subjects is pulled out into the final files now. As it stands, the only linkage from the reconciled or summary files to the real-world info for each subject is the Zooniverse subject number. Most of the projects I work with want some metadata field as the main key to the flattened and aggregated data rather than the Zooniverse subject number, and most want several metadata fields added to the flattened and aggregated files. A secondary script could retrieve specified metadata fields from your NfN subjects and output them in the summary and reconciled files as well.

After I wrote this I realized that for NfN-formatted data files reconcile.py does recover metadata fields from the subject data.

PmasonFF commented 5 years ago

Here is a simple stand-alone script that will add a column to any reconciled.csv file from Notes from Nature's reconcile.py with the subject location (either the Zooniverse URL or the external URL for each subject). At present I do not have the skills to add a similar field to the summary.html file. You can see it is very easy to request the URLs from Zooniverse, and I expect Rafe could add this functionality to the reconciler in a heartbeat.

This script does require the package panoptes_client to be installed in the environment! This is fairly straightforward for Mac or Linux, less so for Windows, where a standard install will throw a warning, but for this application it will still work fine.

https://github.com/PmasonFF/Zooniverse-data-digging/blob/master/Transcriptiontasks/add_locations_to_reconciled_file.py
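For readers who want to see the shape of that post-processing step, here is a minimal sketch (not the linked script itself) assuming the reconciled CSV has a `subject_id` column and that panoptes_client is installed; file and column names are placeholders.

```python
# Minimal sketch of the post-processing idea, not the linked script itself.
# Assumes reconciled.csv has a 'subject_id' column; names are placeholders.
import csv

from panoptes_client import Subject

IN_FILE = 'reconciled.csv'
OUT_FILE = 'reconciled_with_urls.csv'

with open(IN_FILE, newline='') as src:
    rows = list(csv.DictReader(src))

for row in rows:
    subject = Subject.find(row['subject_id'])   # one API request per subject
    # subject.locations is a list of dicts keyed by MIME type, e.g.
    # [{'image/jpeg': 'https://panoptes-uploads.zooniverse.org/...'}]
    row['subject_url'] = ' '.join(
        url for location in subject.locations for url in location.values())

with open(OUT_FILE, 'w', newline='') as dst:
    writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```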

denslowm commented 5 years ago

Thanks PmasonFF. I'll test this out when I can. I'd still like to discuss the first option further with the team, as I think the maximum value for our data providers will come from including this information in the output.

rafelafrance commented 5 years ago

Given the incomplete subject data returned to us, the above utility does what we need it to do. I wrote a similar one to get images for the Label Babel project.

Note: I really don't want to crawl panoptes just to get the URLs.

PmasonFF commented 5 years ago

An alternative to crawling Panoptes for the URLs would be to require a subject export along with the classification data and pull the URLs from there. Or perhaps add a parameter to run the utility automatically once the reconcile.csv file is built?
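If the subject-export route is taken, a merge along these lines would do it. This sketch assumes the export's `locations` column holds JSON mapping a frame index to the hosted URL and that both files share a `subject_id` column; file and column names are placeholders.

```python
# Sketch of the subject-export option: map subject_id -> URL from the
# project's subject export, then merge it into the reconciled file.
# Assumes a 'locations' column holding JSON like {"0": "https://..."}.
import csv
import json

urls = {}
with open('subjects.csv', newline='') as export:
    for row in csv.DictReader(export):
        locations = json.loads(row['locations'])
        urls[row['subject_id']] = locations.get('0', '')

with open('reconciled.csv', newline='') as src:
    rows = list(csv.DictReader(src))

for row in rows:
    row['subject_url'] = urls.get(row['subject_id'], '')

with open('reconciled_with_urls.csv', 'w', newline='') as dst:
    writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```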

rafelafrance commented 5 years ago

Or put the URLs into the subject data?

PmasonFF commented 5 years ago

That is possible, but again it requires either an uploader that first creates and uploads the subject, then finds the subject_id and updates the metadata with the subject locations, or a stand-alone script that updates the metadata with the subject locations. The problem is that the subject_id and the subject URL (locations) are not known until the subject is created and uploaded, so it is not something one can do in one step through the CLI.

rafelafrance commented 5 years ago

I don't see how you could not know what the URLs are. The images are served via URL, annotated, and the annotations are assigned to the subject. This is an upstream problem.

PmasonFF commented 5 years ago

The subject data in the classification export is derived from the metadata uploaded with the subject, so to get info in there it has to be in the subject metadata. The subject URLs are in the subject export but not the classification export. And it's a catch-22: you don't know either the subject.id or the subject.locations when you upload the images and whatever metadata you have. So one has to go back and add the subject.locations to the metadata AFTER the uploading process, once the subject has been saved. You cannot even ask for the subject.locations immediately after saving the subject, though at that point you can get the subject.id. To get the URL you have to open a second instance of the new subject, find the URL from that second instance, update the metadata, and then save the second instance of the subject.

I have done this both ways, as part of an uploader script and as a stand-alone script. I agree it would be the ideal way to go, but then I recommend uploader scripts over the CLI in any case, though not everyone agrees.
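A rough sketch of that two-step dance, using panoptes_client with placeholder credentials, project slug, subject-set id, and file name:

```python
# Sketch of the two-step process described above: the hosted URL only
# exists after the subject is saved, so a second fetch is needed to copy
# it back into the metadata. All identifiers here are placeholders.
from panoptes_client import Panoptes, Project, Subject, SubjectSet

Panoptes.connect(username='me', password='...')      # placeholder credentials

project = Project.find(slug='my-team/my-project')    # placeholder slug
subject_set = SubjectSet.find(12345)                 # placeholder id

# Step 1: create and save the subject; no URL is known at this point.
subject = Subject()
subject.links.project = project
subject.add_location('image_0001.jpg')
subject.metadata['image'] = 'image_0001.jpg'
subject.save()
subject_set.add(subject)

# Step 2: open a second instance of the saved subject; its hosted
# locations now exist, so copy the URL into the metadata and save again.
saved = Subject.find(subject.id)
url = next(iter(saved.locations[0].values()))
saved.metadata['subject_url'] = url
saved.save()
```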

denslowm commented 5 years ago

Thanks to you both!

A few things here. This is really an ask for two reasons. It's nice to see what the volunteers actually saw (in a convenient way); mismatches in the interface were actually a problem at one time, so this allowed us to verify things when we saw anomalies in the transcription data. It's also super helpful to have when our data providers are reviewing the outputs and making corrections before they load data into their local databases. Remember, I am not usually the one working with the data here, so I'm trying to lower the bar as much as I can.

Yes, one can look up the images in several different ways, but not all are very efficient.

We used to have the URLs in the output! I realize now that this was during NfN 1.0. I looked in the 2.0 data but didn't see it, so it's apparently been some time. I am not sure exactly how they were harvested, but I now realize that the format of the URLs has changed significantly since then. They followed a predictable pattern and included the subject_id.

Unless you all have other ideas, I guess we can close this.

Thanks Peter for the scripts that you created as a possible solution. I can see myself making use of those in the future.

PmasonFF commented 5 years ago

Who uploads NfN subjects now, and how? It would really not be any problem to customize an uploader for NfN that would do as Rafe suggests. Uploader scripts are good in that they are restartable if something goes wrong during the upload (with no duplicated subjects), and you get a summary of what did and did not upload. A number of projects are using them to upload thousands of images (10,000 ten-image subjects for the skinks project last week in one go, overnight), often without the need for a manifest. All that is necessary is an unambiguous link between the metadata that already exists in some CSV file(s) and the image file names, so the script knows what goes where.

Alternately we could have a Python script that fetches and displays the subject image based on the subject_id. Ideally I think we need to pull the HTML summary, the reconciled data in a format where it can be edited, and the subject image together in one package, presenting it subject by subject, for at least those subjects that did not reconcile well.

denslowm commented 5 years ago

I personally do almost all of the uploads. Sometimes with the CLI and mostly with the Project Builder.

I appreciate all these ideas, but again my concern is more on the part of the data providers, who tend not to be very familiar with Python and need a super easy solution. We tend to process (reconcile) the data for them and then hope they can take it from there.

PmasonFF commented 5 years ago

In my attempts to help, I fear I may be becoming a pain, but one last shot... It looks like the only way this is going to happen is with a stand-alone script, and it has to be run by you, MD. The options, to recap, appear to be: the script already offered above, which adds the URL to only the reconciled file; or a script to upload subjects and add the URL to the metadata, in which case reconcile.py will pull it from the subject data and add it to both the reconciled.csv and summary.html. A third option would replace the script above and add the URLs to the subject metadata in a stand-alone operation after they were uploaded; that also gets the URLs into both the summary and the reconciled file. The third would be the simplest of the three scripts to run, but incorporating it into an uploader has the biggest bang.

Would you consider looking at an uploader? Given a list of URLs for high-resolution images hosted somewhere, it is possible to get those images, automatically resize them (either to a pixel limit or a maximum file size), and upload them as subjects, with or without saving the low-res version locally (a rough sketch follows below). I have that in hand for Worlds of Wonder now; it would be trivial to add the Zooniverse subject URLs to the metadata at the same time. Snapshots at Sea also automatically resizes their images during upload.
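Here is that rough sketch of the download-resize-upload flow, assuming a manifest CSV with an image URL and a local file name per row, Pillow for the resize, and placeholder credentials, slug, and subject-set id:

```python
# Sketch of the resize-and-upload idea: fetch each high-resolution image,
# shrink it, upload it as a subject, and carry the manifest metadata along.
# Manifest columns, credentials, slug, and subject-set id are placeholders.
import csv
import io

import requests
from PIL import Image
from panoptes_client import Panoptes, Project, Subject, SubjectSet

MAX_PIXELS = 2048   # longest edge after resizing

Panoptes.connect(username='me', password='...')
project = Project.find(slug='my-team/my-project')
subject_set = SubjectSet.find(12345)

with open('manifest.csv', newline='') as manifest:
    for row in csv.DictReader(manifest):
        # Fetch the high-resolution image and resize it in memory.
        image = Image.open(io.BytesIO(requests.get(row['image_url']).content))
        image.thumbnail((MAX_PIXELS, MAX_PIXELS))
        local_name = row['image_name']
        image.save(local_name)                # low-res copy written locally

        subject = Subject()
        subject.links.project = project
        subject.add_location(local_name)
        subject.metadata.update(row)          # manifest columns become metadata
        subject.save()
        subject_set.add(subject)
```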

That's it - I leave it with you.

rbruhn commented 4 years ago

I just came across this topic because we are building an "Expert Reconciliation" feature on Biospex. We are using the --explanations argument when running our reconciliations and matching the "is_problem" pattern used in the summary. Then someone can go through the records, fix the problems, and download a "fixed" reconciled file.

Anyway, we show the image in the feature, and I was pulling down the huge images from the accessURI of our subjects. Austin asked if there was something quicker. I ended up querying the API in our PHP to get the image location and adding it to the database record. If it were ever included, it would be handy.

PmasonFF commented 1 year ago

The OP asked for the URL to the subject image itself, but it occurred to me that it is easy to provide a URL to a Zooniverse page that shows the subject and any discussions or comments about it. In fact one can find that page with just the subject id number by replacing the XXXXXXXX with the subject.id in the following address:

https://www.zooniverse.org/projects/aliburchard/generic-project/talk/subjects/XXXXXXXX

The part "aliburchard/generic-project" will suffice for displaying any subject from any project but an organization or project team could use their own project slug so that the header for the page is correct for the project in question.

This is probably the easiest way to see specific subjects with a known id.
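As a small illustration, building that link from a known subject id is a one-liner, shown here with a placeholder id:

```python
# Build the Talk URL for a subject; any project slug works for display.
subject_id = 12345678   # placeholder
talk_url = ('https://www.zooniverse.org/projects/'
            f'aliburchard/generic-project/talk/subjects/{subject_id}')
print(talk_url)
```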