juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

Allow for automatic generation of --key-column ID fields for CSV and JSON input formats #58

Closed rafelafrance closed 1 year ago

rafelafrance commented 4 years ago

Handle the case where we receive a CSV file without an assigned primary key.
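One possible reading of this request, sketched under loud assumptions: when a CSV has no usable key column, a sequential ID is generated for each row. The function name `add_row_keys` and the default column name `row_id` are hypothetical and are not taken from reconcile.py.

```python
import csv
import io


def add_row_keys(rows, key_column="row_id"):
    """Yield a copy of each row, adding a sequential key when it is missing or blank."""
    for index, row in enumerate(rows, start=1):
        keyed = dict(row)
        if not keyed.get(key_column):
            # Assumption: a 1-based row number is an acceptable stand-in key.
            keyed[key_column] = str(index)
        yield keyed


# Small in-memory CSV with no key column at all.
sample = io.StringIO("collector,date\nJ. Smith,1901\nJ Smith,1901\n")
keyed_rows = list(add_row_keys(csv.DictReader(sample)))
```

Rows that already carry a non-blank value in the key column are left untouched, so the same helper could be applied to files that only sometimes lack a key.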

denslowm commented 4 years ago

Is this for a key not provided by the data provider or by the Zooniverse system? Do you have an example workflow? I just want to make sure this isn't caused by something I am doing.

rafelafrance commented 4 years ago

The NfN data is clean, this is for CSV files not in the NfN system.

PmasonFF commented 1 year ago

At one time there was a -f csv argument that allowed one to run reconcile.py on any .csv file, with columns to be reconciled listed as -c arguments. I was trying to bring a new project on to use reconcile.py and this option appears to no longer be available. This for me is a major problem. Over the years I have used this software for many transcription projects, almost always after the zooniverse data export was flattened and pre-processed extensively - ie not the raw data export with automatic extraction of the transcribed fields as NfN normally operates.

Trying to find the last release that had this feature is also a problem - firstly I do not see that any versions between 4.3 and 5.0 are available, and secondly 4.3 no longer appears to work with available Python versions and packages. At this point I am trying to rebuild 0.4.8 with Python 3.8 for my project team.

Is the last version that supported the -f and -c arguments anywhere available?

And why on Earth was this option removed?

Peter

rafelafrance commented 1 year ago

I can put the options back for CSV & JSON files. Give me a workday or so.

PmasonFF commented 1 year ago

Thanks. At this point I am trying to build a working 4.8 version for the project team's Mac from the files I have on my computer which is Windows. I can copy all the relevant files to a new directory on my machine and then run it from there so hopefully the same file set will run on their machine.

Just for info I am working on a Python-based GUI for importing the reconciled and unreconciled flattened NfN export into an editor that allows the reconciled output to be directly edited - either directly using a text block for each reconciled field, copy and paste from any other source (such as another field) or selection of any one version of the unreconciled transcripts. I hope to link to a demo later this week. Here is a screen shot of the GUI as it stands - looks very much like the summary, except the white reconciled blocks are fully editable https://panoptes-uploads.zooniverse.org/subject_location/306e29c0-dc87-4bee-9593-f76f056d7bbd.jpeg, with the edits saved on "Submit".
Peter

rafelafrance commented 1 year ago

Well, I'm about 80% done with putting the CSV & JSON file formats back into the reconciler. I have to write code to validate the column selections and then test everything before pushing. Use it or not, your call.

The GUI looks interesting.

denslowm commented 1 year ago

@PmasonFF A somewhat tangential question. Is this Notes from Nature data that you are trying to process? I ask because we process all expeditions as a matter of routine. We appreciate all you do to help with data issues around the Zooniverse so I'd like to make sure your time is being used efficiently and we don't have a miscommunication with a Provider on our end. Thanks

rafelafrance commented 1 year ago

I just pushed a new version that allows entry of CSV & JSON input files. You'll need to pull the new version.

PmasonFF commented 1 year ago

@rafelafrance This is great, thanks! I am pleased this support will continue - I would not want to get locked into an older unsupported version.
@denslowm While I would love to be of more help to Notes from Nature teams, and I hope the editor I am working on may be of use for them, I am primarily working with teams that have similar transcription tasks. For example, WWI burial cards made extensive use of reconcile.py, as have some other projects in the past. The current one is not yet in Beta but is transcription from tables concerning fish sampling in the Mediterranean. Some of these projects have faltered at the stage where the reconciled data needs to be corrected for No match and onesies due to the effort involved (Danish Moths). There currently appears to be no easy way to bring the reconciled record, the actual transcriptions, and the zooniverse subject into one interface where editing can be easily done, similar to the OCR experiment, only for the standard transcription tasks. There was some previous discussion re Biospex doing something like this but I can find nothing on that. Peter

PmasonFF commented 1 year ago

Just getting back to this today. I set up a new Python 3.10 virtual environment, extracted the latest version of this repository and reran reconcile.py on a very large flattened .csv file which processed correctly under version 0.4.8. The reconcile file built fine but I got this error and the summary file did not build:

Traceback (most recent call last):
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 215, in <module>
    main()
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 208, in main
    summary.report(args, unreconciled, reconciled)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 32, in report
    header=header_data(args, unreconciled, reconciled, transcribers),
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 90, in header_data
    "title": f"Summary of '{args.workflow_name}' ({args.workflow_id})",
AttributeError: 'Namespace' object has no attribute 'workflow_name'. Did you mean: 'workflow_csv'?

As well I was hoping the --explanations feature was still available - I am counting on using that to identify problems to correct in the editor I am building.

The editor is up and running but I wanted to ensure it worked with this latest reconcile.py. Hopefully I will post a GitHub link by next week, and I want to make up a demo for Michael.

Peter

rafelafrance commented 1 year ago

  1. I just pushed an update that should handle the problem you were having with the summary.
  2. The explanations CSV is a bit more problematic. I hope to have an update for that on the next Friday that lands on a workday. FYI: There were massive changes recently and that CSV got sidelined. Time to put it back in.
  3. Would you please send along a test CSV that I can use to test changes, say, between 20 and 100 records? I don't like changing things in the dark.
  4. I look forward to seeing your GUI.

PmasonFF commented 1 year ago

This is a small file from a project which is going for review. It is a pure transcription project that could be handled with the -f nfn switch, but there is considerable variation in the symbols used for degree and minutes, that required preprocessing. I have included both the raw and preprocessed flattened versions, and the parameter string and results for the preprocessed file.

The raw flattened file: flatten_digging-up-the-oceans-past_class_sorted_raw.csv
After preprocessing: flatten_digging-up-the-oceans-past_class_sorted.csv
Parameters for after preprocessing: reconcile_ocean_past_parameters.txt
Reconciled file using version 0.4.8 with --explanations switch on: oceans-past_reconciled.csv

I can not attach html files but here is a link https://drive.google.com/drive/folders/1KTzCZTesO0U4LDUpx7Rp5axUhA7K-BiY?usp=sharing

If the expectation is that the --explanations will be available going forward I will proceed to set up a git for the editor tomorrow.

I certainly appreciate your effort. reconcile.py is a very useful script. Even if it is only to produce the summary user info for non-transcription projects it is a big help, and for transcription projects with short texts it is amazing!

Peter

PmasonFF commented 1 year ago

I have uploaded the editor for NfN reconciled files here. From the readme:

Editor for Reconciled NfN Transcripts

A Python-based GUI for easy editing of Zooniverse transcriptions reconciled using NfN reconcile.py

This script takes as inputs the reconciled-with-explanations and the flattened unreconciled .csv files as produced by Notes from Nature's reconcile.py.

The editing GUI itself is patterned after the NfN Summary html template - except the reconciled result is fully editable using cut, paste, and copy keyboard commands from any field shown in the editor, direct character entry, deletion or replacement in the reconciled text block, or by simply selecting the best version of the actual transcriptions entered by the volunteers.

The subjects to be edited are retrieved from Zooniverse using the panoptes client, and shown with the editing GUI for each subject selected for review.

Over the next day or two I will upload files suitable for a demo and hopefully links to various screen shots of the editor in action.

Meanwhile if there is any interest I am happy to help in any way I can. Peter

rafelafrance commented 1 year ago

I will definitely give this a try! Unfortunately this landed during the holiday so it may be a few days before I have any solid feedback.

I hope to land the explanations patch/resurrection next Friday :crossed_fingers:

rafelafrance commented 1 year ago

I added an --explanations option

PmasonFF commented 1 year ago

Having a few issues:

1)

For me the most serious issue is the explanations are being output on a second line for each subject_id

reconciled_sample_label_transcription.csv

I have to think about how I can work with this if it has to stay that way for the nfn teams.

2)

One I think I have a fix for: When reconciling a .csv file, getting following error from summary.py:

  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 223, in <module>
    main()
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 216, in main
    summary.report(args, unreconciled, reconciled)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 32, in report
    header=header_data(args, unreconciled, reconciled, transcribers),
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\summary.py", line 89, in header_data
    if args.workflow_name and args.workflow_id:
AttributeError: 'Namespace' object has no attribute 'workflow_name'. Did you mean: 'workflow_csv'?

The lines at issue are:

    title = f"Summary of '{args.workflow_csv}'"
    if args.workflow_name and args.workflow_id:
        title = f"Summary of '{args.workflow_name}' ({args.workflow_id})"

But workflow_name is no longer a parameter so we get the error. As well, neither workflow_csv nor workflow_id is likely to be defined for csv files. In the past the .csv default title was from

'title': args.title if args.title else args.input_file,

so a proposed solution is

    if args.workflow_csv:
        title = f"Summary of '{args.workflow_csv}'"
    elif args.workflow_id:
        title = f"Summary of Workflow {args.workflow_id}"
    else:
        title = f"Summary of {args.input_file}"     
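A defensive variant of the proposal above can be sketched as follows, on the assumption that any of these attributes may be absent from the Namespace depending on the input format. Using `getattr` with a default avoids the AttributeError altogether. The function name `summary_title` and the fallback order are illustrative only and are not the code in pylib/summary.py.

```python
from argparse import Namespace


def summary_title(args):
    """Pick a report title from whichever attributes the Namespace actually has."""
    # getattr with a default never raises, even if the option was not defined.
    workflow_name = getattr(args, "workflow_name", None)
    workflow_id = getattr(args, "workflow_id", None)
    workflow_csv = getattr(args, "workflow_csv", None)
    input_file = getattr(args, "input_file", None)

    if workflow_name and workflow_id:
        return f"Summary of '{workflow_name}' ({workflow_id})"
    if workflow_name:
        return f"Summary of '{workflow_name}'"
    if workflow_csv:
        return f"Summary of '{workflow_csv}'"
    if workflow_id:
        return f"Summary of Workflow {workflow_id}"
    if input_file:
        return f"Summary of {input_file}"
    return "Summary"


# A CSV run where only the input file is known.
csv_title = summary_title(Namespace(input_file="flattened.csv"))
```

Blank strings are treated the same as missing attributes here, so an empty workflow_csv falls through to the next candidate instead of producing a title like Summary of ''.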

3)

When running a direct export from zooniverse ie a file similar to what comes from nfn, asking for all three output files - reconciled with explanations, unreconciled, and summary, I get the following error:

Traceback (most recent call last):
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 223, in <module>
    main()
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\reconcile.py", line 207, in main
    unreconciled.to_csv(args.unreconciled)
  File "C:\py_scripts\Scripts_Reconcile_3.10_64bit\pylib\table.py", line 35, in to_csv
    keys = [args.group_by, "__order__"]
AttributeError: 'str' object has no attribute 'group_by'

The issue is line 207 is passing a specific argument (args.unreconciled - a string) where Table is expecting the Namespace. I am not sure how Table.to_csv is to output the unreconciled file at all... Perhaps build a specific method into Table to output unreconciled?
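The mismatch described here can be illustrated with a small stand-in class. This is a sketch only: the `Table` below is hypothetical and is not the class in pylib/table.py. It assumes the method needs the Namespace for `group_by` and takes the output destination as a separate parameter.

```python
import csv
import io
from argparse import Namespace


class Table:
    """Hypothetical stand-in used only to illustrate the call signature."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def to_csv(self, args, handle):
        # The Namespace supplies the sort keys; the destination is passed separately.
        keys = [args.group_by, "__order__"]
        ordered = sorted(self.rows, key=lambda row: tuple(row[k] for k in keys))
        # The internal ordering column is dropped from the output.
        columns = [c for c in self.columns if c != "__order__"]
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(ordered)


args = Namespace(group_by="subject_id", unreconciled="unreconciled.csv")
table = Table(
    columns=["subject_id", "__order__", "text"],
    rows=[
        {"subject_id": 2, "__order__": 0, "text": "b"},
        {"subject_id": 1, "__order__": 1, "text": "a2"},
        {"subject_id": 1, "__order__": 0, "text": "a1"},
    ],
)
buffer = io.StringIO()
table.to_csv(args, buffer)
output = buffer.getvalue()
```

With a signature like this, calling `table.to_csv(args.unreconciled)` would fail early with a missing-argument TypeError rather than with the less obvious AttributeError on a string. To write a real file, the handle would come from `open(args.unreconciled, "w", newline="")`.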

4)

Minor - In the summary I am getting odd results for the Transcriptions per transcriber "x" axis. This may be simply due to the very small numbers of transcribers in this sample export. Also the order of the columns is odd, with two columns from tasks to the right of the metadata fields.

sample_label_transcription_classifications.csv

I am sorry I am causing you so much trouble! I will play with this some more to see if I can find some fixes myself! Peter

rafelafrance commented 1 year ago

Edit: 2 & 3 were easy fixes. BTW: Thanks for the CSV attachment, that makes things easier.

  1. The explanations were done that way before. Seeing that you're the only one who wants the explanations in CSV form I can change it to be a separate CSV file if you want. It would simplify the code.
  2. ~Your proposed solution is close to what I thought I had in there already. My bad, I'll fix.~
  3. ~I'll look into this. We definitely need a group_by column if you're going to reconcile/summary your data... It should be there already.~
  4. I'll look into it. The charting logic is ancient so there may be quirks with it.

I need an answer on 1, I can work on the others in the meantime.

PmasonFF commented 1 year ago

With older version 0.4.8 reconciled with explanations had the explanations in additional columns in the reconciled file, with one line per subject id even for files derived from raw data exports ie nfn format:

oceans-past_reconciled_nfn.csv

As of a few minutes ago, for .csv files produced by the 0.5.6 version with my tweak from above I get

reconciled_sample_label_transcription_0_5_6.csv

with two lines per subject where with 0.4.8 I have:

reconciled_sample_label_transcription_0_4_8.csv

I am hoping that both the csv and nfn formats return the reconciled file with explanations available in the one line the same way, since the editor is set up to search out problems to fix based on the explanations.

Re preceding fix for the title - workflow_name is still not an argument so title = f"Summary of '{args.workflow_csv}'" will always be the title, often with args.workflow_csv blank.

rafelafrance commented 1 year ago

  1. ~Working on merging the two rows.~
  2. ~The workflow name can be an argument. I'll leave the number out if it's not there.~
  3. The user summary is a mess, it may take a while.

rafelafrance commented 1 year ago

How do the explanations and summary title look now?

The user summary is not ready.

PmasonFF commented 1 year ago

When I apply the script to the flattened .csv file everything works fine - the explanations are in columns after the reconciled columns but that works fine with my editor script so no problem.

When I apply the revised script to the raw data export from zooniverse things are not so perfect - note the column order - especially for tasks T12 and T14. For reasons unknown those columns follow the metadata. Only three subjects have at least one transcription for those tasks:

reconciled_sample_label_transcription_nfn.csv

What determines the column order?

When a subject has no responses for a task, normally the explanation says that, but for these columns which are out of place, subjects with no responses for those tasks show no explanation text either. This may be an issue with this workflow which allowed volunteers to skip some questions and transcription blocks entirely, and is also quite dated. It may have an older format for certain task types too, and certainly does for the subject_data column.

The reconciled with explanations file derived from the raw export still works with the editor (ie it does not crash, and the fields can be edited) but as one would expect both the reconciled value and the explanations text are blank for the T12 and T14 tasks for those subjects where the reconciled explanations are blank, and of course the columns are out of order as well in the editor.

Most project data I work with will be flattened, and I have very few raw data exports that are suitable for testing the nfn format case for these scripts. This one has no subjects where there are no responses at all for a task and it works fine ( though the task labels are too long and cumbersome) :

digging-up-the-oceans-past-classifications_cleaned.csv

rafelafrance commented 1 year ago

I just pushed some changes to the repo. They should handle most (all?) of the issues. You should be aware that you'll need to install a new library: pip install -r requirements.txt.

I had to do major surgery on the bar chart. So that's going to look a bit different. The first 49 counts are all handled separately. The 50th one contains all counts of 50 and above. So if someone transcribed 1000 images that'll be in the 50+ bar.
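The binning described above can be sketched roughly as follows. This is an illustration of the idea, not the repository's charting code. The names `MAX_BIN` and `bin_transcriber_counts` are made up for the example.

```python
from collections import Counter

MAX_BIN = 50  # counts of 50 or more share the final bar


def bin_transcriber_counts(counts_per_transcriber, max_bin=MAX_BIN):
    """Return {bar_label: number_of_transcribers}, with one overflow bar at the end."""
    # Clamp every count to max_bin so that all large counts land in the same bar.
    binned = Counter(min(count, max_bin) for count in counts_per_transcriber)
    labels = [str(i) for i in range(1, max_bin)] + [f"{max_bin}+"]
    values = [binned.get(i, 0) for i in range(1, max_bin + 1)]
    return dict(zip(labels, values))


# Six transcribers; the one with 1000 transcriptions lands in the "50+" bar.
bars = bin_transcriber_counts([1, 1, 3, 49, 50, 1000])
```

Each of the first 49 bars holds exactly one count value, and the last bar absorbs everything from 50 upward, which keeps a few very prolific transcribers from stretching the axis.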

There are other changes, there had to be given the bugs, but that's the one that'll stand out.

Please test thoroughly.

PmasonFF commented 1 year ago

I have begun testing V 0.6.0. Where should I report findings? Here, or open an issue, or??

1) Summary title when running a csv format is coming up Summary of '' - ie does not default to file name

2) Possibly intentional - all blank transcription fields are now shown with pink highlight in the summary and included as Problems for the field.

3) Summary graph for big projects - binning for projects with large number of subjects and active volunteers ends up skewed to 50+ bin see example https://panoptes-uploads.zooniverse.org/subject_location/f00bff5d-6432-4b48-85d9-26d3d7da2cd8.jpeg possibly consider dynamic scaling based on count for most prolific volunteer? Previously the list of volunteers was sorted most classifications to least and the axis was scaled accordingly. In the past some projects ran their data through reconcile.py just for this graph - even though the project was NOT a transcription project.

4) Memory issues for big projects (around 600,000 classifications)- this is not a new issue but for very large data exports the summary file will result in memory errors and in some cases even if it completes the summary file will not show the reconciled or unreconciled data - the user table at the top builds, as does the graph, and the Reconciliation summary but there is no data shown in the reconciliation detail area. This is not a serious issue since for such large projects the summary is not much use for working with the output. The reconciled file with explanations seems to build fine even for large projects.

5) I do not have much experience working with nfn format exports, but a few I tried today had some problems, mostly due to the workflows involved. I see NfN generally uses very short task labels, and very simple consistent subject metadata. The issues I am seeing are for situations where the tasks are not in order, with looonnng task labels, and the metadata fields from one subject to the next are not perfectly consistent... I would say if the script is working for NfN it does not need refinement - most of the uses I have done are on flattened .csv files, and there, other than the points noted above is working fine for me.

rafelafrance commented 1 year ago

  1. ~I'll fix.~
  2. The change is intentional; however, it is possible to have a command line switch for this. The "problems list" is in a single constant which could get generated on the fly. If you want this please create a new issue.
  3. Hmm... I can have an option determine the number of bars. And/or I can have one for YOU selecting the bin size with the last one still holding the residuals. The outliers really mess up automatic bin generation. We can discuss this in a new issue.
  4. We can add another command line option to not generate the reconciled detail section of the summary report. Are you combining expeditions? Or using computer generated data? Because 600K sounds industrial, not the Zooniverse scale I'm accustomed to. Even 30K transcripts seemed painfully slow to me. Fixing this would require another rewrite.
  5. We can discuss the problems here in a new issue.
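The bin-size idea in item 3 could look something like the sketch below, assuming the user picks both a bin width and a number of bars, and the last bar keeps the residuals. The function `bin_with_width` and its parameters are hypothetical, not existing reconcile.py options.

```python
def bin_with_width(counts, bin_width, num_bins):
    """Bin counts into fixed-width bars; the last bar holds everything above."""
    labels = [
        f"{i * bin_width + 1}-{(i + 1) * bin_width}" for i in range(num_bins - 1)
    ]
    labels.append(f"{(num_bins - 1) * bin_width + 1}+")
    totals = [0] * num_bins
    for count in counts:
        if count < 1:
            continue  # transcribers with no transcriptions are not charted
        # Counts past the last regular bar are clamped into the residual bar.
        index = min((count - 1) // bin_width, num_bins - 1)
        totals[index] += 1
    return dict(zip(labels, totals))


# Width 10 with 3 bars gives 1-10, 11-20, and a residual bar for 21 and above.
wide_bars = bin_with_width([1, 10, 11, 25, 5000], bin_width=10, num_bins=3)
```

Because the residual bar is fixed, an outlier such as the 5000 above changes only that bar's height and not the layout of the rest of the chart.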

TL;DR

rafelafrance commented 1 year ago

A lot of changes. Please check the output carefully.

PmasonFF commented 1 year ago

You have had a busy day! Most of the issues I have seen so far have been addressed in other issue threads. Thank You!

This tool is primarily for NfN. As long as it does the minimum I need for other projects I should be happy, not asking for fixes and enhancements, so I am a bit reluctant to raise issues.

I will be checking it out over the weekend and let you know if there is anything major....

BTW I announced the editor in the zooniverse Data Analysis forum https://www.zooniverse.org/talk/1322/2729009?page=1&scrollToLastComment=true. I hope the attribution to NfN and reconcile.py is sufficient. Please inform if I should include any other links.

rafelafrance commented 1 year ago

First of all, I have to say that the Reconciler Editor looks great.

As far as attribution, I got more than enuf attention. I can't speak for others tho.

I appreciate your reluctance and we don't want to get trivial requests but if something would make your or your clients' lives easier then make an issue.

The code got rough again so you may find issues, in which case, please do file a report.