juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

-f csv requires subject_id while -f nfn requires subject_ids in input file #39

Closed PmasonFF closed 6 years ago

PmasonFF commented 6 years ago

I have not examined why, but when working with csv files, reconcile.py bombs unless the column holding the subject id numbers is called subject_id (no 's'), while the standard nfn (or any Zooniverse) classification file provides the header subject_ids (with 's'). Once the input file column header is "corrected" to subject_id it works fine.
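For anyone hitting the same thing, the header "correction" described above can be sketched in a few lines of standard-library Python. This is only an illustration, not part of reconcile.py: the function name `rename_column` is hypothetical, and it assumes the whole file fits in memory and that only the header row needs to change.

```python
import csv
import io


def rename_column(csv_text, old="subject_ids", new="subject_id"):
    """Return csv_text with the header cell `old` renamed to `new`.

    Hypothetical helper for illustration; data rows are left untouched.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the first row (the header) is rewritten.
    rows[0] = [new if name == old else name for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = "classification_id,subject_ids\n1,42\n"
    print(rename_column(sample), end="")
```

The alternative that leaves the file alone is the "--group-by" option mentioned in the reply below.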

rafelafrance commented 6 years ago

This isn't really a bug but it may appear inconsistent at first glance... maybe at second glance too.

The cause of this is the highly peculiar CSV files that we get from the Adler. I tried to isolate all of the highly idiosyncratic code in the nfn.py module and then give the "--format csv" & "--format json" options obvious defaults. The code for "--format nfn" is its own beast and I don't want to make the rest of the code look or behave like this mess.

We're currently using the "-f csv" to get data out of ancient version 1 Notes from Nature.

FYI: You can change what column you want to group on with the "--group-by" option.

PS: Are you looking to use this to reconcile other Zooniverse projects? If so, then we can open an issue where we discuss what we/you could do to make it work. Like maybe another "--format" plugin or, better yet, make a "--format sql" that reads from the database directly. I'm busy but willing to help.

PmasonFF commented 6 years ago

> Are you looking to use this to reconcile other Zooniverse projects?

Yes. I am writing small blocks of code that can be added to a basic framework to flatten any Zooniverse classification file into a csv with the data in a more usable form for aggregation. For any task involving transcriptions, the blocks produce an output file suitable to feed directly to reconcile.py; see https://github.com/PmasonFF/Zooniverse-data-digging, specifically the files with transcription or trans in the filename. As there is a reason for it, I will simply configure the output files to reconcile.py's needs, i.e. use subject_id rather than passing --group-by subject_ids on the command line.

rafelafrance commented 6 years ago

NfN needs are slightly different from Zooniverse needs. The reason, specifically, is that Zooniverse data allows more than one subject per transcript and the reconciler script assumes there is only one subject per transcript. It's not a hard requirement but we'd have to change what we group by to accommodate multiple subject IDs per transcript, NBD.

This is an early point release; all options are still on the table. If it would prove to be more widely useful to the community then we can change to use subject IDs. It's a simple matter of converting arrays to strings vs plucking out the first element of an array.
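To make the two choices concrete, here is a rough sketch. It assumes the subject IDs for a transcript have already been parsed into a Python list; both function names and the ";" separator are made up for illustration and are not what reconcile.py actually does.

```python
def first_subject_id(subject_ids):
    """One subject per transcript: pluck out the first element."""
    return subject_ids[0]


def combined_subject_key(subject_ids, sep=";"):
    """Several subjects per transcript: turn the whole array into one string key.

    The separator is an arbitrary choice for this sketch.
    """
    return sep.join(str(s) for s in subject_ids)


if __name__ == "__main__":
    ids = [101, 202]
    print(first_subject_id(ids))       # group on the first subject only
    print(combined_subject_key(ids))   # group on the full set of subjects
```

Either value can then serve as the group-by key; the second keeps transcripts that share only their first subject from being lumped together.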

PmasonFF commented 6 years ago

How useful all this is may be a valid question! Larger projects have teams with their own IT people. My efforts are directed at the smaller fish that have little IT support and are faced with the JSON annotations column, but want to use Excel, or just get the data in the simplest form.... At this point I can work with reconcile.py as it is.

Multiple subjects is simply a further complication of a more common issue: transcriptions made as sub-tasks of a drawing tool. In both cases one needs to know the location in the subject(s) of the information being transcribed in order to group the various volunteers' transcriptions. For example, a page of a ship's log may have three or more dates, plus temperatures and pressures for each day. Imagine a task to mark each date in green, each temperature in red and each pressure in blue, each with a sub-task to transcribe the data. To reconcile the inputs we need to first group the volunteers' data by location, then reconcile the grouped transcriptions. (Old Weather is doing just that, but their code is more highly specialized and not so useful for a general project.)

I have a general approach which will handle this. Many projects need to cluster location information, and that problem has a solution; all I have to do is take the clustered transcripts and output them in the format reconcile.py expects. It would be nice, but not likely worth much effort, if I could pass a list of text strings (e.g. the various volunteers' transcriptions from a specific location) and have the list reconciled directly. For now I can rebuild the individual classification lines with the transcriptions split out as simple text strings in columns by location (for example, the three dates that were all transcribed from the area near (x, y) of frame 0 are now cleanly isolated in one column). This can then feed directly to reconcile.py as it stands now. I just have to call the column that was subject_ids in the original classification subject_id. :)
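As a rough illustration of the "group by location first" step (not the actual clustering code referred to above), a naive version might look like the following. It assumes each mark arrives as an (x, y, text) tuple in pixel coordinates; the function name and the `radius` threshold are invented for the sketch, and a mark simply joins the first group whose seed point lies within that radius.

```python
import math


def group_by_location(marks, radius=20.0):
    """Group transcription texts whose marks fall near the same spot.

    marks:  iterable of (x, y, text) tuples (assumed input shape).
    radius: maximum distance from a group's first mark (assumed threshold).
    Returns a list of lists of text, one inner list per location.
    """
    clusters = []  # each entry: {"seed": (x, y), "texts": [...]}
    for x, y, text in marks:
        for cluster in clusters:
            seed_x, seed_y = cluster["seed"]
            if math.hypot(x - seed_x, y - seed_y) <= radius:
                cluster["texts"].append(text)
                break
        else:
            # No existing group is close enough, so start a new one.
            clusters.append({"seed": (x, y), "texts": [text]})
    return [cluster["texts"] for cluster in clusters]


if __name__ == "__main__":
    marks = [
        (10, 10, "1 Jan 1880"),
        (12, 11, "1 Jan. 1880"),
        (200, 50, "54 F"),
        (11, 9, "Jan 1 1880"),
    ]
    for texts in group_by_location(marks):
        print(texts)
```

Each inner list would then become one column of transcriptions to reconcile. A real project would probably want a proper clustering method rather than this first-come seeding.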

rafelafrance commented 6 years ago

Interesting. Thanks for the info.

You clearly have a path forward. Closing. If you want to revisit this then we can reopen this issue or open a new one.