revise code to merge in non-conflicting data

SimonGreenhill commented 4 years ago

Rather than simply taking the 'best' sheet into grambank-cldf, we should merge in non-conflicting work sheets too. Any data points that conflict should be logged somewhere so they can be checked (but still left out).
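A minimal sketch of that idea, not the actual pygrambank code: each sheet is reduced to a `{feature_id: value}` dict, agreeing datapoints are merged, and disagreeing ones are collected for checking. The function name `merge_sheets`, the sheet names, and the rule that empty cells count as uncoded are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_sheets(sheets):
    """Merge datapoints from several sheets for one language.

    `sheets` maps a (hypothetical) sheet name to {feature_id: value}.
    Returns (merged, conflicts): `merged` holds every feature on which all
    sheets that code it agree; `conflicts` holds the features left out.
    """
    values = defaultdict(dict)  # feature_id -> {sheet name: value}
    for name, datapoints in sheets.items():
        for feature_id, value in datapoints.items():
            if value not in ("", None):  # assumption: empty means uncoded
                values[feature_id][name] = value

    merged, conflicts = {}, {}
    for feature_id, by_sheet in values.items():
        if len(set(by_sheet.values())) == 1:
            merged[feature_id] = next(iter(by_sheet.values()))
        else:
            conflicts[feature_id] = by_sheet  # logged for checking, not merged
    return merged, conflicts


# Hypothetical example: two sheets for the same language.
sheets = {
    "HS_abcd1234.tsv": {"GB020": "1", "GB021": "0", "GB022": ""},
    "RF_abcd1234.tsv": {"GB020": "1", "GB021": "1", "GB022": "1"},
}
merged, conflicts = merge_sheets(sheets)
for feature_id, by_sheet in conflicts.items():
    print("conflict", feature_id, by_sheet)
```

Here GB020 (same value in both sheets) and GB022 (coded in only one sheet) would be merged, while GB021 is only logged.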

HedvigS commented 4 years ago

After the conflicts have been resolved, it seems to me that the best thing would be to have only one sheet per language, with the initials of all coders who have contributed in the filename. That's what I'm planning will happen post conflict-resolution. Does that sound alright?

xrotwang commented 4 years ago

That's one way to do it. Alternatively, we could keep the names of the "best" sheets, and add the coders of the merged data in the contributed_datapoints column. This might be slightly more transparent.

HedvigS commented 4 years ago

@xrotwang Not fully sure I follow. We want to avoid situations later on when edits might need to be made and coders would have to check in on more than one sheet.

SimonGreenhill commented 4 years ago

I think I'd rather keep these separate.

xrotwang commented 4 years ago

I think we have to weigh the following issues:

- Merging datapoints into a single sheet per glottocode will make it easier to maintain. With the current setup it could happen that datapoints are added to a previously "not-selected" sheet, thereby implicitly overwriting datapoints once this sheet gains the biggest number of coded features.
- Keeping sheets separate will make the provenance more explicit. This would be necessary, for example, if we were to re-run the import from the HunterGatherer data.
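The implicit overwriting described in the first point can be illustrated with a small sketch. It assumes that the "best" sheet is simply the one with the most non-empty values; the helper names, the sheet names, and the tie-breaking rule are hypothetical and not taken from pygrambank.

```python
def coded_count(sheet):
    """Number of features with a non-empty value in one sheet."""
    return sum(1 for value in sheet.values() if value not in ("", None))


def best_sheet(sheets):
    """Name of the sheet with the most coded features (ties: first by name)."""
    return max(sorted(sheets), key=lambda name: coded_count(sheets[name]))


# Hypothetical sheets for one glottocode; they disagree on GB020.
sheets = {
    "A_abcd1234.tsv": {"GB020": "1", "GB021": "0", "GB022": "1"},
    "B_abcd1234.tsv": {"GB020": "0", "GB021": "0"},
}
before = best_sheet(sheets)

# A coder adds two datapoints to the previously not-selected sheet ...
sheets["B_abcd1234.tsv"].update({"GB023": "1", "GB024": "0"})
after = best_sheet(sheets)

# ... and the exported value for GB020 changes without anyone editing it.
print(before, sheets[before]["GB020"])
print(after, sheets[after]["GB020"])
```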

My proposal provides a bit of a compromise: We would have only one sheet per glottocode. But at least for the merged datapoints we could trace back easily where they came from. We'd lose this simple traceability for datapoints currently present in multiple sheets.

That said, since all is under version control, with a bit more effort provenance could be discovered in all scenarios.

HedvigS commented 4 years ago

I understand what @xrotwang means now. I thought this might be what you said before but I wasn't sure. I think merging to one file is a good idea.

There are some tricky things to consider with sheets that have extra columns which aren't "useful" as such for the GB data processing but which are part of that coder's documentation. Merging in such cases could get a bit messy.

SimonGreenhill commented 4 years ago

No rush on this -- I don't want to change the data format for the coders again when we're creeping up to 2000 sheets and a first release. Perhaps as step 1 we can merge the non-conflicting data points during the creation of grambank-cldf?

xrotwang commented 4 years ago

Hm, I'd say if we do this, then rather before the first release.

I also think that there shouldn't be any important information in extra columns that will be hard to merge. Either we do know that such information exists - and then it should be in "official" columns - or we don't and then such information is already somewhat faded and keeping it only in historic versions of the repos will just remove it one more step.

SimonGreenhill commented 4 years ago

There's no reason not to have multiple rows per feature in a language, right? So we can handle conflicting things with something like:

Feature_ID,Coder,Value
GB501,Hedvig,1
GB501,Robert,1
GB501,Simon,0

xrotwang commented 4 years ago

@SimonGreenhill I wouldn't do this. This would only make sense if we wanted to be able to compute stuff like inter-coder agreement from these sheets; but I think we are already past the point where this would have been an option, considering that

- some sheets are batch imports, computed (in different ways over time) from other questionnaires,
- some sheets have been corrected without keeping the old values in separate rows.

So the transparency which might be expected from the setup you propose is just not there.

SimonGreenhill commented 4 years ago

We do need to be able to track inter-coder reliability as someone has a paper planned on this, so we don't want to lose that information.

xrotwang commented 4 years ago

@SimonGreenhill But that's only for the "controlled experiment", i.e. the set of sheets that have been part of the first trials, right? Maybe we should snapshot this set and store it separately anyway? It may have been distorted already by changing the underlying feature set?

HedvigS commented 4 years ago

There are two different things going on here:

a) coder inter-reliability paper (which was originally on Harald's table but has sort of been shifted to mine, I believe)
b) releasing a dataset without conflicts

We shouldn't have any new conflicts emerging: the workflow for assigning languages is more centralised now and the pilot phase is long over. The majority of conflicts were due to the pilot phase or the Sahul / HG import. I've got Harald's report from the pilot phase if anyone wants it, and I'm compiling a report on the full set.

The data for (a) we've currently got easy access to due to our structure in original_sheets. Do we want to continue having these conflicts stored or do we want to "purge" them from original_sheets (i.e. only visible in history)?

There are 154 glottocodes with at least one conflict, spread out over 332 sheets. My plan was to consolidate the multiple sheets into one sheet per glottocode with this kind of layout:

i) sheet with conflicts aggregated to glottocode. Filename: asdf1241.tsv

Feature_ID	Value	Contributed_datapoints	conflict?
GB501	1	Hedvig	yes
GB501	1	Robert	yes
GB501	0	Simon	yes
GB022	1	Hedvig, Robert, Simon	no
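As a rough sketch of how a type (i) sheet could be generated: the function below takes each coder's datapoints and emits one row per coder where values disagree, and a single row with all coders where they agree. The input shape `{coder: {feature_id: value}}` and the name `aggregate` are assumptions for illustration only.

```python
import csv
import io

COLUMNS = ["Feature_ID", "Value", "Contributed_datapoints", "conflict?"]


def aggregate(codings):
    """Turn {coder: {feature_id: value}} into rows of the type (i) layout."""
    by_feature = {}  # feature_id -> [(coder, value), ...]
    for coder, datapoints in codings.items():
        for feature_id, value in datapoints.items():
            by_feature.setdefault(feature_id, []).append((coder, value))

    rows = []
    for feature_id, coded in by_feature.items():
        if len({value for _, value in coded}) > 1:
            # Conflict: keep one row per coder so the checker sees every value.
            for coder, value in coded:
                rows.append([feature_id, value, coder, "yes"])
        else:
            # Agreement: one row, all coders listed together.
            coders = ", ".join(coder for coder, _ in coded)
            rows.append([feature_id, coded[0][1], coders, "no"])
    return rows


codings = {
    "Hedvig": {"GB501": "1", "GB022": "1"},
    "Robert": {"GB501": "1", "GB022": "1"},
    "Simon": {"GB501": "0", "GB022": "1"},
}
rows = aggregate(codings)

# Write the rows as tab-separated text, header first.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(rows)
print(buffer.getvalue())
```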

These sheets would then be re-examined by our experienced coders, who decide on a final coding, only concerning themselves with rows where the "conflict?" column says "yes". Once all conflicts are resolved, I was imagining it'd look like this (with Daniel being the example checker in this case):

ii) solved sheet. Filename: asdf1241.tsv

Feature_ID	Value	Contributed_datapoints
GB501	1	Hedvig, Robert, Daniel
GB022	1	Hedvig, Robert, Simon

During this process, I want to take the sheets of type (i) and store them in a separate folder. This is the dataset we'd run inter-coder-reliability checks over. The resolved sheets would be put back into original_sheets. It'd also be good if all tsv sheets in original_sheets only had the glottocode in the filename and the coder information moved to the col "contributed_datapoints".

I think that this would be the easiest for the coders, only one place to "pick up" your load, only one place where coder is tracked (col contributed_datapoints and not also filename).

I realise that it would not be transparent and would not track history well, since a bunch of new files would be created all the time. I'm open to another approach which makes use of Git versioning properly, or to other kinds of improvement. This is just my suggestion at this time.

We can save the folder with sheets of type (i) indefinitely, but it would not be part of the CLDF-release.

I don't know what Robert means when he says that sheets "may have been distorted already by changing the underlying feature set". All features remain in all sheets, regardless of active status. Unless inactive feature rows were removed during the great tsvization?

xrotwang commented 4 years ago

Despite the fact that @HedvigS 's proposal means renaming all sheets, I'm ok with this.

HedvigS commented 4 years ago

@xrotwang haha that's nice to hear :D :D . If it's done right, there should be a way of logging it only as filename changes first and not a full deletion and addition.

SimonGreenhill commented 4 years ago

Yes, only nitpick is that 'conflict?' is redundant as we know this from the data anyway.

HedvigS commented 4 years ago

@SimonGreenhill sure, agreed. I just wanted to make it easier for the coders when they open the file and have it clearly marked what they are to do. It can be difficult to visually see things like this sometimes, human error creeps in easily.

HedvigS commented 4 years ago

Cool, seems like we agree mostly then. I'll get back to this once we've reached 2,000 :)

HedvigS commented 4 years ago

Hi @xrotwang and @SimonGreenhill . We are nearing 2,000 and I'm making preparations for sorting out the conflicting data. Are we still okay for the plan that we discussed in this thread?

I propose specifically:

1) all tsv files in the grambank/original_sheets are renamed such that there is only one file per glottocode and coders are instead indicated in the "contributed_datapoints" column

2) Hedvig coordinates solving the conflicts, which are signalled by more than one row per feature and per language. Duplicates of conflicting sheets are stored in a separate folder for posterity.

3) the cldf dump does not contain the best sheet, but all non-conflicting datapoints.
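For step (1), the first task is working out which existing files belong to the same glottocode. The sketch below does this in Python and assumes file names of the form `<coder initials>_<glottocode>.tsv`, with several coders joined by hyphens; that naming pattern and the example file names are assumptions, not a description of the actual repository.

```python
import re
from collections import defaultdict

# Glottocodes: four alphanumeric characters followed by four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def group_by_glottocode(filenames):
    """Return {glottocode: [(coders, filename), ...]} for sheet-like names."""
    groups = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        coders, _, code = stem.rpartition("_")
        if GLOTTOCODE.fullmatch(code):  # skip anything that is not a sheet
            groups[code].append((coders.split("-"), name))
    return dict(groups)


# Hypothetical directory listing.
filenames = [
    "HS_abcd1234.tsv",
    "RF-SG_abcd1234.tsv",
    "HS_wxyz5678.tsv",
    "README.md",
]
groups = group_by_glottocode(filenames)

# Glottocodes with more than one sheet are the ones that need merging.
needs_merge = sorted(code for code, sheets in groups.items() if len(sheets) > 1)
print(needs_merge)
```

Each group would then be written to a single `<glottocode>.tsv`, with the extracted initials moved into the "contributed_datapoints" column.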

This plan requires changes to the coders' workflow as well as the cldf dumping scripts. I have a meeting next week with Alena, Tobias and Jay. I'd like to go over this with them before we roll out steps (1) and (2) with the coders, just so everyone is on board.

Once this is cleared with them, I think I can write an R script that does step (1) with some tidyverse gymnastics, unless you @xrotwang already have a plan.

SimonGreenhill commented 4 years ago

Hi both, re (1) - do we need to rename the sheets? I'd like to not mess with the workflow at this point. We can handle the merging in the cldf generation code and keep the workflow and files the same. (2) can be logged during the cldf-generation process.

HedvigS commented 4 years ago

@SimonGreenhill Maybe. How to handle the resolved conflict sheets?

I have a suggestion: we combine the two (or more) conflicting sheets into one file and just tag on all the coder initials in the filename. The unresolved file would have more than one line per feature and language, and the resolved file would have... either one line per feature and language OR the preferred coding, as checked by our senior coder, somehow indicated, say in a separate column.

SimonGreenhill commented 4 years ago

That sounds complicated. Just change the incorrect 'original sheet'.

HedvigS commented 4 years ago

@SimonGreenhill I don't understand. There'll be more than one original sheet for languages with conflicting coding.

xrotwang commented 4 years ago

What if we just merge the resolved datapoints into the "best" sheet - indicating the responsible coder in the "contributed_datapoints" column? So we'd need a list of conflicts, then would only edit the "best" sheets of glottocodes with conflicts, and the workflow could stay exactly as it is.

HedvigS commented 4 years ago

@xrotwang I think that's essentially what I proposed.

HedvigS commented 4 years ago

Oh, I see, the renaming. Sure. Why not, not a big issue for me

SimonGreenhill commented 4 years ago

^ that sounds good to me @xrotwang (keeps the same workflow, etc)

HedvigS commented 4 years ago

Personally I think that indicating coders in two places (file name and the contributed_datapoints column) is messy, in particular when there is only one sheet per language, but I'm willing to be flexible.

HedvigS commented 4 years ago

How would we keep track of conflicts historically if the best sheet is edited?

xrotwang commented 4 years ago

Just the way other edits in the sheets are tracked: the git history will show that something changed, and further inspection would reveal what exactly.

HedvigS commented 4 years ago

So far during this project commits and file history have often been hard to understand from my perspective. I was made head of the coder inter reliability paper, and I think that digging through git history would become inconvenient later on honestly.

I'm fine with some other way besides renaming all sheets, but in that case I would actually prefer it if we kept a duplicate of all known duplicate coding somewhere else.

For simplicity for the coders, I think that one sheet = one language and one place where coders are indicated is easiest. If that one sheet keeps the filename of the "best" sheet, that shouldn't be much of a problem, but I think that keeping non-best sheets sloshing around in the folders may get complicated.

How difficult is it for the cldf dumping process to ignore conflicting coding in the best sheet until the senior coders rechecking conflicts are done?

SimonGreenhill commented 4 years ago

`git log <filename>` and `git blame <filename>` tell you everything, e.g. https://github.com/glottobank/pygrambank/blame/master/src/pygrambank/api.py
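For anyone who wants to script this rather than type the commands by hand, here is a small sketch that wraps the relevant git invocations with Python's `subprocess`. The helper names and the example path are hypothetical; the git options used (`log --follow -p`, `blame`, and `show <rev>:<path>`) are standard.

```python
import subprocess


def history_cmd(path):
    """Full change history of one file, with diffs, following renames."""
    return ["git", "log", "--follow", "-p", "--", path]


def blame_cmd(path):
    """Last commit and author for every line of the file."""
    return ["git", "blame", "--", path]


def show_cmd(rev, path):
    """Content of the file as it was at revision `rev` (path from repo root)."""
    return ["git", "show", f"{rev}:{path}"]


def run_git(cmd, repo="."):
    """Run one of the commands above inside `repo` and return its output."""
    result = subprocess.run(
        cmd, cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical sheet path, relative to the repository root.
sheet = "original_sheets/HS_abcd1234.tsv"
print(" ".join(history_cmd(sheet)))
```

Calling `run_git(show_cmd("HEAD~10", sheet), repo="path/to/checkout")` would, for example, return the sheet as it was ten commits ago, which is one way to recover a coding that has since been edited.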

HedvigS commented 4 years ago

Okay. I don't doubt that it can be done; I just don't know how to do it yet, and I suspect there may be further complications down the line, which I don't know about now, that make it difficult to retrieve all duplicate coding as of a certain point in the history of the repos. I'm more than happy to be wrong, I'm just very sceptical right now of things working smoothly and would prefer a plan B backup: a folder with a copy of all duplicate coding.

Haha, am I paranoid?

xrotwang commented 4 years ago

Tracing all different codings any datapoint has ever had might be somewhat ambitious - but should be doable. But note that we already bury some of that variation in the git history (in case a sheet is edited).

xrotwang commented 4 years ago

Just looked at the title of this issue: "merge in non-conflicting data". So I guess that case is simple, right? We just merge datapoints into the best sheet and add the contributors. Maybe we start doing this? Btw. did we have counts of conflicts? And do we know how many are "real" - i.e. conflicting interpretation of the sources rather than conflicts where one coding was automatically derived from another dataset?

HedvigS commented 4 years ago

This issue thread has become wider than the original intention. We do want to merge in non-conflicting data, yes. We also want to keep track of conflicts and solve them.

We can tell apart conflicts between merge-ins from ones we've "created ourselves" by the coders listed in the filename. At this point, we were looking at solving those as well, but perhaps after the ones we have generated ourselves.

HedvigS commented 4 years ago

Merging in non-conflicting data for the cldf dump should be fairly easy, I think? The crux is how to manage the workflow during the revisions. I think that for coders working through the conflicts, one sheet = one language is overall easier to deal with rather than having to look up the "best" sheet and only make changes there.

HedvigS commented 4 years ago

I'm having a meeting tomorrow with the other node leaders to talk about how we are coordinating over correcting coding conflicts. I'd like to use an approach where the coders only need to check one sheet per language and where there is more than one row per feature and language representing the conflict. They then consult the relevant sources and reduce the multiple rows to one which carries the appropriate coding.

As for the rest, in terms of exact file names and how the cldf dump is rendered, I'm happy to do it whichever way you two deem most efficient.

I'm just about to check up on how many conflicts are a result of the HG import. I suggest we override all of those automatically with our own coding.

xrotwang commented 4 years ago

But if we go for such a restructuring - just one sheet per glottocode - then we should also limit the set of columns to a fixed known list. Otherwise merging will be harder and make the not-exactly defined columns even less transparent.

HedvigS commented 4 years ago

@xrotwang Sure. For now, I'm only considering the immediate urgent conflict matter and the sheets concerned there. I'd be willing to do that separately from the rest if we need more time to discuss an overall restructuring.

xrotwang commented 4 years ago

I'm counting 4935 conflicting codings for 154 different languoids. With so many (>300) affected sheets, I'd say we either restructure the whole repos now or nothing.

xrotwang commented 4 years ago

Excluding the sheets of the Hunter-Gatherer database we're at 107 languoids and 3733 conflicting datapoints.

HedvigS commented 4 years ago

Overall restructuring sounds good to me. If we three can agree on a precise course of action today or tomorrow morning, I can get the coders started very soon.

HedvigS commented 4 years ago

Not sure what has happened, but the way I'm checking for duplicates comes up with 3,640 data points excluding auto translated ones.

xrotwang commented 4 years ago

Ok, here's a proposal:

original_sheets will have only one sheet per glottocode, named abcd1234.tsv with columns:
```
Feature_ID,Value,Source,Comment,Contributed_datapoints,Selected
```
where the Selected column is a flag signalling which of the conflicting values should be picked up in the CLDF.
These sheets are created by merging coded datapoints from all corresponding current sheets. Columns which don't make it into the merged sheet could go into a legacy_data.tsv with columns Sheet,Column,Value.
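A sketch of what such a merge could look like, under several assumptions: each current sheet is given as `(sheet_name, coders, rows)`, identical values from different sheets collapse into one row with combined contributors, conflicting values stay as separate rows with an empty Selected, and any non-empty cell in an unknown column is spilled into legacy rows. The helper names, the `Notes` column, and the placeholder sources are hypothetical.

```python
MERGED_COLUMNS = ["Feature_ID", "Value", "Source", "Comment",
                  "Contributed_datapoints", "Selected"]


def join_unique(existing, new, sep="; "):
    """Append `new` to a separator-joined string unless already present."""
    parts = [part for part in existing.split(sep) if part]
    if new and new not in parts:
        parts.append(new)
    return sep.join(parts)


def merge_glottocode(sheets):
    """Merge all sheets of one glottocode; return (merged_rows, legacy_rows)."""
    merged = {}  # (feature_id, value) -> merged row
    legacy = []  # rows for legacy_data.tsv: Sheet, Column, Value
    for sheet_name, coders, rows in sheets:
        for row in rows:
            for column, cell in row.items():
                if column not in MERGED_COLUMNS and cell:
                    legacy.append(
                        {"Sheet": sheet_name, "Column": column, "Value": cell})
            value = row.get("Value", "").strip()
            if not value:
                continue  # only coded datapoints are merged
            target = merged.setdefault((row["Feature_ID"], value), {
                "Feature_ID": row["Feature_ID"], "Value": value, "Source": "",
                "Comment": "", "Contributed_datapoints": "", "Selected": ""})
            target["Source"] = join_unique(
                target["Source"], row.get("Source", ""))
            target["Comment"] = join_unique(
                target["Comment"], row.get("Comment", ""))
            target["Contributed_datapoints"] = join_unique(
                target["Contributed_datapoints"], coders, sep=", ")
    return list(merged.values()), legacy


# Hypothetical input: two sheets for abcd1234, one with an extra column.
sheets = [
    ("HS_abcd1234.tsv", "HS", [
        {"Feature_ID": "GB020", "Value": "1", "Source": "source-A",
         "Comment": "", "Notes": "check p. 12"},
        {"Feature_ID": "GB021", "Value": "0", "Source": "source-A",
         "Comment": ""},
    ]),
    ("RF_abcd1234.tsv", "RF", [
        {"Feature_ID": "GB020", "Value": "1", "Source": "source-A",
         "Comment": ""},
        {"Feature_ID": "GB021", "Value": "1", "Source": "source-B",
         "Comment": "see ch. 3"},
    ]),
]
merged_rows, legacy_rows = merge_glottocode(sheets)
```

In this example GB020 ends up as one row contributed by "HS, RF", GB021 as two rows awaiting a Selected flag, and the `Notes` cell goes to the legacy rows.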

xrotwang commented 4 years ago

@HedvigS I'm checking for conflicts only - not duplicates. Also I'm ignoring empty values.
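To make the distinction concrete, here is a small sketch that counts both things over the same data: a "duplicate" is a language/feature pair coded in more than one sheet, a "conflict" is a duplicate whose values differ, and empty values are ignored. How either of the counts quoted above was actually computed is not shown in this thread, so this only illustrates that the two definitions give different numbers.

```python
from collections import defaultdict


def count_duplicates_and_conflicts(datapoints):
    """Count language/feature pairs coded more than once, and those that differ.

    `datapoints` is an iterable of (glottocode, feature_id, value) tuples,
    one per sheet row.
    """
    values = defaultdict(list)
    for glottocode, feature_id, value in datapoints:
        if value.strip():  # empty values are ignored
            values[(glottocode, feature_id)].append(value.strip())
    duplicates = sum(1 for coded in values.values() if len(coded) > 1)
    conflicts = sum(1 for coded in values.values() if len(set(coded)) > 1)
    return duplicates, conflicts


# Hypothetical rows from two sheets for the same language.
datapoints = [
    ("abcd1234", "GB020", "1"), ("abcd1234", "GB020", "1"),  # duplicate only
    ("abcd1234", "GB021", "0"), ("abcd1234", "GB021", "1"),  # conflict
    ("abcd1234", "GB022", ""), ("abcd1234", "GB022", "1"),   # neither
]
duplicates, conflicts = count_duplicates_and_conflicts(datapoints)
print(duplicates, conflicts)
```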

HedvigS commented 4 years ago

@xrotwang Love it! So, the coders checking conflicts would either type something into the "Selected" col of an existing row, or make a new row and type something into its "Selected" col (in case none of the existing values is correct).
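On the CLDF side, picking values from such a sheet could look roughly like the sketch below. It assumes that any non-empty entry in Selected counts as the flag (the thread does not fix what exactly gets typed there) and that features with several values but no single selected row are left out until someone resolves them. The function name `pick_values` is hypothetical.

```python
def pick_values(rows):
    """Choose one value per feature from a merged sheet.

    Returns (picked, unresolved): `picked` maps feature IDs to the value to
    export; `unresolved` lists features whose conflict is still open.
    """
    by_feature = {}
    for row in rows:
        by_feature.setdefault(row["Feature_ID"], []).append(row)

    picked, unresolved = {}, []
    for feature_id, candidates in by_feature.items():
        if len({row["Value"] for row in candidates}) == 1:
            picked[feature_id] = candidates[0]["Value"]  # no conflict
            continue
        selected = [r for r in candidates if r.get("Selected", "").strip()]
        if len(selected) == 1:
            picked[feature_id] = selected[0]["Value"]  # checker's decision
        else:
            unresolved.append(feature_id)  # left out of the export for now
    return picked, unresolved


# Hypothetical merged sheet: GB021 has been resolved, GB022 has not.
rows = [
    {"Feature_ID": "GB020", "Value": "1", "Selected": ""},
    {"Feature_ID": "GB021", "Value": "0", "Selected": ""},
    {"Feature_ID": "GB021", "Value": "1", "Selected": "x"},
    {"Feature_ID": "GB022", "Value": "0", "Selected": ""},
    {"Feature_ID": "GB022", "Value": "1", "Selected": ""},
]
picked, unresolved = pick_values(rows)
```

A new row added by the checker with its own value and a filled Selected cell is handled the same way as a flagged existing row.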

HedvigS commented 4 years ago

Okay, as soon as this is also agreed on by @SimonGreenhill I'll start notifying people.

HedvigS commented 4 years ago

The more I think of this suggestion, Robert, the more I like it :D !

xrotwang commented 4 years ago

@HedvigS Note that the list of allowed columns in this proposal is rather short - so there may not be a lot of context available in the sheets to work from (e.g. no feature title or feature description).
