Closed SimonGreenhill closed 2 years ago
After the conflicts have been resolved, it seems to me that the best thing would be to have only one sheet per language, with the initials of all coders who have contributed in the filename. That's what I'm planning will happen post conflict-resolution. Does that sound alright?
That's one way to do it. Alternatively, we could keep the names of the "best" sheets, and add the coders of the merged data in the contributed_datapoints column. This might be slightly more transparent.
@xrotwang Not fully sure I follow. We want to avoid situations later on when edits might need to be made and coders would have to check in on more than one sheet.
I think I'd rather keep these separate.
I think we have to weigh the following issues:
My proposal provides a bit of a compromise: We would have only one sheet per glottocode. But at least for the merged datapoints we could trace back easily where they came from. We'd lose this simple traceability for datapoints currently present in multiple sheets.
That said, since all is under version control, with a bit more effort provenance could be discovered in all scenarios.
I understand what @xrotwang means now. I thought this might be what you said before but I wasn't sure. I think merging to one file is a good idea.
There are some tricky things to consider with some sheets where there are extra columns that aren't "useful" as such for the GB data processing but which are part of that coder's documentation. When merging in such cases it could get a bit messy.
No rush on this -- I don't want to change the data format for the coders again when we're creeping up to 2000 sheets and a first release. Perhaps as step 1 we can merge the non-conflicting data points during the creation of grambank-cldf?
Hm, I'd say if we do this, then rather before the first release.
I also think that there shouldn't be any important information in extra columns that will be hard to merge. Either we do know that such information exists - and then it should be in "official" columns - or we don't and then such information is already somewhat faded and keeping it only in historic versions of the repos will just remove it one more step.
There's no reason not to have multiple rows per feature in a language, right? So we can handle conflicting values with something like:
Feature_ID,Coder,Value
GB501,Hedvig,1
GB501,Robert,1
GB501,Simon,0
@SimonGreenhill I wouldn't do this. This would only make sense if we wanted to be able to compute stuff like inter-coder agreement from these sheets; but I think we are already past the point where this would have been an option, considering that
So the transparency which might be expected from the setup you propose is just not there.
We do need to be able to track inter-coder reliability as someone has a paper planned on this, so we don't want to lose that information.
@SimonGreenhill But that's only for the "controlled experiment", i.e. the set of sheets that have been part of the first trials, right? Maybe we should snapshot and store this set separately anyway? It may have been distorted already by changing the underlying feature set.
There are two different things going on here
a) coder inter-reliability paper (which was originally on Harald's table but has sort of been shifted to mine I believe)
b) releasing a dataset without conflicts
We shouldn't have any new conflicts emerging, the workflow for assigning languages is more centralised now and the pilot phase is long over. The majority of conflicts were due to the pilot phase or Sahul / HG import. I've got Harald's report from the pilot phase if anyone wants it and I'm compiling a report on the full set.
The data for (a) we've currently got easy access to due to our structure in original_sheets. Do we want to continue having these conflicts stored or do we want to "purge" them from original_sheets (i.e. only visible in history)?
There are 154 glottocodes with at least one conflict, spread out over 332 sheets. My plan was to consolidate the multiple sheets into one sheet per glottocode with this kind of layout:
i) sheet with conflicts aggregated to glottocode. Filename: asdf1241.tsv
Feature_ID | Value | Contributed_datapoints | conflict? |
---|---|---|---|
GB501 | 1 | Hedvig | yes |
GB501 | 1 | Robert | yes |
GB501 | 0 | Simon | yes |
GB022 | 1 | Hedvig, Robert, Simon | no |
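The layout in (i) could be produced along the following lines. This is only a minimal sketch, not the project's actual tooling: the function name `aggregate_sheets` and the in-memory input format (one dict of codings per coder) are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_sheets(sheets):
    """Combine several coders' codings for one glottocode.

    `sheets` maps a coder name to a {feature_id: value} dict (hypothetical
    input format). Agreeing codings collapse into one row listing all coders;
    disagreeing codings keep one row per coder, flagged as a conflict.
    """
    by_feature = defaultdict(list)
    for coder, codings in sheets.items():
        for feature_id, value in codings.items():
            by_feature[feature_id].append((coder, value))

    rows = []
    for feature_id, codings in by_feature.items():
        values = {value for _, value in codings}
        if len(values) == 1:
            coders = ", ".join(coder for coder, _ in codings)
            rows.append((feature_id, values.pop(), coders, "no"))
        else:
            rows.extend((feature_id, value, coder, "yes") for coder, value in codings)
    return rows


sheets = {
    "Hedvig": {"GB501": "1", "GB022": "1"},
    "Robert": {"GB501": "1", "GB022": "1"},
    "Simon": {"GB501": "0", "GB022": "1"},
}
rows = aggregate_sheets(sheets)
for row in rows:
    print(" | ".join(row))
```

With the three example coders, this yields three GB501 rows marked "yes" and a single GB022 row marked "no", mirroring the table.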
These sheets would then be re-examined by our experienced coders, who decide on a final coding, only concerning themselves with rows where it is "yes" in the "conflict?" column. Once all conflicts are resolved, I was imagining it'd look like this (with Daniel being the example checker in this case):
ii) solved sheet. Filename: asdf1241.tsv
Feature_ID | Value | Contributed_datapoints |
---|---|---|
GB501 | 1 | Hedvig, Robert, Daniel |
GB022 | 1 | Hedvig, Robert, Simon |
During this process, I want to take the sheets of type (i) and store them in a separate folder. This is the dataset we'd run inter-coder-reliability checks over. The resolved sheets would be put back into original_sheets. It'd also be good if all tsv sheets in original_sheets only had the glottocode in the filename and the coder information moved to the col "contributed_datapoints".
I think that this would be the easiest for the coders, only one place to "pick up" your load, only one place where coder is tracked (col contributed_datapoints and not also filename).
I realise that it would not be transparent and track history well since there would be a bunch of new files created all the time. I'm open to another approach which makes use of Git versioning properly or other kinds of improvement. This is just my suggestion at this time.
We can save the folder with sheets of type (i) indefinitely, but it would not be part of the CLDF-release.
I don't know what Robert means when he says that sheets "may have been distorted already by changing the underlying feature set". All features remain in all sheets, regardless of active status. Unless inactive feature rows were removed during the great tsvization?
Despite the fact that @HedvigS 's proposal means renaming all sheets, I'm ok with this.
@xrotwang haha that's nice to hear :D :D . If it's done right, there should be a way of logging it only as filename changes first and not a full deletion and addition.
Yes, only nitpick is that 'conflict?' is redundant as we know this from the data anyway.
@SimonGreenhill sure, agreed. I just wanted to make it easier for the coders when they open the file and have it clearly marked what they are to do. It can be difficult to visually see things like this sometimes, human error creeps in easily.
Cool, seems like we agree mostly then. I'll get back to this once we've reached 2,000 :)
Hi @xrotwang and @SimonGreenhill . We are nearing 2,000 and I'm making preparations for sorting out the conflicting data. Are we still okay for the plan that we discussed in this thread?
I propose specifically:
1) all tsv files in the grambank/original_sheets are renamed such that there is only one file per glottocode and coders are instead indicated in the "contributed_datapoints" column
2) Hedvig coordinates solving the conflicts, which are signalled by more than one row per feature and per language. Duplicates of conflicting sheets are stored in a separate folder for posterity.
3) the cldf dump does not contain the best sheet, but all non-conflicting datapoints.
This requires changes to the coders' workflow as well as the cldf dumping scripts. I have a meeting next week with Alena, Tobias and Jay. I'd like to go over this with them before we roll out step (1) and (2) with the coders, just so everyone is on board.
Once this is cleared with them, I think I can write an R script that does step (1) with some tidyverse gymnastics, unless you @xrotwang already have a plan.
Hi both, re (1) - do we need to rename the sheets? I'd like to not mess with the workflow at this point. We can handle the merging in the cldf generation code and keep the workflow and files the same. (2) can be logged during the cldf-generation process.
@SimonGreenhill Maybe. How to handle the resolved conflict sheets?
I have a suggestion: we combine the two (or more) sheets that are conflicting into one file and just tag on all the coder initials in the filename. The unresolved file would have more than one line per feature and language, and the resolved one would have... either one line per feature and language OR the preferred coding as checked by our senior coder is somehow indicated, say in a separate column.
that sounds complicated. Just change the incorrect 'original sheet'
@SimonGreenhill I don't understand. There'll be more than one original sheet for languages with conflicting coding.
What if we just merge the resolved datapoints in the "best" sheet - indicating the responsible coder in the "contributed_datapoints" column? So we'd need a list of conflicts, then would only edit "best" sheets per glottocode with conflicts and the workflow could stay exactly as it is.
@xrotwang I think that's essentially what I proposed.
Oh, I see, the renaming. Sure. Why not, not a big issue for me
^ that sounds good to me @xrotwang (keeps the same workflow, etc)
Personally I think that indicating coders in two places (file name and contributed_datapoints column) is messy, in particular when there is only one sheet per language, but I'm willing to be flexible.
How would we keep track of conflicts historically if the best sheet is edited?
Just the way other edits in the sheets are tracked: the git history will show that something changed, and further inspection would reveal what exactly.
So far during this project commits and file history have often been hard to understand from my perspective. I was made head of the coder inter reliability paper, and I think that digging through git history would become inconvenient later on honestly.
I'm fine with some other way besides renaming all sheets, but in that case I would actually prefer it if we kept a duplicate of all known duplicate coding somewhere else.
For simplicity for the coders, I think that one sheet = one language and one place where coders are indicated is easiest. If that one sheet keeps the filename of the "best" sheet, that shouldn't be much of a problem, but I think if we keep non-best sheets sloshing around in the folders, things may get complicated.
How difficult is it for the cldf dumping process to ignore conflicting coding in the best sheet until the senior coders rechecking conflicts are done?
git log filename and git blame filename tell you everything e.g. https://github.com/glottobank/pygrambank/blame/master/src/pygrambank/api.py
Okay. I don't doubt that it can be done, I just don't know how to do it yet and I just suspect there may be further complications down the line that I don't know about now that makes it difficult to easily retrieve all duplicate coding at a certain point in time of the repos. I'm more than happy to be wrong, I'm just very sceptical right now of things working smoothly and would prefer a plan B backup which is a folder with a copy of all duplicate coding.
Haha, am I paranoid?
Tracing all different codings any datapoint has ever had might be somewhat ambitious - but should be doable. But note that we already bury some of that variation in the git history (in case a sheet is edited).
Just looked at the title of this issue: "merge in non-conflicting data". So I guess that case is simple, right? We just merge datapoints into the best sheet and add the contributors. Maybe we start doing this? Btw. did we have counts of conflicts? And do we know how many are "real" - i.e. conflicting interpretation of the sources rather than conflicts where one coding was automatically derived from another dataset?
This issue thread has become wider than the original intention. We do want to merge in non-conflicting data, yes. We also want to keep track of conflicts and solve them.
We can tell apart conflicts from merge-ins from ones we've "created ourselves" by the coders listed in the filename. At this point, we were looking at solving those as well, but perhaps after the ones we have generated ourselves.
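A rough sketch of that filename check follows. It rests on assumptions not spelled out in this thread: that sheets are named `<initials joined by "-">_<glottocode>.tsv`, and that imported sheets carry a dedicated coder code such as `HG`. Both the naming pattern and the `IMPORT_CODES` set are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical coder codes marking sheets imported from other datasets.
IMPORT_CODES = {"HG"}


def coders_from_filename(filename):
    """Split an assumed '<initials>-<initials>_<glottocode>.tsv' name."""
    stem = Path(filename).stem
    initials, _, glottocode = stem.rpartition("_")
    return set(initials.split("-")), glottocode


def involves_import(filenames):
    """True if any of the conflicting sheets comes from an import."""
    return any(coders_from_filename(f)[0] & IMPORT_CODES for f in filenames)


print(coders_from_filename("HS-RF_abcd1234.tsv"))
print(involves_import(["HG_abcd1234.tsv", "HS_abcd1234.tsv"]))
print(involves_import(["HS_abcd1234.tsv", "RF_abcd1234.tsv"]))
```

Conflicts where `involves_import` is true could then be queued after the ones between our own coders.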
Merging in non-conflicting data for the cldf dump should be fairly easy, I think? The crux is how to manage the workflow during the revisions. I think that for coders working through the conflicts, one sheet = one language is overall easier to deal with rather than having to look up the "best" sheet and only make changes there.
I'm having a meeting tomorrow with the other node leaders to talk about how we are coordinating over correcting coding conflicts. I'd like to use an approach where the coders only need to check one sheet per language and where there is more than one row per feature and language representing the conflict. They then consult the relevant sources and reduce the multiple rows to one which carries the appropriate coding.
As for the rest, in terms of exact file names and how the cldf dump is rendered, I'm happy to do it whichever way you two deem most efficient.
I'm just about to check up on how many conflicts are a result of the HG import. I suggest we override all of those automatically with our own coding.
But if we go for such a restructuring - just one sheet per glottocode - then we should also limit the set of columns to a fixed known list. Otherwise merging will be harder and make the not-exactly defined columns even less transparent.
(In reply to Hedvig Skirgård's comment of Tue, 14 July 2020, 12:31.)
@xrotwang Sure. For now, I'm only considering the immediate urgent conflict matter and the sheets concerned there. I'd be willing to do that separately from the rest if we need more time to discuss an overall restructuring.
I'm counting 4935 conflicting codings for 154 different languoids. With so many (>300) affected sheets, I'd say we either restructure the whole repos now or nothing.
Excluding the sheets of the Hunter-Gatherer database we're at 107 languoids and 3733 conflicting datapoints.
Overall restructuring sounds good to me. If we three can agree on a precise course of action today or tomorrow morning, I can get the coders started very soon.
Not sure what has happened, but the way I'm checking for duplicates comes up with 3,640 data points excluding auto translated ones.
Ok, here's a proposal:
original_sheets will have only one sheet per glottocode, named abcd1234.tsv, with columns: Feature_ID,Value,Source,Comment,Contributed_datapoints,Selected where the Selected column is a flag signaling which of conflicting values should be picked up in the CLDF.
legacy_data.tsv with columns Sheet,Column,Value
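To make the role of the Selected flag concrete, here is a small sketch of how a CLDF export step might pick values from such a sheet. The function name `pick_values`, the rule that any non-empty Selected cell counts as "selected", and the example rows are all assumptions for illustration; the real dumping code may work differently.

```python
import csv
import io
from collections import defaultdict


def pick_values(tsv_text):
    """Return ({feature_id: value}, [unresolved feature ids]) for one sheet.

    A feature with a single row is taken as-is. A feature with several rows
    is taken only if exactly one row has a non-empty Selected cell
    (assumed convention); otherwise it is reported as unresolved.
    """
    rows_by_feature = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        rows_by_feature[row["Feature_ID"]].append(row)

    picked, unresolved = {}, []
    for feature_id, rows in rows_by_feature.items():
        if len(rows) > 1:
            rows = [r for r in rows if r["Selected"].strip()]
        if len(rows) == 1:
            picked[feature_id] = rows[0]["Value"]
        else:
            unresolved.append(feature_id)
    return picked, unresolved


sheet = "\n".join([
    "Feature_ID\tValue\tSource\tComment\tContributed_datapoints\tSelected",
    "GB022\t1\t\t\tHedvig, Robert, Simon\t",
    "GB501\t1\t\t\tHedvig\tx",
    "GB501\t0\t\t\tSimon\t",
    "GB502\t1\t\t\tHedvig\t",
    "GB502\t0\t\t\tRobert\t",
])
picked, unresolved = pick_values(sheet)
print(picked, unresolved)
```

In this example GB501 is resolved through the flag, while GB502 stays out of the export until a checker marks one of its rows.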
@HedvigS I'm checking for conflicts only - not duplicates. Also I'm ignoring empty values.
@xrotwang Love it! So, the coders checking conflicts would either type in something in the "selected" col or make a new row and type in something in the "selected" col (in case neither of existing values is correct)
Okay, as soon as this is also agreed on by @SimonGreenhill I'll start notifying people.
The more I think of this suggestion, Robert, the more I like it :D !
@HedvigS Note that the list of allowed columns in this proposal is rather short - so there may not be a lot of context available in the sheets to work from (e.g. no feature title or feature description).
Rather than simply taking the 'best' sheet into grambank-cldf, we should merge in non-conflicting work sheets too. Any data points that conflict should be logged somewhere so they can be checked (but still left out).
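As a rough illustration of that merge step, the sketch below keeps datapoints on which all sheets agree and sets conflicting ones aside for checking. The function name `merge_non_conflicting` and the `(feature_id, value, coder)` row format are hypothetical, and skipping empty values is an assumption.

```python
def merge_non_conflicting(rows):
    """Merge codings from all sheets of one glottocode.

    `rows` is an iterable of (feature_id, value, coder) tuples (hypothetical
    format). Returns (merged, conflicts): `merged` maps a feature to its
    agreed value and contributing coders, `conflicts` maps a feature to the
    disagreeing codings, which are left out of the merge.
    """
    grouped = {}
    for feature_id, value, coder in rows:
        if value == "":  # assumption: empty codings carry no information
            continue
        grouped.setdefault(feature_id, []).append((value, coder))

    merged, conflicts = {}, {}
    for feature_id, codings in grouped.items():
        if len({value for value, _ in codings}) == 1:
            merged[feature_id] = (codings[0][0], [coder for _, coder in codings])
        else:
            conflicts[feature_id] = codings
    return merged, conflicts


rows = [
    ("GB022", "1", "Hedvig"),
    ("GB022", "1", "Simon"),
    ("GB501", "1", "Hedvig"),
    ("GB501", "0", "Simon"),
    ("GB030", "", "Robert"),
    ("GB030", "0", "Simon"),
]
merged, conflicts = merge_non_conflicting(rows)
print(merged)
print(conflicts)
```

The `conflicts` dict is what would be written to a log for the checkers, while only `merged` would go into the release.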