codingjob overview does not list all articles included

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Go to the codingjobs overview (e.g. 
http://amcat.vu.nl/navigator/project/50/codingjobs).
2. The n_articles column is supposed to show the number of articles in each 
codingjob, but this number is not correct. Sometimes n_codings_done is higher 
than n_articles, for example in this codingjob: 
http://amcat.vu.nl/navigator/project/50/codingjob/1551, where n_articles is 16 
and n_codings_done is 23.
3. When I export the results of this codingjob, it turns out that there are 24 
articles in the codingjob. So why does n_articles say 16, and why does the 
'articles included' table on this screen 
(http://amcat.vu.nl/navigator/project/50/codingjob/1551) list only 16 articles?

Original issue reported on code.google.com by carinaja...@nieuwsmonitor.net on 28 Sep 2013 at 6:55

GoogleCodeExporter commented 9 years ago

Even though I can't change the priority of this issue, I have to say it is 
rather "HIGH".
Right now we simply do not know which articles can be seen and which cannot.
Nor can we tell which articles are missing from a query because they do not 
exist and which because they simply cannot be found (see the bug).

Hopefully the issue will be fixed soon, thanks!

Original comment by jakob-mo...@univie.ac.at on 2 Oct 2013 at 7:52

GoogleCodeExporter commented 9 years ago

priority changed to high.

Original comment by carinaja...@nieuwsmonitor.net on 2 Oct 2013 at 7:56

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

The article IDs of the articles that are not visible in the codingjobs overview 
also don't show up when you search for them using an article ID query. They do, 
however, show up when you export coding job results.

This is really annoying when we want to split articles. If possible, please fix 
this soon.

Original comment by carinaja...@nieuwsmonitor.net on 2 Oct 2013 at 9:26

GoogleCodeExporter commented 9 years ago

The problem is diagnosed and we think we know how to proceed.

The cause is the deduplicator. AmCAT more or less automatically deduplicates 
article sets, and this is done on meta-info (headline, date, medium) only. 
This caused problems especially for split articles, as articles were removed 
after being split. So, even though the split articles were coded (and hence 
found in the exported codings), they were no longer in the correct set.
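
To illustrate the cause, here is a minimal, self-contained sketch of deduplication on metadata only, with a text-aware variant for contrast. It is not AmCAT's actual code: the `Article` class, the `meta_key`, `strict_key` and `deduplicate` helpers, and the sample data are all made up for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    # Hypothetical stand-in for an AmCAT article, not the real model.
    id: int
    headline: str
    date: date
    medium: str
    text: str


def meta_key(a):
    # Metadata-only key: parts of a split article share this key.
    return (a.headline, a.date, a.medium)


def strict_key(a):
    # Metadata plus full text: only true duplicates share this key.
    return (a.headline, a.date, a.medium, a.text)


def deduplicate(articles, key):
    """Keep the first article seen for each key; return (kept, removed)."""
    seen = set()
    kept, removed = [], []
    for a in articles:
        k = key(a)
        if k in seen:
            removed.append(a)
        else:
            seen.add(k)
            kept.append(a)
    return kept, removed


day = date(2013, 9, 28)
parts = [
    Article(1, "Election debate", day, "Example Daily", "First half of the text."),
    Article(2, "Election debate", day, "Example Daily", "Second half of the text."),
    Article(3, "Election debate", day, "Example Daily", "First half of the text."),
]

_, removed_meta = deduplicate(parts, meta_key)
_, removed_strict = deduplicate(parts, strict_key)
print([a.id for a in removed_meta])    # [2, 3]
print([a.id for a in removed_strict])  # [3]
```

With the metadata-only key, article 2 (the second part of a split article) is dropped together with the true duplicate, article 3. With the text-aware key, only article 3 is dropped.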

As a solution, we (=martijn) will do three things, hopefully all finished 
tonight:

1) the deduplicator will check the full text before actually removing articles. 
This should prevent the problem from occurring again.

2) all coded articles that do not belong to the set of their coding job will be 
re-added. This will cause coded articles to be found again.

3) all articles that no longer belong to any set ("orphans") in AUTNES will be 
added to a new set, where we will try to differentiate between three sets of 
articles:

- articles that are a real duplicate of an existing article
- articles that seem to be the original of split articles
- articles that are genuine orphans (i.e. that should not have been removed in 
the first place)

Hopefully, 2 and 3 will make sure that all info you need will be available 
again. 

I am giving ownership of this issue to Martijn; he will update with progress on 
these tasks. 

Please comment if you think that this will not solve the issue, or if there is 
anything else we need to be aware of!

Original comment by vanatteveldt@gmail.com on 2 Oct 2013 at 3:40

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Strict deduplicate-code has been added to ArticleSet:

http://code.google.com/p/amcat/source/detail?r=7da23d889f21dc7c99253d3a7a06a4aaca0f1ae1

I will integrate this code on production tomorrow.

Original comment by Martijn....@gmail.com on 2 Oct 2013 at 10:37

GoogleCodeExporter commented 9 years ago

Deduplication code is now available in production, so this issue should not 
reoccur. The sets have yet to be made.

Original comment by Martijn....@gmail.com on 3 Oct 2013 at 10:24

GoogleCodeExporter commented 9 years ago

Can the orphans be placed back into the autnes project soon? We need them for 
the data analysis. Thanks!

Original comment by carinaja...@nieuwsmonitor.net on 16 Oct 2013 at 1:29

GoogleCodeExporter commented 9 years ago

I've found 935 coded articles that the deduplicator had removed from their 
article sets, using the following algorithm:

wrongly_deduplicated = set()
for coding in Coding.objects.filter(codingjob__project__id=50):
  # A coded article that is missing from its coding job's article set
  if coding.article not in coding.codingjob.articleset.articles.all():
    wrongly_deduplicated.add(coding)

Discovered coding ids: https://gist.github.com/anonymous/e3bb859e8fb8ed5ef16f. 
The articles belonging to these codings have been re-added to their article 
sets. The articles should be accessible from Annotator once again.
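
The find-and-re-add step can be sketched without a database, roughly as below. This is only an illustration using plain Python objects: `ArticleSet`, `CodingJob`, `Coding`, `find_wrongly_deduplicated` and `readd_articles` are hypothetical stand-ins written for this example, not AmCAT's Django models or the script that was actually run.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleSet:
    # Hypothetical stand-in: an article set is just a set of article ids.
    article_ids: set = field(default_factory=set)


@dataclass
class CodingJob:
    id: int
    articleset: ArticleSet


@dataclass
class Coding:
    id: int
    article_id: int
    codingjob: CodingJob


def find_wrongly_deduplicated(codings):
    """Return codings whose article is missing from the job's article set."""
    return [
        c for c in codings
        if c.article_id not in c.codingjob.articleset.article_ids
    ]


def readd_articles(codings):
    """Put missing coded articles back; return the ids that were restored."""
    restored = set()
    for c in find_wrongly_deduplicated(codings):
        c.codingjob.articleset.article_ids.add(c.article_id)
        restored.add(c.article_id)
    return restored


job = CodingJob(1, ArticleSet({10, 11}))
codings = [Coding(1, 10, job), Coding(2, 12, job), Coding(3, 12, job)]

restored = readd_articles(codings)
print(restored)                            # {12}
print(sorted(job.articleset.article_ids))  # [10, 11, 12]
```

Looking up each article id in a set keeps the check cheap per coding; in a Django setting the equivalent would presumably be a membership query per article set rather than loading all articles for every coding.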

Original comment by Martijn....@gmail.com on 19 Oct 2013 at 10:46

GoogleCodeExporter commented 9 years ago

Hey, great, thanks! I've asked Jakob to check whether all the articles he was 
missing are back. If I understood Wouter correctly, articles without codings 
have also disappeared. Is that right, or were these all the articles?

Original comment by carinaja...@nieuwsmonitor.net on 19 Oct 2013 at 11:06

GoogleCodeExporter commented 9 years ago

Indeed, articles without codings were removed as well. We are trying to restore 
them now, but there appear to be 1463632 orphan articles, so we are still 
looking for a solution.

Original comment by Martijn....@gmail.com on 19 Oct 2013 at 11:44

GoogleCodeExporter commented 9 years ago

OK, the results are in. There were one and a half million articles that did not 
belong to any set but had been in the project at some point. We compared these 
articles with all articles that were still in the project.

1. Articles that matched exactly (text, headline, medium, date) have been placed 
in articleset 5470[1].

2. Articles with a metadata match (headline, medium, date) have been placed in 
articleset 5471[2].

3. All remaining articles are in articleset 5469[3].

Because there were many articles of length zero, we decided to filter on this 
property and delete them right away. Carina: could you pass this on to Jakob?
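
The three-way split above can be sketched as follows. This is a minimal illustration, not the code that was run: the `Article` tuple, the `classify_orphans` function and the sample data are hypothetical, and handling zero-length articles before the comparison is an assumption about the order of the steps.

```python
from collections import namedtuple

# Hypothetical stand-in for an article; dates are plain strings here.
Article = namedtuple("Article", "id headline medium date text")


def classify_orphans(orphans, in_project):
    """Sort orphan article ids into exact / metadata / other / empty."""
    exact_keys = {(a.headline, a.medium, a.date, a.text) for a in in_project}
    meta_keys = {(a.headline, a.medium, a.date) for a in in_project}

    result = {"exact": [], "metadata": [], "other": [], "empty": []}
    for a in orphans:
        if not a.text:
            # Zero-length articles are set aside (assumed to happen first).
            result["empty"].append(a.id)
        elif (a.headline, a.medium, a.date, a.text) in exact_keys:
            result["exact"].append(a.id)
        elif (a.headline, a.medium, a.date) in meta_keys:
            result["metadata"].append(a.id)
        else:
            result["other"].append(a.id)
    return result


in_project = [
    Article(1, "Budget vote", "Example Daily", "2013-09-01", "Full text."),
]
orphans = [
    Article(2, "Budget vote", "Example Daily", "2013-09-01", "Full text."),
    Article(3, "Budget vote", "Example Daily", "2013-09-01", "Only the first part."),
    Article(4, "Other story", "Example Daily", "2013-09-02", "Something else."),
    Article(5, "Blank", "Example Daily", "2013-09-03", ""),
]

result = classify_orphans(orphans, in_project)
print(result)
# {'exact': [2], 'metadata': [3], 'other': [4], 'empty': [5]}
```

Building the two key sets once means each orphan needs only two set lookups, which matters when there are on the order of a million orphans.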

If any further action needs to be taken, please let me know.

[1] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5470
[2] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5471
[3] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5469

Original comment by Martijn....@gmail.com on 19 Oct 2013 at 6:47

Changed state: Done

GoogleCodeExporter commented 9 years ago

O_O one and a half million... But thanks a lot, that division into 3 articlesets
is really handy! I'll pass it on to Jakob and the rest.


Original comment by carinaja...@nieuwsmonitor.net on 20 Oct 2013 at 10:58

google-code-export / amcat

codingjob overview does not list all articles included #596