Closed GoogleCodeExporter closed 9 years ago
Even though I can't change the priority of this issue, I have to say it is
rather "HIGH".
Right now we simply do not know which articles can be seen and which cannot.
We cannot tell which articles are missing from a query because they do not
exist, and which because they simply cannot be found (see the bug).
Hopefully the issue will be fixed soon, thanks!
Original comment by jakob-mo...@univie.ac.at
on 2 Oct 2013 at 7:52
Priority changed to high.
Original comment by carinaja...@nieuwsmonitor.net
on 2 Oct 2013 at 7:56
The article IDs of the articles that are not visible in the coding jobs overview
also don't show up when you search for them using an article ID query. They do,
however, show up when you export coding job results.
This is really annoying when we want to split articles; if possible, please fix
this soon.
Original comment by carinaja...@nieuwsmonitor.net
on 2 Oct 2013 at 9:26
The problem is diagnosed and we think we know how to proceed.
The cause is the deduplicator. AmCAT more or less automatically deduplicates
article sets, and this is done on meta-info (headline, date, medium) only.
Especially for split articles, this caused problems, as articles were removed
after being split. So, even though the split articles were coded (and hence
found in the exported codings), they were no longer in the correct set.
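To illustrate the difference, here is a minimal Python sketch of deduplicating on meta-info only versus also comparing the full text. This is not the actual AmCAT code: the `Article` tuple, the `deduplicate` function and the example data are all made up for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for an AmCAT article; not the real model.
Article = namedtuple("Article", "id headline date medium text")


def deduplicate(articles, strict=False):
    """Return (kept, removed); the first article seen for each key is kept.

    strict=False compares meta-info only (headline, date, medium).
    strict=True also compares the full text.
    """
    seen = set()
    kept, removed = [], []
    for article in articles:
        key = (article.headline, article.date, article.medium)
        if strict:
            key += (article.text,)
        if key in seen:
            removed.append(article)
        else:
            seen.add(key)
            kept.append(article)
    return kept, removed


# Two parts of a split article share their meta-info but not their text.
parts = [
    Article(1, "Example headline", "2013-09-28", "Example Daily", "First part."),
    Article(2, "Example headline", "2013-09-28", "Example Daily", "Second part."),
]
_, removed_loose = deduplicate(parts)
_, removed_strict = deduplicate(parts, strict=True)
```

With meta-info only, the second part of the split article is treated as a duplicate and removed; with the text included in the comparison, both parts are kept.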
As a solution, we (= Martijn) will do three things, hopefully all finished
tonight:
1) the deduplicator will check the full text before actually removing articles.
This should prevent the problem from occurring again.
2) all coded articles that do not belong to the set of their coding job will be
re-added. This will cause coded articles to be found again.
3) all articles that no longer belong to any set ("orphans") in AUTNES will be
added to a new set, where we will try to differentiate between three sets of
articles:
- articles that are a real duplicate of an existing article
- articles that seem to be the original of split articles
- articles that are genuine orphans (i.e. that should not have been removed in
the first place)
Hopefully, 2 and 3 will make sure that all info you need will be available
again.
I am giving ownership of this issue to Martijn; he will update with progress on
these tasks.
Please comment if you think that this will not solve the issue, or if there is
anything else we need to be aware of!
Original comment by vanatteveldt@gmail.com
on 2 Oct 2013 at 3:40
Strict deduplicate-code has been added to ArticleSet:
http://code.google.com/p/amcat/source/detail?r=7da23d889f21dc7c99253d3a7a06a4aaca0f1ae1
I will integrate this code on production tomorrow.
Original comment by Martijn....@gmail.com
on 2 Oct 2013 at 10:37
Deduplication code is now available in production, so this issue should not
reoccur. The sets have yet to be made.
Original comment by Martijn....@gmail.com
on 3 Oct 2013 at 10:24
Can the orphans be placed back into the autnes project soon? We need them for
the data analysis. Thanks!
Original comment by carinaja...@nieuwsmonitor.net
on 16 Oct 2013 at 1:29
I've found 935 articles with codings that were removed by the deduplicator,
using the following algorithm:
    wrongly_deduplicated = set()
    for coding in Coding.objects.filter(codingjob__project__id=50):
        # The coded article is no longer in the set of its coding job
        if coding.article not in coding.codingjob.articleset.articles.all():
            wrongly_deduplicated.add(coding)
Discovered coding ids: https://gist.github.com/anonymous/e3bb859e8fb8ed5ef16f.
The articles belonging to these codings have been re-added to their articlesets.
The articles should be accessible from Annotator once again.
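For reference, the same check can be sketched without the Django ORM, using plain Python data. This is only an illustration under assumptions: the function name, the `(coding_id, article_id, articleset_id)` tuples and the example IDs are hypothetical and do not come from the AmCAT database.

```python
def find_wrongly_deduplicated(codings, set_members):
    """Return ids of codings whose article is missing from its job's set.

    codings: iterable of (coding_id, article_id, articleset_id) tuples.
    set_members: dict mapping articleset_id -> set of article ids.
    """
    return {
        coding_id
        for coding_id, article_id, articleset_id in codings
        if article_id not in set_members.get(articleset_id, set())
    }


# Made-up example: article 102 was removed from set 7 after being coded.
codings = [(1, 101, 7), (2, 102, 7), (3, 201, 8)]
set_members = {7: {101}, 8: {201}}
missing = find_wrongly_deduplicated(codings, set_members)
```

Loading the members of each set once and testing membership against a Python set avoids evaluating one queryset per coding, which may matter for larger projects.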
Original comment by Martijn....@gmail.com
on 19 Oct 2013 at 10:46
Hey, great, thanks! I have asked Jakob to check whether all the articles he
was missing are back. If I understood Wouter correctly, articles without
codings have also disappeared. Is that right, or were these all of the
articles?
Original comment by carinaja...@nieuwsmonitor.net
on 19 Oct 2013 at 11:06
Indeed, articles without codings were removed as well. We are now trying to
restore them, but there appear to be 1463632 orphan articles, so we are still
looking for a solution.
Original comment by Martijn....@gmail.com
on 19 Oct 2013 at 11:44
OK, the results are in. There were one and a half million articles that did not
belong to any set, but had at some point been in the project. We compared
these articles with all articles that were still in the project.
1. Articles that matched exactly (text, headline, medium, date) have been placed
in articleset 5470[1].
2. Articles with a metadata match (headline, medium, date) have been placed in
articleset 5471[2].
3. All remaining articles are in articleset 5469[3].
Because there were many articles of length zero, we decided to filter on this
property and delete them right away. Carina: could you pass this on to
Jakob?
If any further action needs to be taken, please let me know.
[1] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5470
[2] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5471
[3] http://amcat-production.labs.vu.nl/navigator/project/50/articleset/5469
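The three-way split described above can be sketched roughly as follows. This is a simplified illustration, not the script that was actually run: the `Article` tuple, the `classify_orphans` function, the group names and the example data are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for an AmCAT article; not the real model.
Article = namedtuple("Article", "id headline medium date text")


def classify_orphans(orphans, in_project):
    """Group orphan article ids by how they match articles still in the project."""
    exact_keys = {(a.text, a.headline, a.medium, a.date) for a in in_project}
    meta_keys = {(a.headline, a.medium, a.date) for a in in_project}
    groups = {"exact": [], "metadata": [], "other": [], "empty": []}
    for a in orphans:
        if not a.text:
            # Zero-length articles are filtered out before any comparison
            groups["empty"].append(a.id)
        elif (a.text, a.headline, a.medium, a.date) in exact_keys:
            groups["exact"].append(a.id)
        elif (a.headline, a.medium, a.date) in meta_keys:
            groups["metadata"].append(a.id)
        else:
            groups["other"].append(a.id)
    return groups


# Made-up example data
in_project = [Article(1, "Headline A", "Example Daily", "2013-09-01", "Full text A")]
orphans = [
    Article(10, "Headline A", "Example Daily", "2013-09-01", "Full text A"),
    Article(11, "Headline A", "Example Daily", "2013-09-01", "Only part of A"),
    Article(12, "Headline B", "Example Daily", "2013-09-02", "Full text B"),
    Article(13, "Headline C", "Example Daily", "2013-09-03", ""),
]
groups = classify_orphans(orphans, in_project)
```

In this sketch, "exact" corresponds to articleset 5470, "metadata" to 5471 and "other" to 5469, while "empty" holds the zero-length articles that were deleted.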
Original comment by Martijn....@gmail.com
on 19 Oct 2013 at 6:47
O_O one and a half million... But thanks a lot, that division into 3 articlesets
is really handy! I'll pass it on to Jakob and the rest.
Original comment by carinaja...@nieuwsmonitor.net
on 20 Oct 2013 at 10:58
Original issue reported on code.google.com by
carinaja...@nieuwsmonitor.net
on 28 Sep 2013 at 6:55