ivanhercaz / CanaryBot

CanaryBot works in the Wikimedia projects mainly in maintenance tasks!

Remove fullstop automatically when it finds the same description #10

Open ivanhercaz opened 5 years ago

ivanhercaz commented 5 years ago

The idea is to save the time spent confirming the removal of a full stop from a description identical to one whose full stop has already been approved for removal. For example:

This new action would save time and make the script more efficient, but it should not be used for every description, because that would overload the system for nothing. It should be reserved for a specific kind of description, like the one mentioned.

How would it work

The script would have a new action: Remove identical full stops automatically. This action adds the description to a CSV file, created beforehand with a single column named sentence. Before adding it, the script has to check whether the description is already in the CSV: if it is, nothing is added; if it isn't, the description is appended.

The descriptions saved in this CSV would be useful on later runs, because the script would read the file, which stores both the previously marked descriptions and the new ones, to find matching descriptions and remove their full stops automatically.
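As a rough illustration of the check-then-append behaviour described above, here is a minimal sketch using Python's standard `csv` module. Only the column name `sentence` comes from this issue; the function names `load_sentences` and `add_sentence` and the file handling are assumptions, not CanaryBot's actual code.

```python
import csv
from pathlib import Path

FIELD = "sentence"  # column name proposed in this issue


def load_sentences(path):
    """Return the set of descriptions already stored in the CSV (hypothetical helper)."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[FIELD] for row in csv.DictReader(handle) if row.get(FIELD)}


def add_sentence(path, description, known):
    """Append the description unless it is already known.

    Returns True when a new row was written, False otherwise.
    """
    if description in known:
        return False
    path = Path(path)
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[FIELD])
        if write_header:
            writer.writeheader()
        writer.writerow({FIELD: description})
    known.add(description)
    return True
```

Loading the file into a set once per run keeps the "is it already there?" check cheap, and the same set can be reused to match descriptions during that run.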

Tasks

ivanhercaz commented 5 years ago

But... how useful is it to find and fix only the exact description? It might be more effective if the system added to the CSV the last word together with its full stop. With that in mind, this action could be folded into the first option ("Remove full stop"), which avoids creating another option. With the current approach, the script would take the following steps:

Grade II listed building in Newport City. Located approximately 40 metres SW of Pound-wern Cottage. Bridge carries footpath connecting Ridgeway with the canal towpath.

  1. Remove duplicates automatically.
  2. Delete the full stop in the current item and add the description to the CSV file.
  3. If it finds the exact description again, it will delete the full stop.

This way, the script won't remove the full stop if anything differs from the stored description. But if we save the last word rather than the whole description, the script would:

  1. Remove full stop.
  2. Delete the full stop in the current item and add the last word, towpath, to the CSV file.
  3. If the script finds any other description that ends with towpath., it will delete the full stop.
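The last-word variant of the steps above could be sketched roughly as follows. The helper names and the idea of keeping the stored endings in a set are assumptions made for illustration; they are not taken from CanaryBot itself.

```python
def last_word_with_full_stop(description):
    """Return the final word including its full stop, or None if there is none."""
    words = description.split()
    if not words or not words[-1].endswith("."):
        return None
    return words[-1]


def strip_known_full_stop(description, known_endings):
    """Remove the trailing full stop when the last word is a stored ending."""
    ending = last_word_with_full_stop(description)
    if ending is not None and ending in known_endings:
        # The stripped text ends with ".", so dropping one character removes it.
        return description.rstrip()[:-1]
    return description
```

With `known_endings = {"towpath."}`, any description ending in `towpath.` loses its full stop, while one ending in an unknown word is returned unchanged and would still need the operator's confirmation.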

This saves time because:

  1. We don't have to think about more options (remove, checklist, edit, skip). The script would add the last word as part of the basic instruction to remove the full stop.
  2. The script would become more "intelligent" the more we use it, because it would accumulate all the last words with full stops that we consider necessary to remove. In addition, the script would still skip, as it does now, the kinds of words/full stops covered by the exceptions list.

Thus, over time it would work with less and less human intervention.

Of course, there are other things to keep in mind:


@davidabian, I know this whole issue is very long, especially this last comment, but I would like to know your opinion on reformulating the system to save time (and make CanaryBot more "intelligent" :bird: ). Thanks in advance!

davidabian commented 5 years ago

This is more of a linguistic issue, and I can't predict what the results will be in the languages I don't know deeply; I guess in some languages this will cause too many false positives for it to be worth considering, while in Spanish or English it can work (but only if the bot is operated carefully, since a single mistake by the bot operator could spread to several unrelated entities).

ivanhercaz commented 5 years ago

@davidabian, at the moment every full stop has to be confirmed before it is removed from a description. The criterion for removing it is being sure that it isn't part of an abbreviation or anything else that needs the full stop. In addition, there are the exceptions lists, with which the script skips descriptions that match the pattern of any exception.
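To make the exceptions idea concrete, here is a minimal sketch of a regex-based check that runs before any full stop is removed. The patterns below are hypothetical examples; the real exceptions lists live in CanaryBot and are not shown in this thread.

```python
import re

# Hypothetical exception patterns, for illustration only.
EXCEPTIONS = [
    re.compile(r"\b(?:etc|ca|Inc|Ltd)\.$"),  # a few common abbreviations
    re.compile(r"\b(?:[A-Z]\.)+$"),          # initialisms such as "U.S.A."
]


def is_exception(description):
    """Return True when the trailing full stop matches an exception pattern."""
    text = description.rstrip()
    return any(pattern.search(text) for pattern in EXCEPTIONS)
```

A description for which `is_exception` returns True would be skipped, so a stored last word could never override an exception. The second pattern is deliberately conservative: it also skips endings such as a single capital letter followed by a full stop.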

What kind of human mistakes do you mean? Marking a full stop for removal when it is part of an abbreviation, so that the script would then remove it automatically elsewhere? If that is the kind of mistake you mean, then of course the operator needs to be sure of what they are doing.

As for languages, I follow the same rule as everywhere else: is the full stop necessary or not? Of course, if the operator has a doubt in any of the languages the script works in, the operator has the option "Add description to checklist". The operator can then review the checklist and ask on Wikidata which option is the right one or, if the full stop is part of an abbreviation, add another regex to the exceptions list.