nfdi4plants / Swate-templates

A collection of templates for Swate (https://github.com/nfdi4plants/Swate).

[BUG] Found multiple iterations of the same tags in Swate #105

Closed Freymaurer closed 6 months ago

Freymaurer commented 7 months ago

This image led me to open this issue:

image

In this image you can see 3 different PRIDE tags: one as Tag, two as ER_Tag. One of the ER_Tags has an ID, the other does not.

To clean up these things I ran some very simple analytics (results below). It would be nice if someone could clean this up 😄

Found ambiguous tag growth in:

Code

#r "nuget: ARCtrl, 1.0.7"
//.fsx file

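// Fetch all available templates.
let templates = 
  ARCtrl.Template.Web.getTemplates None |> Async.RunSynchronously

// Collect every distinct tag (ontology annotation) used across the templates.
let distinctTags = ARCtrl.Template.Templates.getDistinctOntologyAnnotations (templates)
distinctTags.Length // 110

// Group the tags by display name.
let groupedByName = distinctTags |> Array.groupBy (fun oa -> oa.NameText)
groupedByName.Length // 104

// A name that maps to more than one distinct tag is ambiguous.
let ambiguousTags = groupedByName |> Array.filter (fun (name, c) -> c.Length > 1)
ambiguousTags.Length // 6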

for (name,tags) in ambiguousTags do
  let temps = ARCtrl.Template.Templates.filterByOntologyAnnotation (tags) templates
  printfn "## Found ambiguous tag `%s` in:" name
  for template in temps do 
    let authors = 
      template.Authors 
      |> Array.map (fun a -> 
        let names = [|a.FirstName; a.MidInitials; a.LastName|] |> Array.map (fun n -> Option.defaultValue "" n)
        sprintf "%s %s %s" names.[0] names.[1] names.[2]
      ) 
      |> String.concat ", "
    printfn "- **%s** by (*%s*)" (template.Name.Trim()) (authors.Trim())

StellaEggels commented 7 months ago

So the solution is adding an accession number to every tag? Another issue is tags that are not identical but similar, e.g. there is Plant, plant and Plants. I think adding accession numbers could also help here. I had been planning to discuss this in our upcoming meeting.

image

(For me the same tags are shown for ER and normal tags, but I guess that has been fixed already.)
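To illustrate the Plant / plant / Plants case, here is a minimal sketch that groups tag names after a crude normalization. It works on plain strings rather than ARCtrl's tag type, and the `normalize` rule (trim, lower-case, drop one trailing "s") is only an assumption for illustration, not how Swate or ARCtrl compare tags.

```fsharp
// Hypothetical normalization: trim, lower-case, and strip a single trailing "s".
// This plural rule is deliberately naive and only meant as an illustration.
let normalize (name: string) =
    let n = name.Trim().ToLowerInvariant()
    if n.Length > 3 && n.EndsWith("s", System.StringComparison.Ordinal) then
        n.Substring(0, n.Length - 1)
    else
        n

// Group names by their normalized form and keep groups with several spellings.
let nearDuplicates (names: string list) =
    names
    |> List.groupBy normalize
    |> List.map (fun (key, group) -> key, List.distinct group)
    |> List.filter (fun (_, spellings) -> spellings.Length > 1)

let exampleNames = [ "Plant"; "plant"; "Plants"; "PRIDE" ]

for (key, spellings) in nearDuplicates exampleNames do
    printfn "%s: %s" key (String.concat ", " spellings)
```

A check like this only finds spelling variants. Tags that share a name but differ in accession number still need the grouping from the script above.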

Freymaurer commented 7 months ago

Another issue is tags that are not identical but similar. E.g. there is Plant, plant and Plants

I think this is a valid point. I am thinking about adding a quality control CI for pull requests which runs the code I used for my two issues today, plus a similarity test for similar words. Then, before merging any PR, one could see whether these points are handled somewhat correctly.

What do you think about this? It would add another test to this:

image

StellaEggels commented 7 months ago

Sounds good to me

StellaEggels commented 7 months ago

I will start adding tag term accession numbers to the ambiguous terms from your check.

Freymaurer commented 7 months ago

The first iteration of fixes went through, so I am going to update the current state here. Please note that we now also test for similar tags. If you find a combination to be a true difference (which can be very likely), please notify me below, so I can either increase the similarity threshold or whitelist that specific combination. The current similarity threshold is 0.8.

Edit: I will try to improve the script so the output is less split.
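For readers wondering how a threshold of 0.8 and a whitelist can interact, here is a minimal sketch. The thread does not say which similarity metric the check uses, so the normalized Levenshtein similarity below is an assumption, and `levenshtein`, `similarity`, `orderedPair` and `similarPairs` are hypothetical helper names, not part of ARCtrl.

```fsharp
// Levenshtein edit distance, computed row by row.
let levenshtein (a: string) (b: string) =
    let mutable prev = Array.init (b.Length + 1) id
    for i in 1 .. a.Length do
        let curr : int[] = Array.zeroCreate (b.Length + 1)
        curr.[0] <- i
        for j in 1 .. b.Length do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j - 1] + 1)) (prev.[j - 1] + cost)
        prev <- curr
    prev.[b.Length]

// Assumed metric: 1 - distance / longer length, compared case-insensitively.
let similarity (a: string) (b: string) =
    let a, b = a.ToLowerInvariant(), b.ToLowerInvariant()
    let longest = max a.Length b.Length
    if longest = 0 then 1.0
    else 1.0 - float (levenshtein a b) / float longest

// Store pairs in a fixed order so whitelist lookups do not depend on argument order.
let orderedPair (a: string) (b: string) = if a <= b then (a, b) else (b, a)

// Report every pair of distinct names at or above the threshold, unless whitelisted.
let similarPairs (threshold: float) (whitelist: Set<string * string>) (names: string list) =
    [ for (i, a) in List.indexed names do
        for b in List.skip (i + 1) names do
            let pair = orderedPair a b
            if a <> b && similarity a b >= threshold && not (whitelist.Contains pair) then
                yield pair ]

let tagNames = [ "Plant"; "plant"; "Plants"; "PRIDE" ]
let allowed = Set.ofList [ orderedPair "Plant" "Plants" ]

for (a, b) in similarPairs 0.8 allowed tagNames do
    printfn "Similar tags: %s / %s" a b
```

Raising the threshold reduces false positives for all tags at once, while a whitelist entry silences a single known-good pair, which is why the two options are offered as alternatives.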

Found similiar tags for plant growth protocol in: