ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

MedDRA: significant source file size decrease after 2020AB release #239

Open jvendetti opened 2 years ago

jvendetti commented 2 years ago

I was contacted by an end user who reported a significant decrease in the size of the MedDRA ontology source file between releases. I looked at the file size for all submissions and noticed a generally steady increase through the 2020AB release, after which there's a significant drop:

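| UMLS release | BioPortal submission ID | File size (in MB) |
| --- | --- | --- |
| 2021AB | 19 | 62 |
| 2021AA | 18 | 61.4 |
| 2020AB | 17 | 246.1 |
| 2020AA | 16 | 237.9 |
| 2019AB | 15 | 232.8 |
| 2019AA | 14 | 227.1 |
| 2018AB | 13 | 222.2 |
| 2018AA | 12 | 216.5 |
| 2017AB | 11 | 211.8 |
| 2017AA | 10 | 207.1 |
| 2016AB | 9 | 202.3 |
| 2016AA | 8 | 196 |
| 2015AB | 7 | 191.6 |
| 2015AA | 6 | 181.6 |
| 2014AB | 5 | 177.4 |
| 2014AA | 4 | 173.8 |
| 2013AB | 3 | 164.1 |
| 2013AA | 2 | 166.4 |
| 2012AB | 1 | 166.4 |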
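
To put a number on the drop, here is a minimal sketch that computes the release-to-release change for the last four rows of the table above. The sizes are copied by hand from the table; the function name `size_changes` is only illustrative.

```shell
#!/bin/sh
# Sizes in MB copied from the table above, oldest of the four releases first.
# size_changes is a hypothetical helper name used only for this sketch.
size_changes() {
  printf '%s\n' '2020AA 237.9' '2020AB 246.1' '2021AA 61.4' '2021AB 62' |
  awk 'NR > 1 { printf "%s -> %s: %.1f%%\n", prev_rel, $1, ($2 - prev) / prev * 100 }
       { prev_rel = $1; prev = $2 }'
}

size_changes
```

On these figures the 2020AB to 2021AA step is a decrease of roughly 75%, while the neighboring steps are small increases.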

I looked at the UMLS MDR statistics page for the 2020AB release, which reports 71,603 lower level terms (which I assume are classes). The statistics page for the latest 2021AB release reports 73,991 lower level terms. There's nothing obvious in the UMLS documentation that would explain the file size decrease, considering that the number of terms went up.

The BioPortal REST API reports that we're serving 76,447 classes (https://data.bioontology.org/ontologies/MEDDRA/metrics).

alexskr commented 2 years ago

The only obvious difference I see is that terms in the 2021AB TTL file do not have any "SIB" relationships, unlike in previous versions.

MEDDRA]$ wc -l  16/MEDDRA.ttl
2594640 16/MEDDRA.ttl
MEDDRA]$ wc -l  19/MEDDRA.ttl
978517 19/MEDDRA.ttl
MEDDRA]$ grep SIB 16/MEDDRA.ttl | wc -l
1659383
MEDDRA]$ grep SIB 19/MEDDRA.ttl | wc -l
0

graybeal commented 2 years ago

Remove the SIB lines from the 16 version and you're within 4.4% (43K lines) of the 19 version. Pretty suspicious, allowing for natural growth.
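
For reference, the arithmetic behind that estimate can be reproduced with a short sketch. The line counts are copied from the `wc` and `grep` output earlier in the thread; the variable names are only illustrative.

```shell
#!/bin/sh
# Line counts copied from the wc/grep output earlier in the thread.
total_16=2594640   # lines in 16/MEDDRA.ttl
sib_16=1659383     # lines matching SIB in 16/MEDDRA.ttl
total_19=978517    # lines in 19/MEDDRA.ttl (no SIB lines)

non_sib_16=$((total_16 - sib_16))   # submission 16 with SIB lines removed
diff=$((total_19 - non_sib_16))     # remaining gap to submission 19

echo "non-SIB lines in 16: $non_sib_16"
echo "difference vs 19: $diff"
awk -v d="$diff" -v t="$total_19" 'BEGIN { printf "share of 19: %.1f%%\n", d / t * 100 }'
```

This gives 935,257 non-SIB lines in submission 16 and a gap of 43,260 lines, about 4.4% of submission 19. With access to the files, `grep -vc SIB 16/MEDDRA.ttl` should yield the non-SIB count directly.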