monarch-initiative / phenopacket-store

Collection of phenopackets
https://monarch-initiative.github.io/phenopacket-store/
BSD 3-Clause "New" or "Revised" License

Phenopacket duplicates #86

Closed yaseminbridges closed 2 months ago

yaseminbridges commented 3 months ago

I downloaded all_phenopackets.zip from the 0.1.1 release, and it appears to contain some duplicate phenopackets.

In the COL3A1 directory, I checked the contents of phenopackets with similar names, and they look almost identical: the phenotypic profiles and the genomic information are the same, but one version includes the age data and the other does not. For example, PMID_36189931_35.json and PMID_36189931_Individual35.json have the same content apart from age.

Another example was in the ANKH directory where there are phenopackets named PMID_22647861_41-year-old_woman.json and PMID_22647861_41-year-oldwoman.json. PMID_22647861_41-year-old_woman.json only contains subject information while PMID_22647861_41-year-oldwoman.json contains all information (phenotypic profile & interpretations).

There may be more occurrences as I haven't checked all the separate directories.
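To check the remaining directories for the COL3A1-style case (same phenotypic profile, differing only in age), something like the following could be used. It is a minimal sketch, not code from this repository: the function names `feature_fingerprint` and `find_content_duplicates` are made up for illustration, and it assumes the files follow the Phenopacket Schema JSON layout with a top-level `phenotypicFeatures` list whose entries carry `type.id` and an optional `excluded` flag. Ages, ids, and interpretations are deliberately ignored, so the output is only a list of candidates for manual review.

```python
import json
from collections import defaultdict
from pathlib import Path


def feature_fingerprint(phenopacket: dict) -> frozenset:
    """Return the set of (term id, excluded) pairs, ignoring ages and ids."""
    return frozenset(
        (feature.get("type", {}).get("id"), bool(feature.get("excluded", False)))
        for feature in phenopacket.get("phenotypicFeatures", [])
    )


def find_content_duplicates(directory: str) -> list[list[str]]:
    """Group JSON files in one directory that share a non-empty fingerprint."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            fingerprint = feature_fingerprint(json.load(handle))
        if fingerprint:  # packets without phenotypic features are skipped
            groups[fingerprint].append(path.name)
    # Keep only fingerprints shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_content_duplicates("COL3A1")` on an unpacked release directory would return groups of file names to inspect by hand. Two different individuals can legitimately share a profile, and a packet that only contains subject information (as in the ANKH example) will not be caught this way.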

pnrobinson commented 3 months ago

@yaseminbridges thanks for pointing this out. This is probably related to a change in the way the individual name was generated from the input, as you have surmised above. I will check this!

pnrobinson commented 3 months ago

I fixed ANKH. The problem was that we generate file names automatically from the phenopacket id and PMID, and I think the pattern was changed at some point in the project. I am adding some QC code, but we will need to check this in a more rigorous way in the future. I will leave this open until we have a general fix.
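For illustration, a file-name check of this kind might look roughly like the sketch below. It is not the QC code mentioned above: `normalized_key` and `find_name_collisions` are hypothetical names, and the normalization rules (lower-casing, dropping separators, dropping the word "individual") are assumptions derived only from the two examples reported in this issue.

```python
import re
from collections import defaultdict
from pathlib import Path


def normalized_key(filename: str) -> str:
    """Collapse naming variants: drop case, separators, and the word 'individual'."""
    stem = Path(filename).stem.lower()
    stem = re.sub(r"[^a-z0-9]", "", stem)  # remove underscores, hyphens, etc.
    return stem.replace("individual", "")


def find_name_collisions(filenames) -> list[list[str]]:
    """Return groups of file names that normalize to the same key."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalized_key(name)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Passing the file names of one gene directory to `find_name_collisions` would flag pairs such as PMID_36189931_35.json and PMID_36189931_Individual35.json. Because all separators are removed, unrelated names can occasionally collide, so each reported group still needs a manual look.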

pnrobinson commented 2 months ago

I think I have removed all of the duplicates and I added a function to check for possible duplicates to streamline checking in the future.