yaseminbridges closed this issue 6 months ago
@yaseminbridges thanks for pointing this out. This is probably related to a change in the way the individual name was generated from the input, as you have surmised above. I will check this!
I fixed ANKH. The problem was that we generate file names automatically from the phenopacket id and PMID, and I think the pattern was changed at some point in the project. I am adding some QC code, but we will need to check this in a more rigorous way in the future. I will leave this open until we have a general fix.
I think I have removed all of the duplicates and I added a function to check for possible duplicates to streamline checking in the future.
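For readers who want to run a similar check on their own copy of the release, here is a minimal sketch of one way to flag possible duplicates by file name. It is not the function added to the project. The names `normalize_stem` and `find_possible_duplicates` and the normalization rule (lowercase, drop the word "individual", strip everything that is not a letter or digit) are assumptions chosen only to match the examples reported in this issue.

```python
import re
from collections import defaultdict
from pathlib import Path


def normalize_stem(stem: str) -> str:
    """Reduce a file stem to a comparison key (hypothetical rule, not the project's)."""
    key = stem.lower()
    # Drop a literal "individual" label so "Individual35" matches "35".
    key = key.replace("individual", "")
    # Remove separators such as underscores, hyphens and spaces.
    return re.sub(r"[^a-z0-9]", "", key)


def find_possible_duplicates(root: str) -> dict[str, list[Path]]:
    """Group JSON files in the same directory whose normalized names collide."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.json")):
        groups[(path.parent, normalize_stem(path.stem))].append(path)
    # Keep only the groups that contain more than one file.
    return {
        f"{parent.name}/{key}": paths
        for (parent, key), paths in groups.items()
        if len(paths) > 1
    }
```

Because separators are stripped, unrelated names can occasionally collide, so the output is a list of candidates to inspect by hand rather than a list of confirmed duplicates.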
I have downloaded `all_phenopackets.zip` from the 0.1.1 release and it appears there may be some duplicate phenopackets. Looking at the `COL3A1` directory, I checked the contents of some of the phenopackets and they look almost identical, except that one version has the age data and the other does not; otherwise the phenotypic profiles and the genomic information are the same. This was for phenopackets with similar names, e.g., `PMID_36189931_35.json` and `PMID_36189931_Individual35.json` had mostly the same content (apart from age).

Another example was in the `ANKH` directory, where there are phenopackets named `PMID_22647861_41-year-old_woman.json` and `PMID_22647861_41-year-oldwoman.json`. `PMID_22647861_41-year-old_woman.json` only contains subject information, while `PMID_22647861_41-year-oldwoman.json` contains all information (phenotypic profile & interpretations). There may be more occurrences, as I haven't checked all the separate directories.
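A manual comparison like the one described above could also be scripted. The following is only a sketch: `diff_paths` and `compare_files` are hypothetical helper names, and the code treats the files as generic JSON without assuming anything about the phenopacket schema. It lists the paths at which two parsed documents differ, so a pair that differs only in an age field would yield a single path.

```python
import json
from pathlib import Path


def diff_paths(a, b, prefix=""):
    """Return the JSON paths at which two parsed documents differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            path = f"{prefix}/{key}"
            if key not in a or key not in b:
                # The key exists in only one of the two documents.
                diffs.append(path)
            else:
                diffs.extend(diff_paths(a[key], b[key], path))
        return diffs
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(diff_paths(x, y, f"{prefix}/{i}"))
        return diffs
    # Scalars, mismatched types, or lists of different length.
    return [] if a == b else [prefix or "/"]


def compare_files(first, second):
    """Load two JSON files and report where their contents differ."""
    a = json.loads(Path(first).read_text(encoding="utf-8"))
    b = json.loads(Path(second).read_text(encoding="utf-8"))
    return diff_paths(a, b)
```

An empty result means the two files are identical as JSON; lists are compared position by position, so reordered entries are reported as differences.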