opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Issues with some COSMIC references #3119

Closed d0choa closed 2 months ago

d0choa commented 9 months ago

@polrus and @mjfalaguera found some (apparently) incorrect publication references in COSMIC records.

cat cosmic.json | jq 'select(.targetFromSourceId=="ENSG00000146648" and .diseaseFromSourceMappedId=="EFO_0000571")'

In this example entry, two literature values, "2494" and "2530", are valid PubMed identifiers, but we don't think the corresponding papers contain any information related to the evidence. We think they are truncated identifiers instead. For example, "2494" may originally have referred to 24942490 | COSMIC paper.
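A minimal Python sketch of the same check, for readers without jq. It assumes the evidence file holds one JSON object per line, and the four-character cutoff for a "suspiciously short" PMID is an illustrative choice. The helper names (`matches`, `short_pmids`, `scan`), the sample records and the second Ensembl ID are all made up for the example.

```python
import json


def matches(record, target_id, disease_id):
    """True if the evidence record is for the given target/disease pair."""
    return (
        record.get("targetFromSourceId") == target_id
        and record.get("diseaseFromSourceMappedId") == disease_id
    )


def short_pmids(record, max_len=4):
    """Return (index, pmid) pairs for literature values of at most max_len characters."""
    literature = record.get("literature") or []
    return [(i, pmid) for i, pmid in enumerate(literature) if len(pmid) <= max_len]


def scan(lines, target_id, disease_id, max_len=4):
    """Collect short PMIDs from every matching record in an iterable of JSON lines."""
    hits = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if matches(record, target_id, disease_id):
            hits.extend(short_pmids(record, max_len))
    return hits


# Made-up sample input; the second record uses a hypothetical Ensembl ID.
sample_lines = [
    json.dumps({
        "targetFromSourceId": "ENSG00000146648",
        "diseaseFromSourceMappedId": "EFO_0000571",
        "literature": ["24575772", "2494", "17487277", "2530"],
    }),
    json.dumps({
        "targetFromSourceId": "ENSG00000000001",
        "diseaseFromSourceMappedId": "EFO_0000571",
        "literature": ["123"],
    }),
]

print(scan(sample_lines, "ENSG00000146648", "EFO_0000571"))
# prints [(1, '2494'), (3, '2530')]
```

To run it on a real file, pass an open file handle instead of `sample_lines`, for example `scan(open("cosmic.json"), ...)`.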

Could we:

  1. Diagnose the issue on the COSMIC side: is it a manual error affecting some records, or a programmatic error?
  2. Fix the reported issue.

Full record:

{
  "mutatedSamples": [
    {
      "functionalConsequenceId": "SO_0001539",
      "numberSamplesWithMutationType": 46,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001587",
      "numberSamplesWithMutationType": 6,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001583",
      "numberSamplesWithMutationType": 8087,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001825",
      "numberSamplesWithMutationType": 6860,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001059",
      "numberSamplesWithMutationType": 4711,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001589",
      "numberSamplesWithMutationType": 13,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    },
    {
      "functionalConsequenceId": "SO_0001605",
      "numberSamplesWithMutationType": 266,
      "numberMutatedSamples": 15819,
      "numberSamplesTested": 50794
    }
  ],
  "literature": [
    "24575772",
    "24362878",
    "23852459",
    "19887873",
    "29594878",
    "19376842",
    "22707299",
    "22982663",
    "26749488",
    "27794398",
    "23372947",
    "24457237",
    "19057270",
    "24926557",
    "30922580",
    "29308976",
    "18549475",
    "22108465",
    "22773041",
    "22523180",
    "30341016",
    "23937608",
    "27694386",
    "15958609",
    "19063875",
    "29506987",
    "16133419",
    "22943430",
    "23632273",
    "24158511",
    "17047654",
    "22333554",
    "25848356",
    "25134330",
    "29338938",
    "20307913",
    "23344087",
    "23842453",
    "29858019",
    "24768118",
    "28939152",
    "22677909",
    "24105277",
    "24195468",
    "24555578",
    "17321325",
    "22335887",
    "20630828",
    "23493804",
    "28481359",
    "31561203",
    "22458769",
    "23683536",
    "20430469",
    "27900369",
    "17047648",
    "15623594",
    "17473653",
    "21575252",
    "29673089",
    "25179728",
    "19010870",
    "25615851",
    "20146086",
    "23733853",
    "24303521",
    "23468066",
    "22722798",
    "21516483",
    "20502057",
    "23337026",
    "19276259",
    "21411993",
    "24568474",
    "24412619",
    "21422421",
    "22993320",
    "24707260",
    "16467080",
    "26273378",
    "19117057",
    "23262782",
    "22263102",
    "24279718",
    "31534501",
    "23079729",
    "15710947",
    "17201173",
    "17409961",
    "23486266",
    "23579627",
    "22052230",
    "23242437",
    "18751405",
    "20186026",
    "23425899",
    "26870223",
    "18334834",
    "23392229",
    "16870303",
    "16921488",
    "23033341",
    "20705455",
    "25120214",
    "22142557",
    "25189529",
    "16564920",
    "21622214",
    "24179496",
    "29681454",
    "16983123",
    "22710815",
    "21224376",
    "23024022",
    "15625379",
    "22733594",
    "23134665",
    "23796143",
    "16567021",
    "29704676",
    "23507602",
    "24758910",
    "18083107",
    "17548126",
    "25345567",
    "20548248",
    "25726043",
    "22588158",
    "19704257",
    "20855837",
    "23714228",
    "21498706",
    "17699786",
    "15899142",
    "22674612",
    "18448998",
    "23540867",
    "24835218",
    "19881244",
    "19963121",
    "22896669",
    "19241901",
    "30121391",
    "23313172",
    "17565015",
    "22325357",
    "22045881",
    "18794545",
    "22983065",
    "20828860",
    "19020901",
    "23629442",
    "20624322",
    "27998968",
    "25456362",
    "24440279",
    "24403481",
    "24443522",
    "18640945",
    "21986139",
    "20668451",
    "23435014",
    "27257132",
    "22806307",
    "23590575",
    "21129809",
    "16552419",
    "29245278",
    "15118125",
    "17335935",
    "22102479",
    "21274533",
    "22975805",
    "24789720",
    "22504767",
    "15870831",
    "25328676",
    "27247954",
    "25202264",
    "21917678",
    "28625641",
    "16052537",
    "24894944",
    "18087280",
    "22209037",
    "22005476",
    "22895145",
    "27105424",
    "23136191",
    "23963360",
    "21368495",
    "22333630",
    "24457318",
    "21167064",
    "18021415",
    "18000506",
    "24990411",
    "16382114",
    "19596957",
    "26612314",
    "24842519",
    "19137110",
    "21168239",
    "22836650",
    "16115929",
    "24419411",
    "22848293",
    "16533793",
    "17685931",
    "21080748",
    "24353160",
    "22313637",
    "23088930",
    "23470965",
    "19692680",
    "23352033",
    "23969006",
    "21508367",
    "24300726",
    "16863509",
    "17192902",
    "17649787",
    "16740761",
    "18449007",
    "15780185",
    "19640859",
    "22190593",
    "2494",
    "17487277",
    "22726919",
    "21858063",
    "23362162",
    "23495083",
    "22899358",
    "17047397",
    "23986053",
    "22082647",
    "15728811",
    "27105513",
    "25112956",
    "15681531",
    "23419122",
    "20035424",
    "17062680",
    "19155283",
    "17150109",
    "25264883",
    "17626639",
    "17145836",
    "21623281",
    "21318227",
    "19272767",
    "21626329",
    "26199566",
    "21030925",
    "23408463",
    "22826471",
    "16198442",
    "21408138",
    "20473935",
    "23466741",
    "21107288",
    "20855820",
    "16144918",
    "17945377",
    "19724844",
    "21498705",
    "25687872",
    "27304188",
    "17761979",
    "24169259",
    "18379370",
    "24813888",
    "23052173",
    "19584155",
    "18594314",
    "19096302",
    "23807543",
    "22753836",
    "21111508",
    "22185996",
    "18789554",
    "19884551",
    "21945923",
    "21135146",
    "23645738",
    "19059670",
    "29483495",
    "20409020",
    "20682976",
    "22157931",
    "19692773",
    "24852875",
    "17228019",
    "21830212",
    "21372829",
    "21635547",
    "22753908",
    "24236184",
    "24866168",
    "19755773",
    "23275780",
    "24828666",
    "19096301",
    "22225786",
    "15604253",
    "20615575",
    "20008635",
    "23021771",
    "18785203",
    "19002495",
    "16407879",
    "21290211",
    "24137465",
    "15788655",
    "22220151",
    "23621221",
    "21151896",
    "23790173",
    "23212424",
    "15851406",
    "18186961",
    "21921847",
    "21575212",
    "22133747",
    "18258923",
    "17410004",
    "22228822",
    "2530",
    "22964709",
    "23907151",
    "23139670",
    "17508947",
    "23683537",
    "18985444",
    "17387741",
    "19723643",
    "19844187",
    "17368623",
    "18317075",
    "19692934",
    "21052000",
    "23945384",
    "24501009",
    "16827805",
    "24653640",
    "25152623",
    "24729716",
    "22740920",
    "19096323",
    "17409866",
    "20837450",
    "16467085",
    "17848912",
    "21729655",
    "15738541",
    "23341890",
    "23014527",
    "18261621",
    "17504988",
    "18090579",
    "17180521",
    "24197981",
    "22329199",
    "22858793",
    "21102258",
    "23800712",
    "21982684",
    "24742923",
    "26729443",
    "20881644",
    "21252721",
    "18676744",
    "17941001",
    "19671738",
    "21881358",
    "21899495",
    "20150826",
    "23410901",
    "19884861",
    "20155428",
    "21102267",
    "21681119",
    "22622260",
    "24676429",
    "24649318",
    "18450321",
    "20018398",
    "23892415",
    "15329413",
    "23783797",
    "25130612",
    "24453288",
    "31422893",
    "17060940",
    "18325048",
    "16002952",
    "18676761",
    "15761868",
    "23261230",
    "21062932",
    "17618013",
    "22947115",
    "20975376",
    "24810493",
    "19096324",
    "22560922",
    "17020982",
    "24419415",
    "21317745",
    "21769434",
    "24419753",
    "17332333",
    "18992959",
    "21062933",
    "23211219",
    "18418018",
    "20459863",
    "21030498",
    "17085664",
    "24788590",
    "22797155",
    "24336155",
    "18379357",
    "21227397",
    "24468202",
    "22302407",
    "24908064",
    "21949883",
    "22975558",
    "22722787",
    "24811487",
    "22673630",
    "17695517",
    "21995391",
    "22579408",
    "24429877",
    "20855974",
    "17784875",
    "29368620",
    "23559152",
    "24722163",
    "19362747",
    "17051834",
    "24126395",
    "25521406",
    "23434352",
    "29721166",
    "24942894",
    "22006985",
    "19088172",
    "24965407",
    "23403410",
    "24570539",
    "16785471",
    "21944773",
    "21497370",
    "22157369",
    "20823418",
    "16052218",
    "16733218",
    "20559149",
    "17561305",
    "33420836",
    "24040454",
    "21788562",
    "23919423",
    "20871266",
    "17001163",
    "24389444",
    "26164066",
    "29100434",
    "26599269",
    "23154768",
    "23439505",
    "25056302",
    "22836289",
    "22005472",
    "21532509",
    "20423982",
    "21841502",
    "25047674",
    "24449147",
    "21622546",
    "23261229",
    "21132006",
    "18441512",
    "25103305",
    "16234532",
    "18458038",
    "21610522",
    "24034463",
    "23566546",
    "19276157",
    "18948947",
    "22457323",
    "23897956",
    "24675505",
    "25904052",
    "19517135",
    "17537621",
    "25227801",
    "18478265",
    "24707263",
    "29731638",
    "20637128",
    "17284372",
    "20813423",
    "21315472",
    "24055406",
    "17904685",
    "17908804",
    "23709419",
    "17505415",
    "15492241",
    "20808254",
    "21572125",
    "21469767",
    "24376723",
    "24389445",
    "17409930",
    "24369725",
    "21681971",
    "17785547",
    "26340530",
    "24251405",
    "28762784",
    "23721103",
    "22712764",
    "22980975",
    "23334261",
    "18089646",
    "21273060",
    "20491778",
    "17409975",
    "21707848",
    "16931592",
    "31558231",
    "20207772",
    "16613660",
    "17577030",
    "24992725",
    "22333382",
    "16014893",
    "21573178",
    "29335443",
    "24029120",
    "24594201",
    "23486275",
    "15118073",
    "16775247",
    "21591457",
    "23788756",
    "25146938",
    "27588476",
    "25282218",
    "23449277",
    "22129360",
    "25300933",
    "23621919",
    "16140420",
    "17596643",
    "25567908",
    "22696596",
    "23945392",
    "23070249",
    "19381876",
    "20498546",
    "17917328",
    "23609009",
    "22912354",
    "20837451",
    "20803030",
    "24109538",
    "16105816",
    "17625570",
    "27499993",
    "22087096",
    "18946755",
    "24700479",
    "20811949",
    "23912954",
    "24482415",
    "26647728",
    "27056568",
    "24370197",
    "25674250",
    "22670114",
    "19589612",
    "18172267",
    "24793378",
    "24357744",
    "19179899",
    "25120716",
    "19763916",
    "18628075",
    "23036980",
    "21040950",
    "23961259",
    "21430269",
    "18448997",
    "23371856",
    "24511003",
    "19724887",
    "22483783",
    "17573511",
    "25376516",
    "17410005",
    "25117816",
    "23619604",
    "27545006",
    "19487967",
    "20009465",
    "22449692",
    "29110841",
    "20009914",
    "16043828",
    "20592359",
    "22019513",
    "29773459",
    "17889960",
    "24722155",
    "23320223",
    "24994671",
    "18403609",
    "25076254",
    "29616327",
    "19369630",
    "27923066",
    "20522446",
    "16619582",
    "26862733",
    "22071596",
    "18176089",
    "19238633",
    "24396447",
    "31386689",
    "18957054",
    "18271876",
    "20489150",
    "18827621",
    "19643949"
  ],
  "resourceScore": 1,
  "datasourceId": "cancer_gene_census",
  "datatypeId": "somatic_mutation",
  "targetFromSourceId": "ENSG00000146648",
  "studyId": "417",
  "diseaseFromSourceMappedId": "EFO_0000571"
}

d0choa commented 9 months ago

It might be a coincidence, but there are exactly 100 records between the first potential error and the second. This could point to some sort of pagination error.

DSuveges commented 9 months ago

Looking at the issue a bit deeper, I would say this is probably a purely data-related issue affecting only these two PMIDs. No other evidence has such "truncated" PMIDs. The fact that the distance between the two entries is 100 is most likely by chance (this entry has 600+ PMIDs, and there are a number of evidence records with multiple hundreds of supporting publications).

from pyspark.sql import functions as f, types as t

# Assumes an active SparkSession named `spark` (e.g. a notebook or the pyspark shell).

@f.udf(
    t.ArrayType(
        t.StructType([
            t.StructField('pmid', t.StringType(), True),
            t.StructField('index', t.IntegerType(), True)
        ])
    )
)
def get_short_index(a):
    # Get a list of short pmids + their index in the literature array:
    return [{'pmid': v, 'index': i} for i, v in enumerate(a) if len(v) <= 4]

(
    spark.read.json('gs://open-targets-pre-data-releases/23.09/input/evidence-files/cosmic.json.gz')
    .filter(f.col('literature').isNotNull())
    .select(
        'diseaseFromSourceMappedId',
        'targetFromSourceId',
        f.size(f.col('literature')).alias('litCount'),
        get_short_index(f.col('literature')).alias('shortPmids')
    )
    # Filter for evidence that contain "truncated" pmids:
    .filter(f.size(f.col('shortPmids')) > 0)
    .orderBy(f.col('litCount').desc())
    .show(truncate=False)
)

This yields a single evidence record:

+-------------------------+------------------+--------+--------------------------+
|diseaseFromSourceMappedId|targetFromSourceId|litCount|shortPmids                |
+-------------------------+------------------+--------+--------------------------+
|EFO_0000571              |ENSG00000146648   |642     |[{2494, 227}, {2530, 328}]|
+-------------------------+------------------+--------+--------------------------+

I can email the COSMIC team to investigate/fix the truncated pmids (if that's the cause) and update the ticket. I would not be overly concerned.
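The spacing between the two hits can be checked directly from the indices in the table above. The sketch below also lists where further hits would fall if the fault recurred at that same fixed spacing. That fixed-period model is only an assumption for illustration; the thread does not establish it. `periodic_positions` is a hypothetical helper.

```python
def periodic_positions(first, second, length):
    """Indices a fault would hit if it recurred with period (second - first)."""
    step = second - first
    return list(range(first % step, length, step))


# Values taken from the table: two short PMIDs at indices 227 and 328, 642 entries in total.
first_idx, second_idx, lit_count = 227, 328, 642

step = second_idx - first_idx  # 101
between = step - 1             # 100 entries sit between the two short PMIDs

expected = periodic_positions(first_idx, second_idx, lit_count)
observed = {first_idx, second_idx}
missing = [i for i in expected if i not in observed]

print(step, between)  # prints 101 100
print(missing)        # prints [25, 126, 429, 530, 631]
```

Under that assumed model, five more positions would be affected in a 642-entry list, but only two short PMIDs were found. So the 100-entry gap on its own does not look like a simple repeating fault.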

buniello commented 9 months ago

I have emailed COSMIC and described the issue. Will keep you updated.

buniello commented 8 months ago

Reply from COSMIC:

Dear Annalisa, 

Zbyslaw had a look at this, and apparently oracle truncates
extremely long lists of pubmed ids when grouping. We are still working on this,
but now the source of the issue has been identified we should have this fixed
for the November release (v99).

Thanks for highlighting this. 
Kind regards, Dave
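For illustration only, here is one way a cut-off during string aggregation could produce a plausible-looking short identifier. The thread does not describe the actual Oracle behaviour. `join_with_limit` and the 22-character limit are hypothetical, and 24942490 is only the guess made earlier in this issue.

```python
def join_with_limit(ids, limit, sep=","):
    """Join ids with sep, then silently cut the result at `limit` characters.

    This mimics a hypothetical aggregation step with a hard length cap;
    it is not a model of any specific database function.
    """
    return sep.join(ids)[:limit]


ids = ["24575772", "24362878", "24942490"]

# The full string is 26 characters; a cap of 22 cuts the last PMID after 4 digits.
joined = join_with_limit(ids, limit=22)
recovered = joined.split(",")

print(joined)     # prints 24575772,24362878,2494
print(recovered)  # prints ['24575772', '24362878', '2494']
```

The cut-off value "2494" is still all digits, so a purely numeric format check would not reject it. A length-based check, or a lookup against PubMed metadata, is what catches this kind of error.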

DSuveges commented 8 months ago

As of 20 November, there has been no data update from COSMIC; the most recent file is from June this year, so I could not confirm that the issue is resolved.

buniello commented 7 months ago

A new version of COSMIC was released on 28/11/23. The data in the next submission should be fixed.

prashantuniyal02 commented 3 months ago

Hi @DSuveges, can we close this issue if the error has been fixed with the latest 24.03 release?

DSuveges commented 2 months ago

Yes, the new data is good.