Closed by d0choa 2 months ago
It might be a coincidence, but there are exactly 100 records between the first potential error and the second. This could point to some sort of pagination error.
Looking at the issue a bit deeper, I would say this is probably a purely data-related issue affecting only these two pmids. None of the other evidence has such "truncated" pmids. The fact that the distance between the two entries is 100 is most likely by chance (this entry has 600+ pmids, and there are a number of evidence records with multiple hundreds of supporting publications).
from pyspark.sql import functions as f, types as t

# Assumes an active SparkSession named `spark`.

@f.udf(
    t.ArrayType(
        t.StructType([
            t.StructField('pmid', t.StringType(), True),
            t.StructField('index', t.IntegerType(), True)
        ])
    )
)
def get_short_index(a):
    # Get a list of short pmids + their index in the literature array:
    return [{'pmid': v, 'index': i} for i, v in enumerate(a) if len(v) <= 4]

(
    spark.read.json('gs://open-targets-pre-data-releases/23.09/input/evidence-files/cosmic.json.gz')
    .filter(f.col('literature').isNotNull())
    .select(
        'diseaseFromSourceMappedId',
        'targetFromSourceId',
        f.size(f.col('literature')).alias('litCount'),
        get_short_index(f.col('literature')).alias('shortPmids')
    )
    # Filter for evidence that contain "truncated" pmids:
    .filter(f.size(f.col('shortPmids')) > 0)
    .orderBy(f.col('litCount').desc())
    .show(truncate=False)
)
This yields a single evidence record:
+-------------------------+------------------+--------+--------------------------+
|diseaseFromSourceMappedId|targetFromSourceId|litCount|shortPmids                |
+-------------------------+------------------+--------+--------------------------+
|EFO_0000571              |ENSG00000146648   |642     |[{2494, 227}, {2530, 328}]|
+-------------------------+------------------+--------+--------------------------+
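For anyone who wants to reproduce the check without a Spark session, here is a minimal plain-Python sketch of the same filter. The function name `find_short_pmids` and the placeholder array are hypothetical; only the two short values, their positions, and the total count of 642 come from the output above.

```python
def find_short_pmids(literature, max_len=4):
    """Return (index, pmid) pairs for identifiers of at most max_len characters."""
    return [
        (i, pmid)
        for i, pmid in enumerate(literature)
        if pmid is not None and len(pmid) <= max_len
    ]


# Hypothetical literature array: 642 placeholder 8-digit ids, with the two
# short values placed at the positions reported in the table above.
literature = ["10000000"] * 642
literature[227] = "2494"
literature[328] = "2530"

short = find_short_pmids(literature)
print(short)

# Difference between the two positions, and the number of records in between.
gap = short[1][0] - short[0][0]
print(gap, gap - 1)
```

The positions differ by 101, which means 100 records sit between the two short identifiers, matching the observation at the top of the thread.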
I can email the COSMIC team to investigate/fix the truncated pmids (if that's the cause) and update the ticket. I would not be overly concerned.
I have emailed COSMIC and described the issue. Will keep you updated.
Reply from COSMIC:
Dear Annalisa,
Zbyslaw had a look at this, and apparently oracle truncates
extremely long lists of pubmed ids when grouping. We are still working on this,
but now the source of the issue has been identified we should have this fixed
for the November release (v99).
Thanks for highlighting this.
Kind regards, Dave
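To illustrate the kind of failure described in the reply, here is a toy sketch of how cutting a delimited string at a fixed length can leave a shortened final identifier. This is purely an illustration and not COSMIC's actual query or Oracle's actual behaviour: the ids, the comma delimiter, the limit of 22 characters, and the function name `aggregate_with_limit` are all made up.

```python
def aggregate_with_limit(ids, limit, sep=","):
    """Join ids into one string, cut it at `limit` characters, and split it again.

    Hypothetical stand-in for a string aggregation with a hard length cap.
    """
    joined = sep.join(ids)
    return joined[:limit].split(sep)


# Made-up 8-digit identifiers; the joined string is 26 characters long.
ids = ["11111111", "22222222", "33333333"]

result = aggregate_with_limit(ids, limit=22)
print(result)
```

With these made-up values the last element comes back as a 4-character fragment of the original id, which still looks like a valid number and so would pass a naive format check.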
As of 20 November, there has been no data update from COSMIC. The most recent file is from June this year, so I could not confirm that the issue is resolved.
A new version of COSMIC was released on 28/11/23. The data in the next submission should be fixed.
Hi @DSuveges , can we close this issue if the error has been fixed with the latest 24.03 release?
Yes, the new data is good.
@polrus and @mjfalaguera found some (apparently) incorrect publication references in COSMIC records.
In this example entry, two records, "2494" and "2530", are valid PubMed records, but we don't think they contain any information related to the evidence. Rather, we think these are truncated identifiers. For example, we believe "2494" may have referred to 24942490 (COSMIC paper).
Could we?
Full record: