Matches in literature data disappear at ETL

DSuveges commented 1 year ago

Describe the bug

There are roughly 244M more matches in the raw data than in the normalized (grounded) and failed datasets combined.

Failed matches count: 247_332_452
Grounded matches count: 444_674_668
All matches count in raw data: 935_616_523
Lost matches count: 243_609_403

Generating count

```python
# Assumes an active SparkSession available as `spark`.
import pyspark.sql.functions as f

# Matches that failed grounding:
fmc = spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/failedMatches/').count()

# Successfully grounded matches:
mc = spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/literature/matches').count()

# Total number of matches in the raw data (sum of per-publication counts):
tmc = (
    spark.read.parquet('gs://ot-team/dsuveges/empc_match_cooc_counts')
    .select(f.sum(f.col('match_count')).alias('sum'))
    .collect()[0]['sum']
)

print(f'Failed matches count: {fmc}')
print(f'Grounded matches count: {mc}')
print(f'All matches count in raw data: {tmc}')
print(f'Lost matches count: {tmc - fmc - mc}')
```
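As a quick sanity check, the lost-match figure can be reproduced from the three reported counts without Spark. This is only arithmetic on the numbers quoted above; the variable names are illustrative and nothing is recomputed from the data.

```python
# Counts copied from the report above (not recomputed from the datasets).
failed_matches = 247_332_452    # fmc
grounded_matches = 444_674_668  # mc
raw_matches = 935_616_523       # tmc

# Matches present in the raw data but in neither ETL output.
lost_matches = raw_matches - failed_matches - grounded_matches
lost_fraction = lost_matches / raw_matches

print(f'Lost matches count: {lost_matches:_}')
print(f'Lost fraction of raw matches: {lost_fraction:.1%}')
```

So roughly a quarter of the raw matches are unaccounted for, before correcting for any double counting between abstracts and full texts.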

Some of this can be explained by matches from abstracts being counted twice in the raw data counts (when the full text is also available), but there are examples (e.g. 33369735) where an entire publication with 60 matches disappears completely.

Investigating example

```python
# Example publication discussed above:
pmid = '33369735'

# Looking up publication in failed matches:
(
    spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/failedMatches/')
    .filter(f.col('pmid') == pmid)
    .select('pmid', 'type', 'kind', 'section', 'label')
    .show()
)  # Returns empty

# Looking up publication in matches:
(
    spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/literature/matches')
    .filter(f.col('pmid') == pmid)
    .select('pmid', 'type', 'kind', 'section', 'label')
    .show()
)  # Returns empty

# Looking up publication in the raw dataset:
(
    spark.read.parquet('gs://ot-team/dsuveges/empc_match_cooc_counts')
    .filter(f.col('pmid') == pmid)
    .show()
)  # Returns one entry
```

Raw dataset:

```
+--------+-----+----------+-----------+----------+
|    pmid|pmcid|   pubdate|match_count|cooc_count|
+--------+-----+----------+-----------+----------+
|33369735| null|2021-01-09|         60|        27|
+--------+-----+----------+-----------+----------+
```

- Location of the raw data: `gs://otar025-epmc/ml02/abstract/2022_06_01/NMP_patch-31-05-2022-0.jsonl`
- Extracted examples with matches and cooccurrences: [link](https://jsoneditoronline.org/#left=cloud.5e6be4888e3843eb9f880c5e904d4893)

The desired behaviour of the literature ETL is that every incoming match/cooccurrence is represented in at least one of the outputs generated by the pipeline.
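To make the desired behaviour concrete, here is a minimal sketch of a per-publication conservation check in plain Python. The function name `unaccounted_matches` and the dictionary inputs are hypothetical and not part of the ETL; the check also ignores the abstract/full-text double counting mentioned above, so a real check would need to deduplicate the raw counts first.

```python
def unaccounted_matches(raw_counts, grounded_counts, failed_counts):
    """Return {pmid: missing} for publications whose raw matches are not
    fully represented in the grounded and failed outputs combined."""
    missing = {}
    for pmid, raw in raw_counts.items():
        accounted = grounded_counts.get(pmid, 0) + failed_counts.get(pmid, 0)
        if accounted < raw:
            missing[pmid] = raw - accounted
    return missing


# Toy input: 33369735 is the example from this issue (60 raw matches, absent
# from both outputs). The other pmid and all of its counts are made up.
raw = {'33369735': 60, 'hypothetical-1': 10}
grounded = {'hypothetical-1': 7}
failed = {'hypothetical-1': 3}

print(unaccounted_matches(raw, grounded, failed))
```

With the toy input, only 33369735 is reported, with all 60 of its matches unaccounted for.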

DSuveges commented 1 year ago

Is there any obscure, less obvious filter applied at publication level in the ETL that we are not aware of?

DSuveges commented 1 year ago

There are 2.3M unique pmids that have non-zero matches in the input data but cannot be found in either the post-ETL matches or the failed matches dataset.

Code

```python
publication_w_matches = (
    spark.read.parquet('gs://ot-team/dsuveges/empc_match_cooc_counts')
    .filter(
        # Filter for publication with pmid only:
        f.col('pmid').isNotNull() &
        # Filter for publication with at least one match
        (f.col('match_count') > 0)
    )
    .select('pmid', f.concat_ws('/', f.col('input')).alias('rawFile'))
    .groupBy('pmid')
    .agg(f.collect_set(f.col('rawFile')).alias('rawFiles'))
)

etl_output = (
    spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/failedMatches/')
    .filter(f.col('pmid').isNotNull())
    .select('pmid', f.lit(True).alias('failed'))
    .distinct()
    .join(
        (
            spark.read.parquet('gs://open-targets-pre-data-releases/23.06.littest/output/etl/parquet/literature/matches')
            .filter(f.col('pmid').isNotNull())
            .select('pmid', f.lit(True).alias('matches'))
            .distinct()
        ),
        on='pmid',
        how='outer'
    )
    .select(
        'pmid',
        f.when(
            f.col('matches').isNull(), 'failed',
        ).when(
            f.col('failed').isNull(), 'matches',
        ).otherwise('both').alias('source')
    )
)

joined = (
    etl_output
    .join(publication_w_matches, on='pmid', how='outer')
    .persist()
)

joined.filter(f.col('source').isNull()).show(truncate=False)
```

Examples:

+--------+------+--------------------------------------------------------------------------+
|pmid    |in etl|rawFiles                                                                  |
+--------+------+--------------------------------------------------------------------------+
|1000888 |null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-695.jsonl]    |
|10023464|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-494.jsonl]    |
|10024167|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-102.jsonl]    |
|10028210|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-457.jsonl]    |
|10037815|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-584.jsonl]    |
|10051645|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-770.jsonl]    |
|10063669|null  |[gs://otar025-epmc/ml02/abstract/2023_01_31/NMP_patch-30-01-2023-24.jsonl]|
|10069807|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-115.jsonl]    |
|10069813|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-376.jsonl]    |
|10074101|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-450.jsonl]    |
|10074198|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-457.jsonl]    |
|10075916|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-456.jsonl]    |
|100768  |null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-557.jsonl]    |
|10084526|null  |[gs://otar025-epmc/ml02/abstract/2023_01_31/NMP_patch-30-01-2023-12.jsonl]|
|10084986|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-346.jsonl]    |
|10085228|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-159.jsonl]    |
|10091655|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-633.jsonl]    |
|10097392|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-35.jsonl]     |
|10101196|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-482.jsonl]    |
|1011171 |null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-731.jsonl]    |
+--------+------+--------------------------------------------------------------------------+

Examples where full text is also available:

+--------+------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|pmid    |source|rawFiles                                                                                                                                            |
+--------+------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|25397466|null  |[gs://otar025-epmc/ml02/fulltext/2022_10_18/NMP_patch-17-10-2022-200.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-538.jsonl]   |
|25475197|null  |[gs://otar025-epmc/ml02/fulltext/2022_10_18/NMP_patch-17-10-2022-217.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-744.jsonl]   |
|25815452|null  |[gs://otar025-epmc/ml02/fulltext/2022_10_19/NMP_patch-18-10-2022-191.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-170.jsonl]   |
|30323295|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-303.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-398.jsonl]        |
|30704369|null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-534.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-374.jsonl]        |
|31300348|null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-3052.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-797.jsonl]       |
|32977467|null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2031.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-34.jsonl]        |
|33365772|null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-1697.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-141.jsonl]       |
|33926400|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-317.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-202.jsonl]        |
|34315510|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-15.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2854.jsonl]        |
|34790526|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-119.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-913.jsonl]        |
|34960468|null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-3209.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-388.jsonl]       |
|35831390|null  |[gs://otar025-epmc/ml02/fulltext/2022_11_16/NMP_patch-15-11-2022-249.jsonl, gs://otar025-epmc/ml02/abstract/2022_07_19/NMP_patch-18-07-2022-5.jsonl]|
|9197536 |null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2415.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-385.jsonl]       |
|17290646|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-493.jsonl, gs://otar025-epmc/ml02/fulltext/2022_10_18/NMP_patch-17-10-2022-181.jsonl]   |
|26576811|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-110.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-3013.jsonl]       |
|2667972 |null  |[gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2983.jsonl, gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-83.jsonl]        |
|28770595|null  |[gs://otar025-epmc/ml02/abstract/2023_02_01/NMP_patch-31-01-2023-38.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2561.jsonl]   |
|2999160 |null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-496.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-1838.jsonl]       |
|30550578|null  |[gs://otar025-epmc/ml02/abstract/2022_05_27/NMP_patch-total-158.jsonl, gs://otar025-epmc/ml02/fulltext/2022_03_23/NMP_patch-total-2700.jsonl]       |
+--------+------+----------------------------------------------------------------------------------------------------------------------------------------------------+
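For readers without access to the buckets, the classification done by the join in this comment can be sketched on plain Python sets. `classify_pmids` is a hypothetical helper, not ETL code, and it only looks at pmids present in the raw data, whereas the outer join also keeps pmids that appear only in the ETL output.

```python
def classify_pmids(raw_pmids, matches_pmids, failed_pmids):
    """Map each raw pmid to 'matches', 'failed', 'both', or None (missing)."""
    result = {}
    for pmid in raw_pmids:
        in_matches = pmid in matches_pmids
        in_failed = pmid in failed_pmids
        if in_matches and in_failed:
            result[pmid] = 'both'
        elif in_matches:
            result[pmid] = 'matches'
        elif in_failed:
            result[pmid] = 'failed'
        else:
            # Present in the raw data, absent from both ETL outputs.
            result[pmid] = None
    return result


# 1000888 is one of the missing pmids listed above; the other ids are made up.
raw_pmids = {'1000888', 'hyp-a', 'hyp-b', 'hyp-c'}
matches_pmids = {'hyp-a', 'hyp-c'}
failed_pmids = {'hyp-b', 'hyp-c'}

classified = classify_pmids(raw_pmids, matches_pmids, failed_pmids)
missing = sorted(p for p, source in classified.items() if source is None)
print(missing)
```

The pmids with a `None` source correspond to the rows shown in the tables above.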

DSuveges commented 1 year ago

Exploring disappearing publications in the old releases

I was wondering if the lost matches were present in the ETL output of previous releases. If so, that would imply a stochastic effect; if the same publications were missing, that would imply some sort of rule-based effect. The analysis was difficult because we don't have failed matches from previous releases, so I had to rely on the successfully grounded entities of the matches dataset.

Code

```python
# Reading matches dataset from the previous release:
etl_output_old = (
    spark.read.parquet('gs://open-targets-data-releases/23.02/output/etl/parquet/literature/matches/')
    .filter(f.col('pmid').isNotNull())
    .select('pmid', f.lit('matches').alias('source'))
    .distinct()
)

joined_old = (
    # Extract missing publications:
    joined.filter(f.col('source').isNull())
    .select('pmid', f.lit('missing').alias('new_release'))
    .join(etl_output_old, on='pmid', how='left')
    .persist()
)

joined_old.filter(f.col('source').isNotNull()).show()
joined_old.filter(f.col('source').isNotNull()).count()
```

+--------+-----------+-----------+
|    pmid|new_release|old_release|
+--------+-----------+-----------+
| 1018100|    missing|      found|
|10395270|    missing|      found|
|10624003|    missing|      found|
|11150631|    missing|      found|
|11235943|    missing|      found|
| 1157038|    missing|      found|
|11666000|    missing|      found|
|12010164|    missing|      found|
|12062199|    missing|      found|
| 1209214|    missing|      found|
|12236686|    missing|      found|
|12684197|    missing|      found|
|12962678|    missing|      found|
|14675585|    missing|      found|
|15359638|    missing|      found|
|16110478|    missing|      found|
|16279325|    missing|      found|
|16329786|    missing|      found|
|16524394|    missing|      found|
|17209885|    missing|      found|
+--------+-----------+-----------+

The test shows that 34k of the currently missing publications had matches that were successfully grounded in the previous release and can be found in its matches dataset. Compared to the 2.3M missing publications this is only a small fraction (roughly 1.5%); however, the failed matches of the previous release are not available, and the problem most likely persisted across past releases.

mbdebian commented 2 months ago

@remo87, would you mind checking with @DSuveges on the latest metrics for this? Thanks!

DSuveges commented 2 months ago

@mbdebian I'm afraid this problem still persists. Comparing the pmids in the matches datasets of 23.02 and 24.03, I got this:

+----------+--------+
|    source|   count|
+----------+--------+
|23.02 only| 1268782|
|      both|11348715|
|24.03 only| 6994834|
+----------+--------+

Although 24.03 gains about 7M pmids, about 1.27M pmids that were present in 23.02 are gone. How deep do you want us to dig into this issue?
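For reference, a minimal sketch of how such a comparison table can be derived from two sets of pmids, followed by the share of 23.02 pmids that are absent from 24.03 according to the counts above. `compare_releases` is a hypothetical helper and the toy pmids are made up; only the three counts come from the table.

```python
from collections import Counter


def compare_releases(old_pmids, new_pmids, old_label='23.02', new_label='24.03'):
    """Count pmids found only in the old release, only in the new one, or in both."""
    counts = Counter()
    for pmid in old_pmids | new_pmids:
        if pmid in old_pmids and pmid in new_pmids:
            counts['both'] += 1
        elif pmid in old_pmids:
            counts[f'{old_label} only'] += 1
        else:
            counts[f'{new_label} only'] += 1
    return counts


# Toy sets with made-up pmids, only to show the shape of the result.
old = {'a', 'b', 'c'}
new = {'b', 'c', 'd', 'e'}
toy_counts = compare_releases(old, new)
print(dict(toy_counts))

# Share of 23.02 pmids missing from 24.03, using the counts reported above.
old_only, both = 1_268_782, 11_348_715
dropped_share = old_only / (old_only + both)
print(f'Share of 23.02 pmids missing from 24.03: {dropped_share:.1%}')
```

By this measure about one in ten pmids from the 23.02 matches dataset is absent from 24.03.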

Matches in literature data disappear at ETL #2976